Praise for the First Edition
" . . . the book is a valuable addition to the literature in the field, serving as a much-needed guide for both clinicians and advanced students."—Zentralblatt MATH
A new edition of the cutting-edge guide to diagnostic tests in medical research
In recent years, a considerable amount of research has focused on evolving methods for designing and analyzing diagnostic accuracy studies. Statistical Methods in Diagnostic Medicine, Second Edition continues to provide a comprehensive approach to the topic, guiding readers through the necessary practices for understanding these studies and generalizing the results to patient populations.
Following a basic introduction to measuring test accuracy and study design, the authors successfully define various measures of diagnostic accuracy, describe strategies for designing diagnostic accuracy studies, and present key statistical methods for estimating and comparing test accuracy. Topics new to the Second Edition include:
Methods for tests designed to detect and locate lesions
Recommendations for covariate-adjustment
Methods for estimating and comparing predictive values and sample size calculations
Correcting techniques for verification and imperfect standard biases
Sample size calculation for multiple reader studies when pilot data are available
Updated meta-analysis methods, now incorporating random effects
Three case studies thoroughly showcase some of the questions and statistical issues that arise in diagnostic medicine, with all associated data provided in detailed appendices. A related web site features Fortran, SAS®, and R software packages so that readers can conduct their own analyses.
Statistical Methods in Diagnostic Medicine, Second Edition is an excellent supplement for biostatistics courses at the graduate level. It also serves as a valuable reference for clinicians and researchers working in the fields of medicine, epidemiology, and biostatistics.
Number of pages: 925
Year of publication: 2014
CONTENTS
LIST OF FIGURES
LIST OF TABLES
0.1 PREFACE
0.2 ACKNOWLEDGEMENTS
PART I: BASIC CONCEPTS AND METHODS
CHAPTER 1: INTRODUCTION
1.1 DIAGNOSTIC TEST ACCURACY STUDIES
1.2 CASE STUDIES
1.3 SOFTWARE
1.4 TOPICS NOT COVERED IN THIS BOOK
CHAPTER 2: MEASURES OF DIAGNOSTIC ACCURACY
2.1 SENSITIVITY AND SPECIFICITY
2.2 COMBINED MEASURES OF SENSITIVITY AND SPECIFICITY
2.3 RECEIVER OPERATING CHARACTERISTIC (ROC) CURVE
2.4 AREA UNDER THE ROC CURVE
2.5 SENSITIVITY AT FIXED FPR
2.6 PARTIAL AREA UNDER THE ROC CURVE
2.7 LIKELIHOOD RATIOS
2.8 ROC ANALYSIS WHEN THE TRUE DIAGNOSIS IS NOT BINARY
2.9 C-STATISTICS AND OTHER MEASURES TO COMPARE PREDICTION MODELS
2.10 DETECTION AND LOCALIZATION OF MULTIPLE LESIONS
2.11 POSITIVE AND NEGATIVE PREDICTIVE VALUES, BAYES THEOREM, AND CASE STUDY 2
2.12 OPTIMAL DECISION THRESHOLD ON THE ROC CURVE
2.13 INTERPRETING THE RESULTS OF MULTIPLE TESTS
CHAPTER 3: DESIGN OF DIAGNOSTIC ACCURACY STUDIES
3.1 ESTABLISH THE OBJECTIVE OF THE STUDY
3.2 IDENTIFY THE TARGET PATIENT POPULATION
3.3 SELECT A SAMPLING PLAN FOR PATIENTS
3.4 SELECT THE GOLD STANDARD
3.5 CHOOSE A MEASURE OF ACCURACY
3.6 IDENTIFY TARGET READER POPULATION
3.7 SELECT SAMPLING PLAN FOR READERS
3.8 PLAN DATA COLLECTION
3.9 PLAN DATA ANALYSES
3.10 DETERMINE SAMPLE SIZE
CHAPTER 4: ESTIMATION AND HYPOTHESIS TESTING IN A SINGLE SAMPLE
4.1 BINARY-SCALE DATA
4.2 ORDINAL-SCALE DATA
4.3 CONTINUOUS-SCALE DATA
4.4 TESTING THE HYPOTHESIS THAT THE ROC CURVE AREA OR PARTIAL AREA IS A SPECIFIC VALUE
CHAPTER 5: COMPARING THE ACCURACY OF TWO DIAGNOSTIC TESTS
5.1 BINARY-SCALE DATA
5.2 ORDINAL- AND CONTINUOUS-SCALE DATA
5.3 TESTS OF EQUIVALENCE
CHAPTER 6: SAMPLE SIZE CALCULATIONS
6.1 STUDIES ESTIMATING THE ACCURACY OF A SINGLE TEST
6.2 SAMPLE SIZE FOR DETECTING A DIFFERENCE IN ACCURACIES OF TWO TESTS
6.3 SAMPLE SIZE FOR ASSESSING NON-INFERIORITY OR EQUIVALENCY OF TWO TESTS
6.4 SAMPLE SIZE FOR DETERMINING A SUITABLE CUTOFF VALUE
6.5 SAMPLE SIZE DETERMINATION FOR MULTI-READER STUDIES
6.6 ALTERNATIVE TO SAMPLE SIZE FORMULAE
CHAPTER 7: INTRODUCTION TO META-ANALYSIS FOR DIAGNOSTIC ACCURACY STUDIES
7.1 OBJECTIVES
7.2 RETRIEVAL OF THE LITERATURE
7.3 INCLUSION/EXCLUSION CRITERIA
7.4 EXTRACTING INFORMATION FROM THE LITERATURE
7.5 STATISTICAL ANALYSIS
7.6 PUBLIC PRESENTATION
PART II: ADVANCED METHODS
CHAPTER 8: REGRESSION ANALYSIS FOR INDEPENDENT ROC DATA
8.1 FOUR CLINICAL STUDIES
8.2 REGRESSION MODELS FOR CONTINUOUS-SCALE TESTS
8.3 REGRESSION MODELS FOR ORDINAL-SCALE TESTS
8.4 COVARIATE ADJUSTED ROC CURVES OF CONTINUOUS-SCALE TESTS
CHAPTER 9: ANALYSIS OF MULTIPLE READER AND/OR MULTIPLE TEST STUDIES
9.1 STUDIES COMPARING MULTIPLE TESTS WITH COVARIATES
9.2 STUDIES WITH MULTIPLE READERS AND MULTIPLE TESTS
9.3 ANALYSIS OF MULTIPLE TESTS DESIGNED TO LOCATE AND DIAGNOSE LESIONS
CHAPTER 10: METHODS FOR CORRECTING VERIFICATION BIAS
10.1 EXAMPLES
10.2 IMPACT OF VERIFICATION BIAS
10.3 A SINGLE BINARY-SCALE TEST
10.4 CORRELATED BINARY-SCALE TESTS
10.5 A SINGLE ORDINAL-SCALE TEST
10.6 CORRELATED ORDINAL-SCALE TESTS
10.7 CONTINUOUS-SCALE TESTS
CHAPTER 11: METHODS FOR CORRECTING IMPERFECT GOLD STANDARD BIAS
11.1 EXAMPLES
11.2 IMPACT OF IMPERFECT GOLD STANDARD BIAS
11.3 ONE SINGLE BINARY TEST IN A SINGLE POPULATION
11.4 ONE SINGLE BINARY TEST IN G POPULATIONS
11.5 MULTIPLE BINARY TESTS IN ONE SINGLE POPULATION
11.6 MULTIPLE BINARY TESTS IN G POPULATIONS
11.7 MULTIPLE ORDINAL-SCALE TESTS IN ONE SINGLE POPULATION
11.8 MULTIPLE-SCALE TESTS IN ONE SINGLE POPULATION
CHAPTER 12: STATISTICAL ANALYSIS FOR META-ANALYSIS
12.1 BINARY-SCALE DATA
12.2 ORDINAL- OR CONTINUOUS-SCALE DATA
12.3 ROC CURVE AREA
APPENDIX A: CASE STUDIES AND CHAPTER 8 DATA
APPENDIX B: JACKKNIFE AND BOOTSTRAP METHODS OF ESTIMATING VARIANCES AND CONFIDENCE INTERVALS
REFERENCES
INDEX
WILEY SERIES IN PROBABILITY AND STATISTICS
Established by WALTER A. SHEWHART and SAMUEL S. WILKS
Editors: David J. Balding, Noel A. C. Cressie, Garrett M. Fitzmaurice, Iain M. Johnstone, Geert Molenberghs, David W. Scott, Adrian F. M. Smith, Ruey S. Tsay, Sanford Weisberg
Editors Emeriti: Vic Barnett, J. Stuart Hunter, Joseph B. Kadane, Jozef L. Teugels
A complete list of the titles in this series appears at the end of this volume.
Copyright © 2011 by John Wiley & Sons, Inc. All rights reserved.
Published by John Wiley & Sons, Inc., Hoboken, New Jersey.
Published simultaneously in Canada.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permission.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.
Library of Congress Cataloging-in-Publication Data is available.
ISBN 978-0-470-18314-4
This book is dedicated to Yea-Jae, Ralph, and Tom
0.1 PREFACE
Diagnostic tests play a pivotal role in medicine, often determining what additional diagnostic tests, treatments, and interventions are needed and ultimately affecting patients' outcomes. Given the importance of this role, it is critical that clinicians are given reliable data about the accuracy of the diagnostic tests they order. These clinicians need well-designed diagnostic accuracy studies and they need to understand how the results of these studies apply to their patients. The purpose of this book, then, is two-fold: to provide a comprehensive approach to designing and analyzing diagnostic accuracy studies and to aid clinicians in understanding these studies and in generalizing study results to their patient populations.
Since the first edition, we have updated each chapter with recently published methods. These updates include new methods for tests designed to detect and locate lesions (see Chapters 2, 3, and 9), recommendations for the type of covariate-adjustment needed (Chapter 3) along with new methods for covariate-adjustment (Chapter 8), estimating and comparing predictive values (Chapters 4 and 5) and calculating sample size for studies using predictive values (Chapter 6), sample size calculation for multiple reader studies when pilot data are available (Chapter 6), new methods for correcting for verification bias in estimation of ROC curves of continuous-scale tests (Chapter 10), and new methods for correcting for imperfect standard bias in estimation of ROC curves of ordinal-scale or continuous-scale tests (Chapter 11).
We have also added three case studies: a positron emission tomography (PET) study comparing the accuracy of three tests for detecting diseased parathyroid glands, a computer-aided detection (CAD) study of colon polyps, and a magnetic resonance imaging study of atherosclerosis in the carotid arteries (see Chapter 1). The data from these case studies are provided in the Appendix and are used throughout the book as illustrations of various statistical methods.
The book is organized so that the more basic material about measures of test accuracy and study design appears first (Chapters 2 and 3, respectively), followed by chapters on statistical methods of data analysis, with real data examples to illustrate these methods. Chapters 4 and 5 illustrate methods for estimating accuracy and comparing tests' accuracies under a variety of study designs. Calculating the sample size required for a study is described in Chapter 6. Chapters 7 and 12 focus on the design and analysis of meta-analyses of diagnostic test accuracy. Chapters 8 and 9 look at models of diagnostic test accuracy for various patient subgroups and for multiple-reader studies, respectively. Corrections for estimates of test accuracy in studies with verification bias and imperfect gold standards are illustrated in Chapters 10 and 11. Chapters 1-3 are accessible to readers with a basic knowledge of statistical and medical terminology. Chapters 4-7 are geared to the data analyst with basic training in biostatistics. In Chapters 8-12 we provide more detailed statistical methodology for readers with more statistical training, but the examples in these chapters are accessible to all readers. The authors have prepared a Web site (http://faculty.washington.edu/azhou/books/diagnostic.html) that contains links to some useful software.
0.2 ACKNOWLEDGEMENTS
We are thankful to many colleagues for supporting us during the writing and publication of both the first (2002) and second (2011) editions of this book. Their helpful critiques and suggestions about the first edition have led to this improved second edition. In particular, we would like to thank Danping Liu and Zheyu Wang for their helpful comments on the manuscript and their computational assistance in implementing some of the methods discussed in the book. We would like to thank Dr. Thomas D. Koepsell for his helpful comments on the manuscript.
We would also like to thank our families for their understanding and encouragement. Dr. Xiao-Hua (Andrew) Zhou thanks his wife, Yea-Jae, and their children, Vanessa and Joshua. Dr. Nancy Obuchowski thanks her husband, Dr. Ralph Harvey, and their children, Tucker, Eli, and Scout. Dr. Donna McClish thanks her husband, Tom, and their daughter Amanda.
Diagnostic medicine is the process of identifying the disease, or condition, that a patient has, and ruling out conditions that the patient does not have, through assessment of the patient’s signs, symptoms, and results of various diagnostic tests. Diagnostic accuracy studies are research studies which examine the ability of diagnostic tests to discriminate between patients with and without the condition; these studies are the focus of this book.
A diagnostic test has several purposes: (1) to provide reliable information about the patient’s condition, (2) to influence the health care provider’s plan for managing the patient (Sox et al., 1989), and possibly, (3) to understand disease mechanism and natural history through research (e.g., the repeated testing of patients with chronic conditions) (McNeil and Adelstein, 1976). A test can serve these purposes only if the health care provider knows how to interpret it. Diagnostic test studies are conducted to tell us how diagnostic tests perform and, thus, how they should be interpreted. There are several measures of diagnostic test performance. Fryback and Thornbury (1991) described a hierarchical model for studying diagnostic performance for imaging tests. The model starts with image quality and progresses to diagnostic accuracy, effect on treatment decisions, impact on patient outcome, and finally costs to society. A key feature of the model is that for a diagnostic test to be efficacious at a higher level, it must be efficacious at all lower levels. The reverse is not true; for example, a new test may have better accuracy than a standard test but may be too costly (in terms of monetary expense and/or patient morbidity due to complications) to be efficacious. In this book, we deal exclusively with the assessment of diagnostic accuracy (level 2 of the hierarchical model), recognizing that it is only one step in the complete assessment of a diagnostic test.
Diagnostic test accuracy is simply the ability of the test to discriminate among alternative states of health (Zweig and Campbell, 1993). If a test’s results do not differ between alternative states of health, then the test has negligible accuracy; if the results do not overlap for the different health states, then the test has perfect accuracy. Most test accuracies fall between these two extremes. It’s important to recognize that a test result is not a true representation of the patient’s condition (Sox et al., 1989). Most diagnostic information is imperfect; it may influence the health care provider’s thinking, but uncertainty remains about the patient’s true condition. If the test is negative for the condition, should the health care provider assume that the patient is disease-free and thus send him or her home? If the test is positive, should the health care provider assume the patient has the condition and thus begin treatment? Finally, if the test result requires interpretation by a trained reader (e.g., a radiologist), should the health care provider seek a second interpretation?
To answer these critical questions, the health care provider needs to have information on the test’s absolute and relative capabilities and an understanding of the complex interactions between the test and the trained readers who interpret the imaging data (Beam, 1992). The health care provider must ask: How does the test perform among patients with the condition (i.e., the test’s sensitivity)? How does the test perform among patients without the condition (i.e., the test’s specificity)? Does the test serve as a replacement for an older test or should multiple tests be performed? If multiple tests are performed, how should they be executed (i.e., sequentially or in parallel)? How reproducible are interpretations by different readers? These sorts of questions are addressed in diagnostic test accuracy studies.
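As a concrete illustration of the first two questions, sensitivity and specificity can be computed directly from a 2×2 cross-tabulation of test results against the gold standard. The following sketch uses invented counts, not data from the book:

```python
def sensitivity_specificity(tp, fn, fp, tn):
    """Compute sensitivity and specificity from 2x2 counts.

    tp: diseased patients with a positive test (true positives)
    fn: diseased patients with a negative test (false negatives)
    fp: non-diseased patients with a positive test (false positives)
    tn: non-diseased patients with a negative test (true negatives)
    """
    sens = tp / (tp + fn)   # P(test positive | diseased)
    spec = tn / (tn + fp)   # P(test negative | not diseased)
    return sens, spec

# Hypothetical counts for illustration only: 50 diseased, 100 non-diseased.
sens, spec = sensitivity_specificity(tp=45, fn=5, fp=10, tn=90)
print(f"sensitivity={sens:.2f}  specificity={spec:.2f}")  # 0.90 and 0.90
```

Chapter 2 defines these measures formally and extends them to the ROC curve.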
Diagnostic test accuracy studies have three common features: a sample of subjects who have, or will, undergo one or more of the diagnostic medical tests under evaluation; some form of interpretation or scoring of the test’s findings; and a reference, or gold standard, to which the test findings are compared. This may sound simple enough, but diagnostic accuracy studies are difficult to design. Here are three common misperceptions about diagnostic test accuracy.
The first misperception involves the interpretation of diagnostic tests. Investigators of new diagnostic tests sometimes develop criteria for interpreting their tests based only on the findings from healthy volunteers. For example, in a new test to detect pancreatitis, investigators measure the amount of a certain enzyme in healthy volunteers. A typical decision criterion, or cutpoint, is three standard deviations (SDs) below the mean of the normals. New patients with an enzyme level of three SDs below the mean of the healthy volunteers are labeled “positive” for pancreatitis; patients with enzyme levels above this cutpoint are labeled “negative”. In proposing such a criterion, investigators fail to recognize (1) the relevance of the natural distributions of the test results (i.e. are they really Gaussian [normal]?); (2) the magnitude of any overlap between the test results of patients with and without pancreatitis (i.e. are the test results from most pancreatitis patients 3 SDs below the mean?); (3) the clinical significance of diagnostic errors (i.e. falsely labeling a patient without pancreatitis as “positive” for the condition and falsely labeling a patient with pancreatitis as “negative”); and (4) the poor generalization of results from studies based on healthy volunteers (i.e. healthy volunteers may have very different enzyme levels than sick patients without pancreatitis who might undergo the test). In Chapter 2, we discuss factors involved in determining optimal cutpoints for diagnostic tests; in Chapter 4, we discuss methods of finding optimal cutpoints and estimating diagnostic errors associated with them.
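The pitfall of a cutpoint chosen from healthy volunteers alone can be seen in a small simulation. All numbers below (means, SDs, sample sizes) are invented for illustration; the point is only that a 3-SD rule guarantees high specificity among the healthy but says nothing about sensitivity when the two distributions overlap:

```python
import random

random.seed(0)

# Hypothetical enzyme levels (arbitrary units): healthy volunteers vs.
# pancreatitis patients, with substantial overlap between the groups.
healthy = [random.gauss(100.0, 10.0) for _ in range(1000)]
diseased = [random.gauss(85.0, 10.0) for _ in range(1000)]

mean_h = sum(healthy) / len(healthy)
sd_h = (sum((x - mean_h) ** 2 for x in healthy) / (len(healthy) - 1)) ** 0.5
cutpoint = mean_h - 3 * sd_h  # "3 SDs below the mean of the normals"

# By construction, almost all healthy subjects fall above the cutpoint...
specificity = sum(x > cutpoint for x in healthy) / len(healthy)
# ...but most diseased subjects do too, so sensitivity is poor.
sensitivity = sum(x <= cutpoint for x in diseased) / len(diseased)
print(f"cutpoint={cutpoint:.1f}  specificity={specificity:.3f}  sensitivity={sensitivity:.3f}")
```

Under these assumed parameters the rule labels fewer than one diseased patient in five as "positive", despite near-perfect specificity.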
Another common misperception in diagnostic test studies is the notion that a rigorous assessment of a patient’s true condition - with the exclusion of patients for whom a less rigorous assessment was made - allows for a scientifically sound study. An example comes from literature on the use of ventilation-perfusion lung scans for diagnosing pulmonary emboli. The ventilation-perfusion lung scan is a noninvasive test used to screen high-risk patients for pulmonary emboli; its accuracy in various populations is unknown. Pulmonary angiography, on the other hand, is a highly accurate but invasive test. It is often used as a reference for assessing the accuracy of other tests. (See Chapter 2 for the definition and examples of gold standards.) To assess the accuracy of ventilation-perfusion lung scans, patients who have undergone both a ventilation-perfusion lung scan and a pulmonary angiogram are recruited, while patients who did not undergo the angiogram are excluded. Such a design usually leads to biased estimates of test accuracy. The reason is that the study sample is not representative of the patient population undergoing ventilation-perfusion lung scans - rather, patients with a positive scan are often recommended for angiograms, while patients with a negative scan are often not sent for an angiogram because of the risk of complications with it. In Chapter 3, we define work-up bias, and its most common form, verification bias, as well as strategies to avoid them. In Chapter 10, we present statistical methods developed specifically to correct for verification bias.
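The direction of verification bias can be demonstrated with a toy simulation (all rates below are invented, not estimates from the lung-scan literature): when test-positive patients are verified far more often than test-negative patients, the naive estimates computed from verified patients alone overstate sensitivity and understate specificity.

```python
import random

random.seed(1)

true_sens, true_spec, prevalence = 0.80, 0.90, 0.30  # assumed true values
n = 100_000

tp = fn = fp = tn = 0
for _ in range(n):
    diseased = random.random() < prevalence
    positive = random.random() < (true_sens if diseased else 1 - true_spec)
    # Verification depends on the screening result: positives nearly always
    # go on to the gold standard (e.g., angiography); negatives rarely do.
    verified = random.random() < (0.95 if positive else 0.10)
    if not verified:
        continue
    if diseased and positive:
        tp += 1
    elif diseased:
        fn += 1
    elif positive:
        fp += 1
    else:
        tn += 1

naive_sens = tp / (tp + fn)  # biased upward relative to true_sens = 0.80
naive_spec = tn / (tn + fp)  # biased downward relative to true_spec = 0.90
print(f"naive sensitivity={naive_sens:.3f}  naive specificity={naive_spec:.3f}")
```

Under these assumed verification rates, the naive sensitivity lands near 0.97 and the naive specificity near 0.49, far from the true 0.80 and 0.90.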
A third error common in diagnostic test accuracy studies involves confusion between accuracy and agreement. Investigators sometimes draw incorrect conclusions about a new test’s diagnostic accuracy because it agrees well with a conventional test; however, what if the new and conventional tests do not agree? We cannot simply conclude that the new test has inferior accuracy. In fact, a new test with superior accuracy will necessarily disagree with the conventional test at times. Similarly, the two tests may have the same accuracy but make mistakes on different patients, resulting in poor agreement. A more valid approach to assessing a new test’s diagnostic accuracy is to compare both tests against a gold standard reference. Assessment of diagnostic accuracy is usually more difficult than assessment of agreement, but it is a more relevant and valid approach (Zweig and Campbell, 1993). In Chapter 5, we present methods for comparing the accuracy of two tests when the true diagnoses of the patients are known; in Chapter 11 we present methods for comparing two tests’ accuracies when the true diagnoses are unknown.
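A tiny worked example (fabricated results, for illustration only) makes the accuracy-versus-agreement distinction concrete: two tests can be equally accurate yet disagree on a large fraction of patients, simply because they err on different patients.

```python
# Hypothetical results for 10 patients with known true status (1 = diseased).
truth  = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
test_a = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]  # 8/10 correct
test_b = [0, 1, 1, 1, 1, 1, 0, 0, 0, 0]  # also 8/10 correct, but errs on different patients

acc_a = sum(a == t for a, t in zip(test_a, truth)) / len(truth)
acc_b = sum(b == t for b, t in zip(test_b, truth)) / len(truth)
agreement = sum(a == b for a, b in zip(test_a, test_b)) / len(truth)
print(f"accuracy A={acc_a}  accuracy B={acc_b}  agreement={agreement}")
# Both tests are 80% accurate, yet they agree on only 60% of patients.
```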
There is no question that studies of diagnostic test accuracy are challenging to design and require specialized statistical methods for their analysis. We will present and illustrate concepts and methods for designing, analyzing, interpreting, and reporting studies of diagnostic test accuracy. In Part I (Chapters 2-7) we define various measures of diagnostic accuracy, describe strategies for designing diagnostic accuracy studies, and present the basic statistical methods for estimating and comparing test accuracy, calculating sample size, and synthesizing the literature for meta-analysis. In Part II (Chapters 8-12) we present more advanced statistical methods for describing a test’s accuracy when patient characteristics affect it, for analyzing multi-reader studies, studies with verification bias or imperfect gold standards, and for performing meta-analyses.
We introduce three diagnostic test accuracy studies to illustrate the kinds of designs, questions, and statistical issues that arise in diagnostic medicine. These case studies, along with many other examples, are used throughout the book to illustrate various statistical methods. The datasets for these case studies are given in the Appendix at the end of the book.
Parathyroid glands are small endocrine glands usually located in the neck or upper chest that produce a hormone that controls the body’s calcium levels. Most people have four parathyroid glands. In the most common form of parathyroid disease, one of these glands grows into a benign tumor, called a parathyroid adenoma, which produces excess amounts of parathyroid hormone. In a less common condition, called parathyroid hyperplasia, all four parathyroid glands become enlarged and secrete excess parathyroid hormone. In both conditions, a patient’s serum calcium levels become elevated, and the patient experiences loss of energy, depression, kidney stones, and headaches. Surgical removal of the offending parathyroid lesion is considered curative in most cases.
Single photon emission computed tomography (SPECT) using the radiopharmaceutical Tc-99m sestamibi is a nuclear medicine imaging test used to detect and localize parathyroid lesions prior to surgical intervention. In this prospective study (Donald Neumann, MD, PhD, Cleveland Clinic, Ohio, personal communication, 2007), 61 consecutive patients with hyperparathyroidism were imaged using a hybrid SPECT/CT instrument in an attempt to localize the diseased parathyroid glands preoperatively. Each patient underwent SPECT imaging, both with and without attenuation correction, as well as SPECT combined with CT imaging. Following imaging, the patients went to surgery to remove the diseased glands. The goal of the study was to compare the accuracy of these three tests.
One expert nuclear radiologist, blinded to the surgical findings, interpreted the images. On the SPECT imaging, each gland was scored on a scale from 1 to 7: 1=definitely no disease, 2=probably no disease, 3=indeterminate, 4=maybe diseased, and 5, 6, and 7 all definitely diseased, distinguished by the intensity of the attenuation (5=low, 6=medium, and 7=high). The SPECT/CT images were scored using just the 1-5 part of the scale. For this study, SPECT images scored 1-3 were considered negative and scores of 4-7 positive; for SPECT/CT, scores of 1-3 were considered negative and scores of 4-5 positive. A total of 97 glands in 61 patients were localized by imaging prior to parathyroid surgery, the results of which were considered the gold standard.
The investigators wanted to compare the sensitivity and specificity of these three tests to determine which single test should be used for future patients. In Chapter 2, we show that one of these tests appears more sensitive than the others, while another test appears more specific. A comparison of the tests’ Receiver Operating Characteristic (ROC) curves gives us a complete understanding of the strengths and weaknesses of the three tests and thus allows us to identify the most suitable test for preoperative patients.
The data from this study are complicated by the fact that many of the 61 patients had multiple glands visualized at screening, so called “clustered data.” Observations from the same patient, even if from different glands, are usually correlated, at least to some small degree. If we ignore this correlation, then the resulting confidence intervals and p-values can be misleading. In Chapters 4 and 5, we describe a simple analysis method that can be used for clustered data so that confidence intervals and p-values are correct.
Polyps that form in the colon or rectum can progress to cancer without any signs or symptoms. Computed tomography colonography (CTC) is an imaging test that can detect polyps before they develop into cancer. Radiologists sometimes overlook polyps on the CTC images, however, and these missed polyps (“false negatives”) can develop into cancer, which can lead to symptoms, even death. Investigators have developed a computer algorithm, called computer-aided detection (CAD), to help radiologists detect polyps on the CTC. The CAD utilizes tissue intensity, volumetric and surface shape, and texture characteristics to identify suspicious areas. The CAD marks the suspicious areas for the reader to examine more closely. Often, the CAD identifies multiple suspicious areas on the same image. The radiologist must distinguish marked areas that contain a polyp (“true positive”) from marked areas that do not contain a polyp, for example a folded bowel lining (“false positive”).
In this study (Baker et al., 2007), the investigators wanted to compare radiologists’ accuracy without CAD to their accuracy with CAD to determine if CAD improves radiologists’ accuracy. Seven radiologists from two institutions participated in the study. The readers had varying levels of overall experience with abdominal imaging, as well as varying levels of training with CTC imaging technology. Overall, the 7 were considered inexperienced CTC readers.
Two hundred seventy patients from six institutions were compiled in this retrospective design. These 270 patients had undergone CTC for one of the following reasons: screening; follow-up exams for polyps detected in a prior exam; or failed prior colonoscopy, including patients at risk for colon polyps/carcinoma who were deemed unsuitable candidates for colonoscopy. An expert abdominal imager with extensive CTC experience and with knowledge of each patient’s follow-up (clinical, imaging, pathologic, and surgical) stratified the 270-patient sample into presence versus absence of a polyp; cases with a polyp were further stratified by polyp size (less than 6 mm “small”, 6-9 mm “medium”, or 10 mm or larger “large”). One hundred forty-one training cases were randomly sampled from the different strata to improve the CAD algorithm and train the readers. From the remaining 119 test cases, 30 were randomly selected for this reader performance study; the study sample was composed of 25 positive cases with at least one polyp of medium to large size (a total of 39 polyps) and five cases with no polyps.
The seven readers were each given a unique order for reading the 30 images. First, without CAD, the reader marked all findings, using a pulldown window to identify the location of each finding according to one of eight colon segments. The reader then scored each finding according to his or her confidence that a polyp was present: 1=definitely not a polyp; 2=probably not a polyp; 3=indeterminate; 4=maybe a polyp; and 5=definitely a polyp. When the reader’s interpretation without CAD was complete, the reader was given a list of potential polyps detected by the CAD. Any CAD marks that coincided with a lesion found by the reader without CAD were not presented to the reader and were discarded; new CAD marks were scored by the reader using the same 1-5 rating scale. The investigators in this study wanted to know whether CAD improves inexperienced radiologists’ accuracy over their accuracy without CAD (the “unaided setting”). The seven-reader design gives a better estimate of reader accuracy, but it also complicates the analyses because the readers’ findings are correlated: all seven interpreted the same sample of 30 patients. Sensitivity was defined for this study as correct detection of a polyp in a patient with polyps and, in addition, required that the reader identify the correct location of the polyp; if the wrong location was chosen, the missed polyp was considered a false negative. In this study patients can have multiple true positives (i.e., multiple correctly located polyps in the same patient), as well as a mixture of true positive, false positive, false negative, and true negative findings. Clustered data complicate the statistical analyses, but statistical methods are presented in Chapters 4 and 5 to handle these data appropriately.
Excessive plaque formation, or stenosis, in the carotid (neck) artery can lead to transient ischemic attacks (TIAs) or even stroke. Conventional catheter angiography is an invasive diagnostic test used by physicians to examine the carotid arteries in patients who have suffered a TIA or stroke. Because the test is invasive, there are risks associated with the test including stroke and death. Magnetic Resonance Angiography (MRA) is a non-invasive test that may help physicians examine the carotid arteries without risk. Patients with other cardiovascular problems who are at high risk for plaque formation in the carotid arteries can also benefit from such a noninvasive screening test.
In this study, investigators (Thomas Masaryk, MD, Cleveland Clinic, Ohio, personal communication, 2007) wanted to assess the accuracy of MRA for detecting carotid artery plaque. Patients scheduled for a conventional catheter angiogram because they had suffered a recent stroke (symptomatic) or because they were at high risk for suffering a stroke in the future (asymptomatic) were asked to participate in this study. One hundred sixty-three patients were prospectively recruited for the study. These patients first underwent an MRA, then a conventional catheter angiogram.
Four radiologists from three institutions independently interpreted the conventional catheter angiograms, and the same four radiologists independently interpreted the MRA images. At least two weeks passed between the catheter angiogram and MRA interpretations; the study ID numbers were changed so that there was no obvious connection between the catheter angiogram and MRA images.
A significant stenosis requiring surgical intervention was defined as stenosis that blocked 60-99 percent of the carotid vessel. Note that arteries that are completely blocked (100 percent stenosis, or occlusions) are not considered good surgical lesions. The radiologists were asked to grade their confidence that a significant stenosis was present using a 5-point scale: 1=definitely no significant stenosis, 2=probably no significant stenosis, 3=equivocal, 4=probably significant stenosis, and 5=definitely significant stenosis. They were also asked to indicate the percent of stenosis present (a number between 0 and 100). The radiologists answered these questions for both the left and right carotid arteries on both MRA and conventional catheter angiography.
In this study the investigators want to know the accuracy of MRA and whether or not it can replace the conventional invasive test, catheter angiography. The data are complicated by the multiple-reader design, as well as by the fact that the data are clustered (i.e. findings from both the left and right carotid arteries in the same patient). There are several patient characteristics, such as gender, age, and symptoms, which the investigators suspect may affect the accuracy of MRA. In Chapter 3, we discuss the kinds of effects that covariates can have on diagnostic test accuracy; in Chapter 8, we discuss various regression methods to handle covariate data. Finally, we note that the gold standard for this study, catheter angiography, is not a perfect test and radiologists often disagree in their interpretations of its findings. Fortunately, there are statistical methods, which we describe in Chapter 11, that deal with studies with imperfect reference standards.
A variety of software has been written to implement many of the statistical methods discussed in this book. These programs are available in FORTRAN, SAS macros (SAS Institute, Cary, North Carolina, USA), Stata (Stata Data Analysis and Statistical Software, StataCorp LP, College Station, Texas), and R (free software available at http://www.r-project.org/). The authors have prepared a Web site (http://faculty.washington.edu/azhou/books/diagnostic.html) that contains links to some useful software. The Web site will be maintained and updated periodically for at least five years after this book’s publication date.
Although this book covers the main themes in statistical methods for diagnostic medicine, it does not cover several related topics, as follows.
Decision analysis, cost-effectiveness analysis, and cost-benefit analysis are methods commonly used to quantify the long-term, or downstream, effects of a test on the patient and society. In Chapters 2 and 4, we discuss how these methods can be applied to find the optimal cutpoint on the ROC curve. Description of how to perform these methods, however, is beyond the scope of this book. There are many excellent references on these topics, including (Gold et al., 1996), (Pauker and Kassirer, 1975), (Russell et al., 1996), (Weinstein et al., 1996), (Drummond et al., 2005), (Glick et al., 2007), and (Willan and Briggs, 2006).
Most of the methods we present for estimation and hypothesis testing are from a frequentist perspective. Bayesian methods can also be used, whereby one incorporates into the assessment of the diagnostic test some previously acquired information or expert opinion about a test’s characteristics or information about the patient or population. Examples of Bayesian methods used in diagnostic testing include Gatsonis (1995); Joseph et al. (1995); Peng and Hall (1996); Hellmich et al. (1988); O’Malley and Zou (2001); Broemeling (2007).
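As a toy illustration of the Bayesian perspective (a hedged sketch, not drawn from any of the papers cited above): a test’s sensitivity can be given a Beta prior encoding previously acquired information or expert opinion, and a Binomial count of true positives among diseased patients then yields a closed-form Beta posterior by conjugacy. The prior parameters and data below are hypothetical.

```python
# Toy Bayesian update for a test's sensitivity (illustrative only).
# Prior: sensitivity ~ Beta(a, b), e.g. from earlier studies or expert opinion.
# Data: tp true positives observed among n diseased patients (Binomial).
# Posterior: Beta(a + tp, b + n - tp) by Beta-Binomial conjugacy.

def posterior_sensitivity(a, b, tp, n):
    """Return (posterior mean, posterior a, posterior b)."""
    a_post, b_post = a + tp, b + (n - tp)
    return a_post / (a_post + b_post), a_post, b_post

# A Beta(8, 2) prior encodes a belief that sensitivity is around 0.80;
# observing 45 detections among 50 diseased patients shifts it upward.
mean, a_post, b_post = posterior_sensitivity(8, 2, 45, 50)
print(round(mean, 3))  # 0.883
```

The posterior mean, (a + tp) / (a + b + n), is a weighted compromise between the prior mean and the observed detection rate, which is the essential mechanism of the Bayesian approach sketched here.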
Finally, when there are multiple diagnostic tests performed on a patient, we may want to combine the information from the tests in order to make the best possible diagnosis. See, for example, Pepe and Thompson (2000), Zhou et al. (2011), and Lin et al. (2011) for various methods for combining tests’ results to optimize diagnostic accuracy.
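One elementary way to combine two binary tests, offered here only as a sketch and not as the methods of the cited papers, is the “believe the positive” rule: declare the patient positive if either test is positive. Under the (often optimistic) assumption that the tests are conditionally independent given true condition status, the combined sensitivity and specificity have simple closed forms.

```python
# "Believe the positive" (OR) rule for combining two binary tests,
# assuming conditional independence given disease status (an assumption
# that should be checked in practice). The rule raises sensitivity at
# the cost of specificity.

def believe_the_positive(se1, sp1, se2, sp2):
    """Return (combined sensitivity, combined specificity)."""
    se = 1 - (1 - se1) * (1 - se2)  # positive if either test is positive
    sp = sp1 * sp2                  # negative only if both tests are negative
    return se, sp

# Hypothetical operating points for two tests:
se, sp = believe_the_positive(0.80, 0.90, 0.70, 0.85)
print(round(se, 3), round(sp, 3))  # 0.94 0.765
```

The mirror-image “believe the negative” (AND) rule swaps the two formulas, raising specificity instead; the regression and likelihood-based methods in the references above generalize well beyond these two rules.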
In this chapter we describe various measures of the accuracy of diagnostic tests. We begin by introducing measures of intrinsic accuracy: a test’s inherent ability to correctly detect a condition when it is actually present and to correctly rule out a condition when it is truly absent. These attributes are considered fundamental to the tests themselves; they do not change for different samples of patients with different prevalence rates of disease. It is important to recognize, however, that these attributes can change somewhat over time and across populations as the technical specifications of the imaging machine, the clinician interpreting the test, and the characteristics of the patients (e.g., severity of disease) change.
The intrinsic accuracy of a test is measured by comparing the test results to the true condition status of the patient. We assume for most of our discussion that the true condition status is one of two mutually exclusive states: “the condition is present” or “the condition is absent.” Some examples are the presence versus the absence of parathyroid disease, the presence of a malignant versus benign tumor, and the presence of one versus more than one tumor. We determine the true condition status by means of a gold standard. A gold standard is a source of information, completely different from the test or tests under evaluation, which tells us the true condition status of the patient. Different gold standards are used for different tests and applications; some common examples are autopsy reports, surgery findings, pathology results from biopsy specimens, and the results of other diagnostic tests that have perfect or near-perfect accuracy. In Chapter 3, we discuss the selection of a gold standard in more detail; in Chapter 11, we present statistical methods for measuring diagnostic accuracy without a gold standard.
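The comparison of test results against a gold standard can be sketched directly in code. The data below are hypothetical and the formal definitions of these measures appear later in the chapter; the sketch simply cross-tabulates binary test results against true condition status and reads off the two intrinsic accuracy measures.

```python
# Sketch: sensitivity and specificity from binary test results
# (1 = positive, 0 = negative) compared against gold-standard condition
# status (1 = condition present, 0 = condition absent). Hypothetical data.

def accuracy_measures(test, truth):
    tp = sum(t == 1 and d == 1 for t, d in zip(test, truth))  # true positives
    fn = sum(t == 0 and d == 1 for t, d in zip(test, truth))  # false negatives
    tn = sum(t == 0 and d == 0 for t, d in zip(test, truth))  # true negatives
    fp = sum(t == 1 and d == 0 for t, d in zip(test, truth))  # false positives
    sensitivity = tp / (tp + fn)  # P(test positive | condition present)
    specificity = tn / (tn + fp)  # P(test negative | condition absent)
    return sensitivity, specificity

truth = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]   # gold standard: 4 diseased, 6 not
test  = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]   # test results for the same patients
se, sp = accuracy_measures(test, truth)
print(se, round(sp, 3))  # 0.75 0.833
```

Note that neither number involves the prevalence of the condition in the sample, which is what makes these measures “intrinsic” in the sense described above.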
