Praise for the First Edition
" . . . the book is a valuable addition to the literature in the field, serving as a much-needed guide for both clinicians and advanced students."—Zentralblatt MATH
A new edition of the cutting-edge guide to diagnostic tests in medical research
In recent years, a considerable amount of research has focused on evolving methods for designing and analyzing diagnostic accuracy studies. Statistical Methods in Diagnostic Medicine, Second Edition continues to provide a comprehensive approach to the topic, guiding readers through the necessary practices for understanding these studies and generalizing the results to patient populations.
Following a basic introduction to measuring test accuracy and study design, the authors successfully define various measures of diagnostic accuracy, describe strategies for designing diagnostic accuracy studies, and present key statistical methods for estimating and comparing test accuracy. Topics new to the Second Edition include:
Methods for tests designed to detect and locate lesions
Recommendations for covariate-adjustment
Methods for estimating and comparing predictive values and sample size calculations
Correcting techniques for verification and imperfect standard biases
Sample size calculation for multiple reader studies when pilot data are available
Updated meta-analysis methods, now incorporating random effects
Three case studies thoroughly showcase some of the questions and statistical issues that arise in diagnostic medicine, with all associated data provided in detailed appendices. A related web site features Fortran, SAS®, and R software packages so that readers can conduct their own analyses.
Statistical Methods in Diagnostic Medicine, Second Edition is an excellent supplement for biostatistics courses at the graduate level. It also serves as a valuable reference for clinicians and researchers working in the fields of medicine, epidemiology, and biostatistics.
Number of pages: 925
Year of publication: 2014
CONTENTS
LIST OF FIGURES
LIST OF TABLES
0.1 PREFACE
0.2 ACKNOWLEDGEMENTS
PART I: BASIC CONCEPTS AND METHODS
CHAPTER 1: INTRODUCTION
1.1 DIAGNOSTIC TEST ACCURACY STUDIES
1.2 CASE STUDIES
1.3 SOFTWARE
1.4 TOPICS NOT COVERED IN THIS BOOK
CHAPTER 2: MEASURES OF DIAGNOSTIC ACCURACY
2.1 SENSITIVITY AND SPECIFICITY
2.2 COMBINED MEASURES OF SENSITIVITY AND SPECIFICITY
2.3 RECEIVER OPERATING CHARACTERISTIC (ROC) CURVE
2.4 AREA UNDER THE ROC CURVE
2.5 SENSITIVITY AT FIXED FPR
2.6 PARTIAL AREA UNDER THE ROC CURVE
2.7 LIKELIHOOD RATIOS
2.8 ROC ANALYSIS WHEN THE TRUE DIAGNOSIS IS NOT BINARY
2.9 C-STATISTICS AND OTHER MEASURES TO COMPARE PREDICTION MODELS
2.10 DETECTION AND LOCALIZATION OF MULTIPLE LESIONS
2.11 POSITIVE AND NEGATIVE PREDICTIVE VALUES, BAYES THEOREM, AND CASE STUDY 2
2.12 OPTIMAL DECISION THRESHOLD ON THE ROC CURVE
2.13 INTERPRETING THE RESULTS OF MULTIPLE TESTS
CHAPTER 3: DESIGN OF DIAGNOSTIC ACCURACY STUDIES
3.1 ESTABLISH THE OBJECTIVE OF THE STUDY
3.2 IDENTIFY THE TARGET PATIENT POPULATION
3.3 SELECT A SAMPLING PLAN FOR PATIENTS
3.4 SELECT THE GOLD STANDARD
3.5 CHOOSE A MEASURE OF ACCURACY
3.6 IDENTIFY TARGET READER POPULATION
3.7 SELECT SAMPLING PLAN FOR READERS
3.8 PLAN DATA COLLECTION
3.9 PLAN DATA ANALYSES
3.10 DETERMINE SAMPLE SIZE
CHAPTER 4: ESTIMATION AND HYPOTHESIS TESTING IN A SINGLE SAMPLE
4.1 BINARY-SCALE DATA
4.2 ORDINAL-SCALE DATA
4.3 CONTINUOUS-SCALE DATA
4.4 TESTING THE HYPOTHESIS THAT THE ROC CURVE AREA OR PARTIAL AREA IS A SPECIFIC VALUE
CHAPTER 5: COMPARING THE ACCURACY OF TWO DIAGNOSTIC TESTS
5.1 BINARY-SCALE DATA
5.2 ORDINAL- AND CONTINUOUS-SCALE DATA
5.3 TESTS OF EQUIVALENCE
CHAPTER 6: SAMPLE SIZE CALCULATIONS
6.1 STUDIES ESTIMATING THE ACCURACY OF A SINGLE TEST
6.2 SAMPLE SIZE FOR DETECTING A DIFFERENCE IN ACCURACIES OF TWO TESTS
6.3 SAMPLE SIZE FOR ASSESSING NON-INFERIORITY OR EQUIVALENCY OF TWO TESTS
6.4 SAMPLE SIZE FOR DETERMINING A SUITABLE CUTOFF VALUE
6.5 SAMPLE SIZE DETERMINATION FOR MULTI-READER STUDIES
6.6 ALTERNATIVE TO SAMPLE SIZE FORMULAE
CHAPTER 7: INTRODUCTION TO META-ANALYSIS FOR DIAGNOSTIC ACCURACY STUDIES
7.1 OBJECTIVES
7.2 RETRIEVAL OF THE LITERATURE
7.3 INCLUSION/EXCLUSION CRITERIA
7.4 EXTRACTING INFORMATION FROM THE LITERATURE
7.5 STATISTICAL ANALYSIS
7.6 PUBLIC PRESENTATION
PART II: ADVANCED METHODS
CHAPTER 8: REGRESSION ANALYSIS FOR INDEPENDENT ROC DATA
8.1 FOUR CLINICAL STUDIES
8.2 REGRESSION MODELS FOR CONTINUOUS-SCALE TESTS
8.3 REGRESSION MODELS FOR ORDINAL-SCALE TESTS
8.4 COVARIATE ADJUSTED ROC CURVES OF CONTINUOUS-SCALE TESTS
CHAPTER 9: ANALYSIS OF MULTIPLE READER AND/OR MULTIPLE TEST STUDIES
9.1 STUDIES COMPARING MULTIPLE TESTS WITH COVARIATES
9.2 STUDIES WITH MULTIPLE READERS AND MULTIPLE TESTS
9.3 ANALYSIS OF MULTIPLE TESTS DESIGNED TO LOCATE AND DIAGNOSE LESIONS
CHAPTER 10: METHODS FOR CORRECTING VERIFICATION BIAS
10.1 EXAMPLES
10.2 IMPACT OF VERIFICATION BIAS
10.3 A SINGLE BINARY-SCALE TEST
10.4 CORRELATED BINARY-SCALE TESTS
10.5 A SINGLE ORDINAL-SCALE TEST
10.6 CORRELATED ORDINAL-SCALE TESTS
10.7 CONTINUOUS-SCALE TESTS
CHAPTER 11: METHODS FOR CORRECTING IMPERFECT GOLD STANDARD BIAS
11.1 EXAMPLES
11.2 IMPACT OF IMPERFECT GOLD STANDARD BIAS
11.3 ONE SINGLE BINARY TEST IN A SINGLE POPULATION
11.4 ONE SINGLE BINARY TEST IN G POPULATIONS
11.5 MULTIPLE BINARY TESTS IN ONE SINGLE POPULATION
11.6 MULTIPLE BINARY TESTS IN G POPULATIONS
11.7 MULTIPLE ORDINAL-SCALE TESTS IN ONE SINGLE POPULATION
11.8 MULTIPLE-SCALE TESTS IN ONE SINGLE POPULATION
CHAPTER 12: STATISTICAL ANALYSIS FOR META-ANALYSIS
12.1 BINARY-SCALE DATA
12.2 ORDINAL- OR CONTINUOUS-SCALE DATA
12.3 ROC CURVE AREA
APPENDIX A: CASE STUDIES AND CHAPTER 8 DATA
APPENDIX B: JACKKNIFE AND BOOTSTRAP METHODS OF ESTIMATING VARIANCES AND CONFIDENCE INTERVALS
REFERENCES
INDEX
WILEY SERIES IN PROBABILITY AND STATISTICS
Established by WALTER A. SHEWHART and SAMUEL S. WILKS
Editors: David J. Balding, Noel A. C. Cressie, Garrett M. Fitzmaurice, Iain M. Johnstone, Geert Molenberghs, David W. Scott, Adrian F. M. Smith, Ruey S. Tsay, Sanford Weisberg
Editors Emeriti: Vic Barnett, J. Stuart Hunter, Joseph B. Kadane, Jozef L. Teugels
A complete list of the titles in this series appears at the end of this volume.
Copyright © 2011 by John Wiley & Sons, Inc. All rights reserved.
Published by John Wiley & Sons, Inc., Hoboken, New Jersey.
Published simultaneously in Canada.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permission.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.
Library of Congress Cataloging-in-Publication Data is available.
ISBN 978-0-470-18314-4
This book is dedicated to Yea-Jae, Ralph, and Tom
0.1 PREFACE
Diagnostic tests play a pivotal role in medicine, often determining what additional diagnostic tests, treatments, and interventions are needed and ultimately affecting patients' outcomes. Given the importance of this role, it is critical that clinicians are given reliable data about the accuracy of the diagnostic tests they order. These clinicians need well-designed diagnostic accuracy studies and they need to understand how the results of these studies apply to their patients. The purpose of this book, then, is two-fold: to provide a comprehensive approach to designing and analyzing diagnostic accuracy studies and to aid clinicians in understanding these studies and in generalizing study results to their patient populations.
Since the first edition, we have updated each chapter with recently published methods. These updates include new methods for tests designed to detect and locate lesions (see Chapters 2, 3, and 9), recommendations for the type of covariate-adjustment needed (Chapter 3) along with new methods for covariate-adjustment (Chapter 8), estimating and comparing predictive values (Chapters 4 and 5) and calculating sample size for studies using predictive values (Chapter 6), sample size calculation for multiple reader studies when pilot data are available (Chapter 6), new methods for correcting for verification bias in estimation of ROC curves of continuous-scale tests (Chapter 10), and new methods for correcting for imperfect standard bias in estimation of ROC curves of ordinal-scale or continuous-scale tests (Chapter 11).
We have also added three case studies: a positron emission tomography (PET) study comparing the accuracy of three tests for detecting diseased parathyroid glands, a computer-aided detection (CAD) study of colon polyps, and a magnetic resonance imaging study of atherosclerosis in the carotid arteries (see Chapter 1). The data from these case studies are provided in the Appendix and are used throughout the book as illustrations of various statistical methods.
The book is organized so that the more basic material about measures of test accuracy and study design appears first (Chapters 2 and 3, respectively), followed by chapters on statistical methods of data analysis, with real data examples to illustrate these methods. Chapters 4 and 5 illustrate methods for estimating accuracy and comparing tests' accuracies under a variety of study designs. Calculating the sample size required for a study is described in Chapter 6. Chapters 7 and 12 focus on the design and analysis of meta-analyses of diagnostic test accuracy. Chapters 8 and 9 look at models of diagnostic test accuracy for various patient subgroups and for multiple-reader studies, respectively. Corrections for estimates of test accuracy in studies with verification bias and imperfect gold standards are illustrated in Chapters 10 and 11. Chapters 1-3 are accessible to readers with a basic knowledge of statistical and medical terminology. Chapters 4-7 are geared to the data analyst with basic training in biostatistics. In Chapters 8-12 we provide more detailed statistical methodology for readers with more statistical training, but the examples in these chapters are accessible to all readers. The authors have prepared a Web site (http://faculty.washington.edu/azhou/books/diagnostic.html) that contains links to some useful software.
0.2 ACKNOWLEDGEMENTS
We are thankful to many colleagues for supporting us during the writing and publication of both the first (2002) and second (2011) editions of this book. Their helpful critiques and suggestions about the first edition have led to this improved second edition. In particular, we would like to thank Danping Liu and Zheyu Wang for their helpful comments on the manuscript and their computational assistance in implementing some of the methods discussed in the book. We would like to thank Dr. Thomas D. Koepsell for his helpful comments on the manuscript.
We would also like to thank our families for their understanding and encouragement. Dr. Xiao-Hua (Andrew) Zhou thanks his wife, Yea-Jae, and their children, Vanessa and Joshua. Dr. Nancy Obuchowski thanks her husband, Dr. Ralph Harvey, and their children, Tucker, Eli, and Scout. Dr. Donna McClish thanks her husband, Tom, and their daughter Amanda.
Diagnostic medicine is the process of identifying the disease, or condition, that a patient has, and ruling out conditions that the patient does not have, through assessment of the patient’s signs, symptoms, and results of various diagnostic tests. Diagnostic accuracy studies are research studies which examine the ability of diagnostic tests to discriminate between patients with and without the condition; these studies are the focus of this book.
A diagnostic test has several purposes: (1) to provide reliable information about the patient’s condition, (2) to influence the health care provider’s plan for managing the patient (Sox et al., 1989), and possibly, (3) to understand disease mechanism and natural history through research (e.g., the repeated testing of patients with chronic conditions) (McNeil and Adelstein, 1976). A test can serve these purposes only if the health care provider knows how to interpret it. Diagnostic test studies are conducted to tell us how diagnostic tests perform and, thus, how they should be interpreted. There are several measures of diagnostic test performance. Fryback and Thornbury (1991) described a hierarchical model for studying diagnostic performance for imaging tests. The model starts with image quality and progresses to diagnostic accuracy, effect on treatment decisions, impact on patient outcome, and finally costs to society. A key feature of the model is that for a diagnostic test to be efficacious at a higher level, it must be efficacious at all lower levels. The reverse is not true; for example, a new test may have better accuracy than a standard test but may be too costly (in terms of monetary expense and/or patient morbidity due to complications) to be efficacious. In this book, we deal exclusively with the assessment of diagnostic accuracy (level 2 of the hierarchical model), recognizing that it is only one step in the complete assessment of a diagnostic test.
Diagnostic test accuracy is simply the ability of the test to discriminate among alternative states of health (Zweig and Campbell, 1993). If a test’s results do not differ between alternative states of health, then the test has negligible accuracy; if the results do not overlap for the different health states, then the test has perfect accuracy. Most test accuracies fall between these two extremes. It’s important to recognize that a test result is not a true representation of the patient’s condition (Sox et al., 1989). Most diagnostic information is imperfect; it may influence the health care provider’s thinking, but uncertainty remains about the patient’s true condition. If the test is negative for the condition, should the health care provider assume that the patient is disease-free and thus send him or her home? If the test is positive, should the health care provider assume the patient has the condition and thus begin treatment? Finally, if the test result requires interpretation by a trained reader (e.g., a radiologist), should the health care provider seek a second interpretation?
To answer these critical questions, the health care provider needs to have information on the test’s absolute and relative capabilities and an understanding of the complex interactions between the test and the trained readers who interpret the imaging data (Beam, 1992). The health care provider must ask: How does the test perform among patients with the condition (i.e., the test’s sensitivity)? How does the test perform among patients without the condition (i.e., the test’s specificity)? Does the test serve as a replacement for an older test or should multiple tests be performed? If multiple tests are performed, how should they be executed (i.e., sequentially or in parallel)? How reproducible are interpretations by different readers? These sorts of questions are addressed in diagnostic test accuracy studies.
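As a concrete illustration of the first two questions, sensitivity and specificity can be computed directly from a 2×2 cross-tabulation of test results against the gold standard. The following sketch uses invented counts, not data from the book:

```python
def sensitivity_specificity(tp, fn, fp, tn):
    """Compute sensitivity and specificity from 2x2 counts.

    tp: diseased patients with a positive test (true positives)
    fn: diseased patients with a negative test (false negatives)
    fp: non-diseased patients with a positive test (false positives)
    tn: non-diseased patients with a negative test (true negatives)
    """
    sens = tp / (tp + fn)   # P(test positive | diseased)
    spec = tn / (tn + fp)   # P(test negative | not diseased)
    return sens, spec

# Hypothetical counts for illustration only: 50 diseased, 100 non-diseased.
sens, spec = sensitivity_specificity(tp=45, fn=5, fp=10, tn=90)
print(f"sensitivity={sens:.2f}  specificity={spec:.2f}")  # 0.90 and 0.90
```

Chapter 2 defines these measures formally and extends them to the ROC curve.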
Diagnostic test accuracy studies have three common features: a sample of subjects who have, or will, undergo one or more of the diagnostic medical tests under evaluation; some form of interpretation or scoring of the test’s findings; and a reference, or gold standard, to which the test findings are compared. This may sound simple enough, but diagnostic accuracy studies are difficult to design. Here are three common misperceptions about diagnostic test accuracy.
The first misperception involves the interpretation of diagnostic tests. Investigators of new diagnostic tests sometimes develop criteria for interpreting their tests based only on the findings from healthy volunteers. For example, in a new test to detect pancreatitis, investigators measure the amount of a certain enzyme in healthy volunteers. A typical decision criterion, or cutpoint, is three standard deviations (SDs) below the mean of the normals. New patients with an enzyme level of three SDs below the mean of the healthy volunteers are labeled “positive” for pancreatitis; patients with enzyme levels above this cutpoint are labeled “negative”. In proposing such a criterion, investigators fail to recognize (1) the relevance of the natural distributions of the test results (i.e. are they really Gaussian [normal]?); (2) the magnitude of any overlap between the test results of patients with and without pancreatitis (i.e. are the test results from most pancreatitis patients 3 SDs below the mean?); (3) the clinical significance of diagnostic errors (i.e. falsely labeling a patient without pancreatitis as “positive” for the condition and falsely labeling a patient with pancreatitis as “negative”); and (4) the poor generalization of results from studies based on healthy volunteers (i.e. healthy volunteers may have very different enzyme levels than sick patients without pancreatitis who might undergo the test). In Chapter 2, we discuss factors involved in determining optimal cutpoints for diagnostic tests; in Chapter 4, we discuss methods of finding optimal cutpoints and estimating diagnostic errors associated with them.
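The pitfall of a cutpoint chosen from healthy volunteers alone can be seen in a small simulation. All numbers below (means, SDs, sample sizes) are invented for illustration; the point is only that a 3-SD rule guarantees high specificity among the healthy but says nothing about sensitivity when the two distributions overlap:

```python
import random

random.seed(0)

# Hypothetical enzyme levels (arbitrary units): healthy volunteers vs.
# pancreatitis patients, with substantial overlap between the groups.
healthy = [random.gauss(100.0, 10.0) for _ in range(1000)]
diseased = [random.gauss(85.0, 10.0) for _ in range(1000)]

mean_h = sum(healthy) / len(healthy)
sd_h = (sum((x - mean_h) ** 2 for x in healthy) / (len(healthy) - 1)) ** 0.5
cutpoint = mean_h - 3 * sd_h  # "3 SDs below the mean of the normals"

# By construction, almost all healthy subjects fall above the cutpoint...
specificity = sum(x > cutpoint for x in healthy) / len(healthy)
# ...but most diseased subjects do too, so sensitivity is poor.
sensitivity = sum(x <= cutpoint for x in diseased) / len(diseased)
print(f"cutpoint={cutpoint:.1f}  specificity={specificity:.3f}  sensitivity={sensitivity:.3f}")
```

Under these assumed parameters the rule labels fewer than one diseased patient in five as "positive", despite near-perfect specificity.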
Another common misperception in diagnostic test studies is the notion that a rigorous assessment of a patient’s true condition - with the exclusion of patients for whom a less rigorous assessment was made - allows for a scientifically sound study. An example comes from literature on the use of ventilation-perfusion lung scans for diagnosing pulmonary emboli. The ventilation-perfusion lung scan is a noninvasive test used to screen high-risk patients for pulmonary emboli; its accuracy in various populations is unknown. Pulmonary angiography, on the other hand, is a highly accurate but invasive test. It is often used as a reference for assessing the accuracy of other tests. (See Chapter 2 for the definition and examples of gold standards.) To assess the accuracy of ventilation-perfusion lung scans, patients who have undergone both a ventilation-perfusion lung scan and a pulmonary angiogram are recruited, while patients who did not undergo the angiogram are excluded. Such a design usually leads to biased estimates of test accuracy. The reason is that the study sample is not representative of the patient population undergoing ventilation-perfusion lung scans - rather, patients with a positive scan are often recommended for angiograms, while patients with a negative scan are often not sent for an angiogram because of the risk of complications with it. In Chapter 3, we define work-up bias, and its most common form, verification bias, as well as strategies to avoid them. In Chapter 10, we present statistical methods developed specifically to correct for verification bias.
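The direction of verification bias can be demonstrated with a toy simulation (all rates below are invented, not estimates from the lung-scan literature): when test-positive patients are verified far more often than test-negative patients, the naive estimates computed from verified patients alone overstate sensitivity and understate specificity.

```python
import random

random.seed(1)

true_sens, true_spec, prevalence = 0.80, 0.90, 0.30  # assumed true values
n = 100_000

tp = fn = fp = tn = 0
for _ in range(n):
    diseased = random.random() < prevalence
    positive = random.random() < (true_sens if diseased else 1 - true_spec)
    # Verification depends on the screening result: positives nearly always
    # go on to the gold standard (e.g., angiography); negatives rarely do.
    verified = random.random() < (0.95 if positive else 0.10)
    if not verified:
        continue
    if diseased and positive:
        tp += 1
    elif diseased:
        fn += 1
    elif positive:
        fp += 1
    else:
        tn += 1

naive_sens = tp / (tp + fn)  # biased upward relative to true_sens = 0.80
naive_spec = tn / (tn + fp)  # biased downward relative to true_spec = 0.90
print(f"naive sensitivity={naive_sens:.3f}  naive specificity={naive_spec:.3f}")
```

Under these assumed verification rates, the naive sensitivity lands near 0.97 and the naive specificity near 0.49, far from the true 0.80 and 0.90.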
A third error common in diagnostic test accuracy studies involves confusion between accuracy and agreement. Investigators sometimes draw incorrect conclusions about a new test’s diagnostic accuracy because it agrees well with a conventional test; however, what if the new and conventional tests do not agree? We cannot simply conclude that the new test has inferior accuracy. In fact, a new test with superior accuracy will necessarily disagree with the conventional test at times. Similarly, the two tests may have the same accuracy but make mistakes on different patients, resulting in poor agreement. A more valid approach to assessing a new test’s diagnostic accuracy is to compare both tests against a gold standard reference. Assessment of diagnostic accuracy is usually more difficult than assessment of agreement, but it is a more relevant and valid approach (Zweig and Campbell, 1993). In Chapter 5, we present methods for comparing the accuracy of two tests when the true diagnoses of the patients are known; in Chapter 11 we present methods for comparing two tests’ accuracies when the true diagnoses are unknown.
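A tiny worked example (fabricated results, for illustration only) makes the accuracy-versus-agreement distinction concrete: two tests can be equally accurate yet disagree on a large fraction of patients, simply because they err on different patients.

```python
# Hypothetical results for 10 patients with known true status (1 = diseased).
truth  = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
test_a = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]  # 8/10 correct
test_b = [0, 1, 1, 1, 1, 1, 0, 0, 0, 0]  # also 8/10 correct, but errs on different patients

acc_a = sum(a == t for a, t in zip(test_a, truth)) / len(truth)
acc_b = sum(b == t for b, t in zip(test_b, truth)) / len(truth)
agreement = sum(a == b for a, b in zip(test_a, test_b)) / len(truth)
print(f"accuracy A={acc_a}  accuracy B={acc_b}  agreement={agreement}")
# Both tests are 80% accurate, yet they agree on only 60% of patients.
```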
There is no question that studies of diagnostic test accuracy are challenging to design and require specialized statistical methods for their analysis. We will present and illustrate concepts and methods for designing, analyzing, interpreting, and reporting studies of diagnostic test accuracy. In Part I (Chapters 2-7) we define various measures of diagnostic accuracy, describe strategies for designing diagnostic accuracy studies, and present the basic statistical methods for estimating and comparing test accuracy, calculating sample size, and synthesizing the literature for meta-analysis. In Part II (Chapters 8-12) we present more advanced statistical methods for describing a test’s accuracy when patient characteristics affect it, for analyzing multi-reader studies, studies with verification bias or imperfect gold standards, and for performing meta-analyses.
We introduce three diagnostic test accuracy studies to illustrate the kinds of designs, questions, and statistical issues that arise in diagnostic medicine. These case studies, along with many other examples, are used throughout the book to illustrate various statistical methods. The datasets for these case studies are given in the Appendix at the end of the book.
Parathyroid glands are small endocrine glands usually located in the neck or upper chest that produce a hormone that controls the body’s calcium levels. Most people have four parathyroid glands. In the most common form of parathyroid disease, one of these glands grows into a benign tumor, called a parathyroid adenoma, which produces excess amounts of parathyroid hormone. In a less common condition, called parathyroid hyperplasia, all four parathyroid glands become enlarged and secrete excess parathyroid hormone. In both conditions, a patient’s serum calcium levels become elevated, and the patient experiences loss of energy, depression, kidney stones, and headaches. Surgical removal of the offending parathyroid lesion is considered curative in most cases.
Single photon emission computed tomography (SPECT) using the radiopharmaceutical Tc-99m sestamibi is a nuclear medicine imaging test used to detect and localize parathyroid lesions prior to surgical intervention. In this prospective study (Donald Neumann, MD, PhD, Cleveland Clinic, Ohio, personal communication, 2007), 61 consecutive patients with hyperparathyroidism were imaged using a hybrid SPECT/CT instrument in an attempt to localize the diseased parathyroid glands preoperatively. Each patient underwent SPECT imaging, both with and without attenuation correction, as well as SPECT combined with CT imaging. Following imaging, the patients went to surgery to remove the diseased glands. The goal of the study was to compare the accuracy of these three tests.
One expert nuclear radiologist, blinded to the surgical findings, interpreted the images. On the SPECT imaging, each gland was scored on a scale from 1 to 7: 1=definitely no disease, 2=probably no disease, 3=indeterminate, 4=maybe diseased, and 5, 6, and 7 all definitely diseased, distinguished by the intensity of the attenuation (5=low, 6=medium, and 7=high). The SPECT/CT images were scored using just the 1-5 part of the scale. For this study, SPECT images scored 1-3 were considered negative and scores of 4-7 positive; for SPECT/CT, scores of 1-3 were considered negative and scores of 4-5 positive. A total of 97 glands in 61 patients were localized by imaging prior to parathyroid surgery, the results of which were considered the gold standard.
The investigators wanted to compare the sensitivity and specificity of these three tests to determine which single test should be used for future patients. In Chapter 2, we show that one of these tests appears more sensitive than the others, while another test appears more specific. A comparison of the tests’ Receiver Operating Characteristic (ROC) curves gives us a complete understanding of the strengths and weaknesses of the three tests and thus allows us to identify the most suitable test for preoperative patients.
The data from this study are complicated by the fact that many of the 61 patients had multiple glands visualized at screening, so called “clustered data.” Observations from the same patient, even if from different glands, are usually correlated, at least to some small degree. If we ignore this correlation, then the resulting confidence intervals and p-values can be misleading. In Chapters 4 and 5, we describe a simple analysis method that can be used for clustered data so that confidence intervals and p-values are correct.
Polyps that form in the colon or rectum can progress to cancer without any signs or symptoms. Computed tomography colonography (CTC) is an imaging test that can detect polyps before they develop into cancer. Radiologists sometimes overlook polyps on the CTC images, however, and these missed polyps (“false negatives”) can develop into cancer, which can lead to symptoms, even death. Investigators have developed a computer algorithm, called computer-aided detection (CAD), to help radiologists detect polyps on the CTC. The CAD utilizes tissue intensity, volumetric and surface shape, and texture characteristics to identify suspicious areas. The CAD marks the suspicious areas for the reader to examine more closely. Often, the CAD identifies multiple suspicious areas on the same image. The radiologist must distinguish marked areas that contain a polyp (“true positive”) from marked areas that do not contain a polyp, for example a folded bowel lining (“false positive”).
In this study (Baker et al., 2007), the investigators wanted to compare radiologists’ accuracy without CAD to their accuracy with CAD to determine if CAD improves radiologists’ accuracy. Seven radiologists from two institutions participated in the study. The readers had varying levels of overall experience with abdominal imaging, as well as varying levels of training with CTC imaging technology. Overall, the 7 were considered inexperienced CTC readers.
Two hundred seventy patients from six institutions were compiled in this retrospective design. These 270 patients had undergone CTC for one of the following reasons: screening; follow-up exams for polyps detected in a prior exam; or failed prior colonoscopy, including patients at risk for colon polyps/carcinoma who were deemed unsuitable candidates for colonoscopy. An expert abdominal imager with extensive CTC experience and with knowledge of each patient’s follow-up (clinical, imaging, pathologic, and surgical) stratified the 270-patient sample into presence versus absence of a polyp; cases with a polyp were further stratified by polyp size (less than 6 mm “small”, 6-9 mm “medium”, or 10 mm or larger “large”). One hundred forty-one training cases were randomly sampled from the different strata to improve the CAD algorithm and train the readers. From the remaining 119 test cases, 30 were randomly selected for this reader performance study; the study sample was composed of 25 positive cases with at least one polyp of medium to large size (a total of 39 polyps) and five cases with no polyps.
The seven readers were each given a unique order for reading the 30 images. First, without CAD, the reader marked all findings, using a pulldown window to identify the location of each finding according to one of eight colon segments. The reader then scored each finding according to his or her confidence that a polyp was present: 1=definitely not a polyp; 2=probably not a polyp; 3=indeterminate; 4=maybe a polyp; and 5=definitely a polyp. When the reader’s interpretation without CAD was complete, the reader was given a list of potential polyps detected by the CAD. Any CAD marks that coincided with a lesion found by the reader without CAD were not presented to the reader and were discarded; new CAD marks were scored by the reader using the same 1-5 rating scale. The investigators in this study wanted to know whether CAD improves inexperienced radiologists’ accuracy over their accuracy without CAD (the “unaided setting”). The seven-reader design gives a better estimate of reader accuracy, but it also complicates the analyses because the readers’ findings are correlated: all seven interpreted the same sample of 30 patients. Sensitivity was defined for this study as correct detection of a polyp in a patient with polyps and, in addition, required that the reader identify the correct location of the polyp; if the wrong location was chosen, the missed polyp was considered a false negative. In this study patients can have multiple true positives (i.e., multiple correctly located polyps in the same patient), as well as a mixture of true positive, false positive, false negative, and true negative findings. Clustered data complicate the statistical analyses, but statistical methods are presented in Chapters 4 and 5 to handle these data appropriately.
Excessive plaque formation, or stenosis, in the carotid (neck) artery can lead to transient ischemic attacks (TIAs) or even stroke. Conventional catheter angiography is an invasive diagnostic test used by physicians to examine the carotid arteries in patients who have suffered a TIA or stroke. Because the test is invasive, there are risks associated with the test including stroke and death. Magnetic Resonance Angiography (MRA) is a non-invasive test that may help physicians examine the carotid arteries without risk. Patients with other cardiovascular problems who are at high risk for plaque formation in the carotid arteries can also benefit from such a noninvasive screening test.
In this study, investigators (Thomas Masaryk, MD, Cleveland Clinic, Ohio, personal communication, 2007) wanted to assess the accuracy of MRA for detecting carotid artery plaque. Patients scheduled for a conventional catheter angiogram because they had suffered a recent stroke (symptomatic) or because they were at high risk for suffering a stroke in the future (asymptomatic) were asked to participate in this study. One hundred sixty-three patients were prospectively recruited for the study. These patients first underwent an MRA, then a conventional catheter angiogram.
Four radiologists from three institutions independently interpreted the conventional catheter angiograms, and the same four radiologists independently interpreted the MRA images. At least two weeks passed between the catheter angiogram and MRA interpretations; the study ID numbers were changed so that there was no obvious connection between the catheter angiogram and MRA images.
A significant stenosis requiring surgical intervention was defined as stenosis that blocked 60-99 percent of the carotid vessel. Note that arteries that are completely blocked (100 percent stenosis, or occlusions) are not considered good surgical lesions. The radiologists were asked to grade their confidence that a significant stenosis was present using a 5-point scale: 1=definitely no significant stenosis, 2=probably no significant stenosis, 3=equivocal, 4=probably significant stenosis, and 5=definitely significant stenosis. They were also asked to indicate the percent of stenosis present (a number between 0 and 100). The radiologists answered these questions for both the left and right carotid arteries on both MRA and conventional catheter angiography.
In this study the investigators want to know the accuracy of MRA and whether or not it can replace the conventional invasive test, catheter angiography. The data are complicated by the multiple-reader design, as well as by the fact that the data are clustered (i.e. findings from both the left and right carotid arteries in the same patient). There are several patient characteristics, such as gender, age, and symptoms, which the investigators suspect may affect the accuracy of MRA. In Chapter 3, we discuss the kinds of effects that covariates can have on diagnostic test accuracy; in Chapter 8, we discuss various regression methods to handle covariate data. Finally, we note that the gold standard for this study, catheter angiography, is not a perfect test and radiologists often disagree in their interpretations of its findings. Fortunately, there are statistical methods, which we describe in Chapter 11, that deal with studies with imperfect reference standards.
A variety of software has been written to implement many of the statistical methods discussed in this book. These programs are available in FORTRAN, SAS macros (SAS Institute, Cary, North Carolina, USA), Stata (Stata Data Analysis and Statistical Software, StataCorp LP, College Station, Texas), and R (free software available at http://www.r-project.org/). The authors have prepared a Web site (http://faculty.washington.edu/azhou/books/diagnostic.html) that contains links to some useful software. The Web site will be maintained and updated periodically for at least five years after this book’s publication date.
Although this book covers the main themes in statistical methods for diagnostic medicine, it does not cover several related topics, as follows.
Decision analysis, cost-effectiveness analysis, and cost-benefit analysis are methods commonly used to quantify the long-term, or downstream, effects of a test on the patient and society. In Chapters 2 and 4, we discuss how these methods can be applied to find the optimal cutpoint on the ROC curve. Description of how to perform these methods, however, is beyond the scope of this book. There are many excellent references on these topics, including (Gold et al., 1996), (Pauker and Kassirer, 1975), (Russell et al., 1996), (Weinstein et al., 1996), (Drummond et al., 2005), (Glick et al., 2007), and (Willan and Briggs, 2006).
Most of the methods we present for estimation and hypothesis testing are from a frequentist perspective. Bayesian methods can also be used, whereby one incorporates into the assessment of the diagnostic test some previously acquired information or expert opinion about a test’s characteristics or information about the patient or population. Examples of Bayesian methods used in diagnostic testing include Gatsonis (1995); Joseph et al. (1995); Peng and Hall (1996); Hellmich et al. (1988); O’Malley and Zou (2001); Broemeling (2007).
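As a toy illustration of the Bayesian perspective (a hedged sketch, not drawn from any of the papers cited above): a test’s sensitivity can be given a Beta prior encoding previously acquired information or expert opinion, and a Binomial count of true positives among diseased patients then yields a closed-form Beta posterior by conjugacy. The prior parameters and data below are hypothetical.

```python
# Toy Bayesian update for a test's sensitivity (illustrative only).
# Prior: sensitivity ~ Beta(a, b), e.g. from earlier studies or expert opinion.
# Data: tp true positives observed among n diseased patients (Binomial).
# Posterior: Beta(a + tp, b + n - tp) by Beta-Binomial conjugacy.

def posterior_sensitivity(a, b, tp, n):
    """Return (posterior mean, posterior a, posterior b)."""
    a_post, b_post = a + tp, b + (n - tp)
    return a_post / (a_post + b_post), a_post, b_post

# A Beta(8, 2) prior encodes a belief that sensitivity is around 0.80;
# observing 45 detections among 50 diseased patients shifts it upward.
mean, a_post, b_post = posterior_sensitivity(8, 2, 45, 50)
print(round(mean, 3))  # 0.883
```

The posterior mean, (a + tp) / (a + b + n), is a weighted compromise between the prior mean and the observed detection rate, which is the essential mechanism of the Bayesian approach sketched here.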
Finally, when there are multiple diagnostic tests performed on a patient, we may want to combine the information from the tests in order to make the best possible diagnosis. See, for example, Pepe and Thompson (2000), Zhou et al. (2011), and Lin et al. (2011) for various methods for combining tests’ results to optimize diagnostic accuracy.
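One elementary way to combine two binary tests, offered here only as a sketch and not as the methods of the cited papers, is the “believe the positive” rule: declare the patient positive if either test is positive. Under the (often optimistic) assumption that the tests are conditionally independent given true condition status, the combined sensitivity and specificity have simple closed forms.

```python
# "Believe the positive" (OR) rule for combining two binary tests,
# assuming conditional independence given disease status (an assumption
# that should be checked in practice). The rule raises sensitivity at
# the cost of specificity.

def believe_the_positive(se1, sp1, se2, sp2):
    """Return (combined sensitivity, combined specificity)."""
    se = 1 - (1 - se1) * (1 - se2)  # positive if either test is positive
    sp = sp1 * sp2                  # negative only if both tests are negative
    return se, sp

# Hypothetical operating points for two tests:
se, sp = believe_the_positive(0.80, 0.90, 0.70, 0.85)
print(round(se, 3), round(sp, 3))  # 0.94 0.765
```

The mirror-image “believe the negative” (AND) rule swaps the two formulas, raising specificity instead; the regression and likelihood-based methods in the references above generalize well beyond these two rules.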
In this chapter we describe various measures of the accuracy of diagnostic tests. We begin by introducing measures of intrinsic accuracy: a test’s inherent ability to correctly detect a condition when it is actually present and to correctly rule out a condition when it is truly absent. These attributes are considered fundamental to the tests themselves; they do not change for different samples of patients with different prevalence rates of disease. It is important to recognize, however, that these attributes can change somewhat over time and across populations as the technical specifications of the imaging machine, the clinician interpreting the test, and the characteristics of the patients (e.g., severity of disease) change.
The intrinsic accuracy of a test is measured by comparing the test results to the true condition status of the patient. We assume for most of our discussion that the true condition status is one of two mutually exclusive states: “the condition is present” or “the condition is absent.” Some examples are the presence versus the absence of parathyroid disease, the presence of a malignant versus benign tumor, and the presence of one versus more than one tumor. We determine the true condition status by means of a gold standard. A gold standard is a source of information, completely different from the test or tests under evaluation, which tells us the true condition status of the patient. Different gold standards are used for different tests and applications; some common examples are autopsy reports, surgery findings, pathology results from biopsy specimens, and the results of other diagnostic tests that have perfect or near-perfect accuracy. In Chapter 3, we discuss the selection of a gold standard in more detail; in Chapter 11, we present statistical methods for measuring diagnostic accuracy without a gold standard.
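The comparison of test results against a gold standard can be sketched directly in code. The data below are hypothetical and the formal definitions of these measures appear later in the chapter; the sketch simply cross-tabulates binary test results against true condition status and reads off the two intrinsic accuracy measures.

```python
# Sketch: sensitivity and specificity from binary test results
# (1 = positive, 0 = negative) compared against gold-standard condition
# status (1 = condition present, 0 = condition absent). Hypothetical data.

def accuracy_measures(test, truth):
    tp = sum(t == 1 and d == 1 for t, d in zip(test, truth))  # true positives
    fn = sum(t == 0 and d == 1 for t, d in zip(test, truth))  # false negatives
    tn = sum(t == 0 and d == 0 for t, d in zip(test, truth))  # true negatives
    fp = sum(t == 1 and d == 0 for t, d in zip(test, truth))  # false positives
    sensitivity = tp / (tp + fn)  # P(test positive | condition present)
    specificity = tn / (tn + fp)  # P(test negative | condition absent)
    return sensitivity, specificity

truth = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]   # gold standard: 4 diseased, 6 not
test  = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]   # test results for the same patients
se, sp = accuracy_measures(test, truth)
print(se, round(sp, 3))  # 0.75 0.833
```

Note that neither number involves the prevalence of the condition in the sample, which is what makes these measures “intrinsic” in the sense described above.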
