Multiple Imputation and its Application - James Carpenter - E-Book


James Carpenter

Description

A practical guide to analysing partially observed data.

Collecting, analysing and drawing inferences from data is central to research in the medical and social sciences. Unfortunately, it is rarely possible to collect all the intended data. The literature on inference from the resulting incomplete data is now huge, and continues to grow both as methods are developed for large and complex data structures, and as increasing computer power and suitable software enable researchers to apply these methods.

This book focuses on a particular statistical method for analysing and drawing inferences from incomplete data, called Multiple Imputation (MI). MI is attractive because it is both practical and widely applicable. The authors' aim is to clarify the issues raised by missing data, describing the rationale for MI, the relationship between the various imputation models and associated algorithms, and its application to increasingly complex data structures.

Multiple Imputation and its Application:

  • Discusses the issues raised by the analysis of partially observed data, and the assumptions on which analyses rest.
  • Presents a practical guide to the issues to consider when analysing incomplete data from both observational studies and randomized trials.
  • Provides a detailed discussion of the practical use of MI with real-world examples drawn from medical and social statistics.
  • Explores handling non-linear relationships and interactions with multiple imputation, survival analysis, multilevel multiple imputation, sensitivity analysis via multiple imputation, using non-response weights with multiple imputation and doubly robust multiple imputation.

Multiple Imputation and its Application is aimed at quantitative researchers and students in the medical and social sciences, clarifying the issues raised by the analysis of incomplete data, outlining the rationale for MI and describing how to consider and address the issues that arise in its application.

Page count: 576

Publication year: 2012




Table of Contents

Statistics in Practice

Title Page

Copyright

Preface

Data Acknowledgements

Acknowledgements

Glossary

Part I: Foundations

Chapter 1: Introduction

1.1 Reasons for missing data

1.2 Examples

1.3 Patterns of missing data

1.4 Inferential framework and notation

1.5 Using observed data to inform assumptions about the missingness mechanism

1.6 Implications of missing data mechanisms for regression analyses

1.7 Summary

Chapter 2: The Multiple Imputation Procedure and Its Justification

2.1 Introduction

2.2 Intuitive outline of the MI procedure

2.3 The generic MI procedure

2.4 Bayesian justification of MI

2.5 Frequentist inference

2.6 Choosing the number of imputations

2.7 Some simple examples

2.8 MI in more general settings

2.9 Constructing congenial imputation models

2.10 Practical considerations for choosing imputation models

2.11 Discussion

Part II: Multiple Imputation for Cross-Sectional Data

Chapter 3: Multiple Imputation of Quantitative Data

3.1 Regression imputation with a monotone missingness pattern

3.2 Joint modelling

3.3 Full conditional specification

3.4 Full conditional specification versus joint modelling

3.5 Software for multivariate normal imputation

3.6 Discussion

Chapter 4: Multiple Imputation of Binary and Ordinal Data

4.1 Sequential imputation with monotone missingness pattern

4.2 Joint modelling with the multivariate normal distribution

4.3 Modelling binary data using latent normal variables

4.4 General location model

4.5 Full conditional specification

4.6 Issues with over-fitting

4.7 Pros and cons of the various approaches

4.8 Software

4.9 Discussion

Chapter 5: Multiple Imputation of Unordered Categorical Data

5.1 Monotone missing data

5.2 Multivariate normal imputation for categorical data

5.3 Maximum indicant model

5.4 General location model

5.5 FCS with categorical data

5.6 Perfect prediction issues with categorical data

5.7 Software

5.8 Discussion

Chapter 6: Nonlinear Relationships

6.1 Passive imputation

6.2 No missing data in nonlinear relationships

6.3 Missing data in nonlinear relationships

6.4 Discussion

Chapter 7: Interactions

7.1 Interaction variables fully observed

7.2 Interactions of categorical variables

7.3 General nonlinear relationships

7.4 Software

7.5 Discussion

Part III: Advanced Topics

Chapter 8: Survival Data, Skips and Large Datasets

8.1 Time-to-event data

8.2 Nonparametric, or ‘hot deck’ imputation

8.3 Multiple imputation for skips

8.4 Two-stage MI

8.5 Large datasets

8.6 Multiple imputation and record linkage

8.7 Measurement error

8.8 Multiple imputation for aggregated scores

8.9 Discussion

Chapter 9: Multilevel Multiple Imputation

9.1 Multilevel imputation model

9.2 MCMC algorithm for imputation model

9.3 Imputing level-2 covariates using FCS

9.4 Individual patient meta-analysis

9.5 Extensions

9.6 Discussion

Chapter 10: Sensitivity Analysis: MI Unleashed

10.1 Review of MNAR modelling

10.2 Framing sensitivity analysis

10.3 Pattern mixture modelling with MI

10.4 Pattern mixture approach with longitudinal data via MI

10.5 Piecing together post-deviation distributions from other trial arms

10.6 Approximating a selection model by importance weighting

10.7 Discussion

Chapter 11: Including Survey Weights

11.1 Using model based predictions

11.2 Bias in the MI variance estimator

11.3 A multilevel approach

11.4 Further developments

11.5 Discussion

Chapter 12: Robust Multiple Imputation

12.1 Introduction

12.2 Theoretical background

12.3 Robust multiple imputation

12.4 Simulation studies

12.5 The RECORD study

12.6 Discussion

Appendix A: Markov Chain Monte Carlo

Appendix B: Probability Distributions

B.1 Posterior for the multivariate normal distribution

Bibliography

Index of Authors

Index of Examples

Index

Statistics in Practice

Statistics in Practice

Series Advisory Editors

Marian Scott

University of Glasgow, UK

Stephen Senn

CRP-Santé, Luxembourg

Wolfgang Jank

University of Maryland, USA

Founding Editor

Vic Barnett

Nottingham Trent University, UK

Statistics in Practice is an important international series of texts which provide detailed coverage of statistical concepts, methods and worked case studies in specific fields of investigation and study.

With clear explanations and many worked practical examples, the books show in down-to-earth terms how to select and use an appropriate range of statistical techniques in a particular practical field within each title's special topic area.

The books provide statistical support for professionals and research workers across a range of fields of employment and research environments. Subject areas covered include medicine and pharmaceuticals; industry, finance and commerce; public services, and the earth and environmental sciences.

The books also provide support to students studying applied statistics courses in these areas. The demand for applied statistics graduates in these areas has led to such courses becoming increasingly prevalent at universities and colleges.

It is our aim to present judiciously chosen and well-written textbooks to meet everyday practical needs. Feedback from readers will be valuable in monitoring our success.

This edition first published 2013

© 2013 John Wiley & Sons, Ltd

Registered office

John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, United Kingdom

For details of our global editorial offices, for customer services and for information about how to apply for permission to reuse the copyright material in this book please see our website at www.wiley.com.

The right of the author to be identified as the author of this work has been asserted in accordance with the Copyright, Designs and Patents Act 1988.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by the UK Copyright, Designs and Patents Act 1988, without the prior permission of the publisher.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.

Designations used by companies to distinguish their products are often claimed as trademarks. All brand names and product names used in this book are trade names, service marks, trademarks or registered trademarks of their respective owners. The publisher is not associated with any product or vendor mentioned in this book. This publication is designed to provide accurate and authoritative information in regard to the subject matter covered. It is sold on the understanding that the publisher is not engaged in rendering professional services. If professional advice or other expert assistance is required, the services of a competent professional should be sought.

Library of Congress Cataloging-in-Publication Data

Carpenter, James R.

Multiple imputation and its application / James R. Carpenter, Michael G. Kenward. — 1st ed.

p. ; cm.

Includes bibliographical references and index.

ISBN 978-0-470-74052-1 (hardback)

I. Kenward, Michael G., 1956- II. Title.

[DNLM: 1. Data Interpretation, Statistical. 2. Biomedical Research – methods. WA 950]

610.72′4–dc23

2012028821

A catalogue record for this book is available from the British Library.

ISBN: 978-0-470-74052-1

Cover photograph courtesy of Harvey Goldstein

Preface

No study of any complexity manages to collect all the intended data. Analysis of the resulting partially collected data must therefore address the issues raised by the missing data. Unfortunately, the inferential consequences of missing data are not simply restricted to the proportion of missing observations. Instead, the interplay between the substantive questions and the reasons for the missing data is crucial. Thus, there is no simple, universal, solution.

Suppose, for the substantive question at hand, the inferential consequences of missing data are nontrivial. Then the analyst must make a set of assumptions about the reasons, or mechanisms, causing data to be missing, and perform an inferentially valid analysis under these assumptions. In this regard, analysis of a partially observed dataset is the same as any statistical analysis; the difference is that when data are missing we cannot assess the validity of these assumptions in the way we might do in a regression analysis, for example. Hence, sensitivity analysis, where we explore the robustness of inference to different assumptions about the reasons for missing data, is important.

Given a set of assumptions about the reasons data are missing, there are a number of statistical methods for carrying out the analysis. These include the EM algorithm, inverse probability weighting, a full Bayesian analysis and, depending on the setting, a direct application of maximum likelihood. These methods, and those derived from them, each have their own advantages in particular settings. Nevertheless, we argue that none shares the practical utility, broad applicability and relative simplicity of Rubin's Multiple Imputation (MI).

Following an introductory chapter outlining the issues raised by missing data, the focus of this book is therefore MI. We outline its theoretical basis, and then describe its application to a range of common analyses in the medical and social sciences, reflecting the wide application that MI has seen in recent years. In particular, we describe its application with nonlinear relationships and interactions, with survival data and with multilevel data. The last three chapters consider practical sensitivity analyses, combining MI with inverse probability weighting, and doubly robust MI.

Self-evidently, a key component of an MI analysis is the construction of an appropriate method of imputation. There is no unique, ideal, way in which this should be done. In particular, there has been some discussion in the literature about the relative merits of the joint modelling and full conditional specification approaches. We have found that thinking in terms of joint models is both natural and convenient for formulating imputation models, a range of which can then be (approximately) implemented using a full conditional specification approach. Differences in computational speed between joint modelling and full conditional specification are generally due to coding efficiency, rather than intrinsic superiority of one method over the other.

Throughout the book we illustrate the ideas with several examples. The code used for these examples, in various software packages, is available from the book's home page, which is at http://www.wiley.com/go/multiple_imputation, together with exercises to go with each chapter.

We welcome feedback from readers; any comments and corrections should be e-mailed to [email protected]. Unfortunately, we cannot promise to respond individually to each message.

Data Acknowledgements

We are grateful to the following:

AstraZeneca for permission to use data from the 5-arm asthma study in examples in Chapters 1, 3, 7 and 10;
GlaxoSmithKline for permission to use data from the dental pain study in Chapter 4, and the RECORD study in Chapter 12;
Mike English (Director, Child and Newborn Health Group, Kemri-Wellcome Trust Research Programme, Nairobi, Kenya) for permission to use data from a multifaceted intervention to implement guidelines and improve admission paediatric care in Kenyan district hospitals, in Chapter 9;
Peter Blatchford for permission to use data from the Class Size Study (Blatchford et al, 2002) in Chapter 9, and
Sarah Schroter for permission to use data from the study to improve the quality of peer review in Chapter 10.

In Chapters 1, 5, 8, 10 and 11 we have analysed data from the Youth Cohort Time Series for England, Wales and Scotland, 1984-2002 First Edition, Colchester, Essex, published by and freely available from the UK Data Archive, Study Number SN 5765. We thank Vernon Gayle for introducing us to these data.

In Chapter 6 we have analysed data from the Alzheimer's Disease Neuro-imaging Initiative (ADNI) database (adni.loni.ucla.edu). As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in the analysis or in the writing of this book. A complete listing of ADNI investigators can be found at: http://adni.loni.ucla.edu/wp-content/uploads/how_to_apply/ADNI_Acknowledgement_List.pdf.

The ADNI was launched in 2003 by the National Institute on Aging (NIA), the National Institute of Biomedical Imaging and Bioengineering (NIBIB), the Food and Drug Administration (FDA), private pharmaceutical companies and nonprofit organisations, as a $60 million, five-year public-private partnership. The primary goal of ADNI has been to test whether serial magnetic resonance imaging (MRI), positron emission tomography (PET), other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of mild cognitive impairment (MCI) and early Alzheimer's disease (AD). Determination of sensitive and specific markers of very early AD progression is intended to aid researchers and clinicians to develop new treatments and monitor their effectiveness, as well as lessen the time and cost of clinical trials.

The Principal Investigator of this initiative is Michael W. Weiner, MD, VA Medical Center and University of California San Francisco. ADNI is the result of the efforts of many co-investigators from a broad range of academic institutions and private corporations, and subjects have been recruited from over 50 sites across the US and Canada. The initial goal of ADNI was to recruit 800 adults, ages 55 to 90, to participate in the research, with approximately 200 cognitively normal older individuals to be followed for 3 years, 400 people with MCI to be followed for 3 years and 200 people with early AD to be followed for 2 years. For up-to-date information, see www.adni-info.org.

Data collection and sharing for this project was funded by the Alzheimer's Disease Neuroimaging Initiative (ADNI) (National Institutes of Health Grant U01 AG024904). ADNI is funded by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, and through generous contributions from the following: Abbott; Alzheimer's Association; Alzheimer's Drug Discovery Foundation; Amorfix Life Sciences Ltd.; AstraZeneca; Bayer HealthCare; BioClinica, Inc.; Biogen Idec, Inc.; Bristol-Myers Squibb Company; Eisai, Inc.; Elan Pharmaceuticals, Inc.; Eli Lilly and Company; F. Hoffmann-La Roche Ltd and its affiliated company Genentech, Inc.; GE Healthcare; Innogenetics, N.V.; IXICO Ltd.; Janssen Alzheimer Immunotherapy Research & Development, LLC; Johnson & Johnson Pharmaceutical Research & Development, LLC; Medpace, Inc.; Merck & Co., Inc.; Meso Scale Diagnostics, LLC; Novartis Pharmaceuticals Corporation; Pfizer, Inc.; Servier; Synarc, Inc.; and Takeda Pharmaceutical Company. The Canadian Institutes of Health Research is providing funds to support ADNI clinical sites in Canada. Private sector contributions are facilitated by the Foundation for the National Institutes of Health (www.fnih.org). The grantee organisation is the Northern California Institute for Research and Education, and the study is coordinated by the Alzheimer's Disease Cooperative Study at the University of California, San Diego. ADNI data are disseminated by the Laboratory for Neuro-imaging at the University of California, Los Angeles. This research was also supported by NIH grants P30 AG010129 and K01 AG030514.

In Chapter 7 we have analysed data from the 1958 National Childhood Development Study. This is published, and freely available from the UK Data Archive, Study Number SN 5565 (waves 0–3) and SN 5566 (wave 4). We thank Ian Plewis for introducing us to these data.

Acknowledgements

No book of this kind is written in a vacuum, and we are grateful to many friends and colleagues for research collaborations, stimulating discussions and comments on draft chapters.

In particular we would like to thank members of the Missing Data Imputation and Analysis (MiDIA) group, including (in alphabetical order) Jonathan Bartlett, John Carlin, Rhian Daniel, Dan Jackson, Shaun Seaman, Jonathan Sterne, Kate Tilling and Ian White.

We would also like to acknowledge many years of collaboration with Geert Molenberghs, James Roger and Harvey Goldstein.

James would like to thank Mike Elliott, Rod Little, Trivellore Raghunathan and Jeremy Taylor for facilitating a visit to the Institute for Social Research and Department of Biostatistics at the University of Michigan, Ann Arbor, in Summer 2011, when the majority of the first draft was written.

Thanks to Tim Collier for the anecdote in §1.3.

We also gratefully acknowledge funding support from the ESRC (3-year fellowship for James Carpenter, RES-063-27-0257, and follow-on funding RES-189-25-0103) and MRC (grants G0900724, G0900701 and G0600599).

We would also like to thank Richard Davies and Kathryn Sharples at Wiley for their encouragement and support.

Lastly, thanks to our families for their forbearance and understanding over the course of this project.

Despite the encouragement and support of those listed above, the text inevitably contains errors and shortcomings, for which we take full responsibility.

James Carpenter and Mike Kenward
London School of Hygiene & Tropical Medicine

Glossary

Indices and symbols

i

indexes units, often individuals, unless defined otherwise

j

indexes variables in the data set, unless defined otherwise

n

total number of units in the data set, unless defined otherwise

p

depending on context, number of variables in a data set or number of parameters in a statistical model

X,Y,Z

random variables

Y_{i,j}

ith observation on jth variable, i = 1,…,n, j = 1,…,p.

θ

generic parameter

θ (bold)

generic parameter column vector, typically p × 1

β, γ, δ

regression coefficients

β (bold)

column vector of regression coefficients, typically p × 1.

Matrices

Ω

matrix, typically of dimension p × p.

Ω_{i,j}

(i, j)th element of Ω

Ω^T

transpose of Ω, so that (Ω^T)_{i,j} = Ω_{j,i}.

Y_j = (Y_{1,j},…,Y_{n,j})^T

n × 1 column vector of observations on variable j.

tr(Ω)

sum of diagonal elements of Ω, i.e. tr(Ω) = Σ_i Ω_{i,i}, known as the trace of the matrix.

Abbreviations

AIPW

Augmented Inverse Probability Weighting

CAR

Censoring At Random

CNAR

Censoring Not At Random

EM

Expectation Maximisation

FCS

Full Conditional Specification

FEV1

Forced Expiratory Volume in 1 second (measured in litres)

FMI

Fraction of Missing Information

IPW

Inverse Probability Weighting

MAR

Missing At Random

MCAR

Missing Completely At Random

MI

Multiple Imputation

MNAR

Missing Not At Random

POD

Partially Observed Data

POM

Probability Of Missingness

S.E.

Standard error

Probability distributions

f(.)

probability distribution function

F(.)

cumulative distribution function

‘|’

to be verbalised ‘given’, as in f(Y | X), ‘the probability distribution function of Y given X’.

Part I

Foundations

Chapter 1

Introduction

Collecting, analysing and drawing inferences from data are central to research in the medical and social sciences. Unfortunately, for any number of reasons, it is rarely possible to collect all the intended data. The ubiquity of missing data, and the problems this poses for both analysis and inference, have spawned a substantial statistical literature dating from the 1950s. At that time, when statistical computing was in its infancy, many analyses were only feasible because of the carefully planned balance in the dataset (for example, the same number of observations on each unit). Missing data meant the available data for analysis were unbalanced, thus complicating the planned analysis and in some instances rendering it unfeasible. Early work on the problem was therefore largely computational (e.g. Healy and Westmacott, 1956; Afifi and Elashoff, 1966; Orchard and Woodbury, 1972; Dempster et al., 1977).

The wider question of the consequences of nontrivial proportions of missing data for inference was neglected until a seminal paper by Rubin (1976). This set out a typology for assumptions about the reasons for missing data, and sketched their implications for analysis and inference. It marked the beginning of a broad stream of research about the analysis of partially observed data. The literature is now huge, and continues to grow, both as methods are developed for large and complex data structures, and as increasing computer power and suitable software enable researchers to apply these methods.

For a broad overview of the literature, a good place to start is one of the recent excellent textbooks. Little and Rubin (2002) write for applied statisticians, giving a good overview of likelihood methods and an introduction to multiple imputation. Allison (2002) presents a less technical overview. Schafer (1997) is more algorithmic, focusing on the EM algorithm and imputation using the multivariate normal and general location model. Molenberghs and Kenward (2007) focus on clinical studies, while Daniels and Hogan (2008) focus on longitudinal studies with a Bayesian emphasis.

The above books concentrate on parametric approaches. However, there is also a growing literature based around using inverse probability weighting, in the spirit of Horvitz and Thompson (1952), and associated doubly robust methods. In particular, we refer to the work of Robins and colleagues (e.g. Robins et al., 1995; Scharfstein et al., 1999). Vansteelandt et al. (2009) give an accessible introduction to these developments. A comparison with multiple imputation in a simple setting is given by Carpenter et al. (2006). The pros and cons are debated in Kang and Schafer (2007) and the theory is brought together by Tsiatis (2006).

This book is concerned with a particular statistical method for analysing and drawing inferences from incomplete data, called Multiple Imputation (MI). Initially proposed by Rubin (1987) in the context of surveys, increasing awareness among researchers about the possible effects of missing data (e.g. Klebanoff and Cole, 2008) has led to an upsurge of interest (e.g. Sterne et al., 2009; Kenward and Carpenter, 2007; Schafer, 1999a; Rubin, 1996).

Multiple imputation (MI) is attractive because it is both practical and widely applicable. Recently developed statistical software (see, for example, issue 45 of the Journal of Statistical Software) has placed it within the reach of most researchers in the medical and social sciences, whether or not they have undertaken advanced training in statistics. However, the increasing use of MI in a range of settings beyond that originally envisaged has led to a bewildering proliferation of algorithms and software. Further, the implication of the underlying assumptions in the context of the data at hand is often unclear.

We are writing for researchers in the medical and social sciences with the aim of clarifying the issues raised by missing data, outlining the rationale for MI, explaining the motivation and relationship between the various imputation algorithms, and describing and illustrating its application to increasingly complex data structures.

Central to the analysis of partially observed data is an understanding of why the data are missing and the implications of this for the analysis. This is the focus of the remainder of this chapter. Introducing some of the examples that run through the book, we show how Rubin's typology (Rubin, 1976) provides the foundational framework for understanding the implications of missing data.

1.1 Reasons for missing data

In this section we consider possible reasons for missing data, illustrate these with examples, and draw some preliminary implications for inference. We use the word ‘possible’ advisedly, since with partially observed data we can rarely be sure of the mechanism giving rise to missing data. Instead, a range of possible mechanisms are consistent with the observed data. In practice, we therefore wish to analyse the data under different mechanisms, to establish the robustness of our inference in the face of uncertainty about the missingness mechanism.

All datasets consist of a series of units each of which provides information on a series of items. For example, in a cross-sectional questionnaire survey, the units would be individuals and the items their answers to the questions. In a household survey, the units would be households, and the items information about the household and members of the household. In longitudinal studies, units would typically be individuals while items would be longitudinal data from those individuals. In this book, units therefore correspond to the highest level in multilevel (i.e., hierarchical) data, and unless stated otherwise data from different units are statistically independent.

Within this framework, it is useful to distinguish between units where all the information is missing, termed unit nonresponse and units who contribute partial information, termed item nonresponse. The statistical issues are the same in both cases, and both can in principle be handled by MI. However, the main focus of this book is the latter.

Example 1.1 Mandarin tableau
Figure 1.1, which is also shown on the cover, shows part of the frontage of a senior mandarin's house in the New Territories, Hong Kong. We suppose interest focuses on characteristics of the figurines, for example their number, height, facial characteristics and dress. Unit nonresponse then corresponds to missing figurines, and item nonresponse to damaged—hence partially observed—figurines.

Figure 1.1 Detail from a senior mandarin's house front in New Territories, Hong Kong. Photograph by H. Goldstein.

1.2 Examples

We now introduce two key examples, which we return to throughout the book.

Example 1.2 Youth Cohort Study (YCS)
The Youth Cohort Study of England and Wales (YCS) is an ongoing UK government funded representative survey of pupils in England and Wales at school-leaving age (School year 11, age 16–17) (UK Data Archive, 2007). Each year that a new cohort is surveyed, detailed information is collected on each young person's experience of education and their qualifications as well as information on employment and training. A limited amount of information is collected on their personal characteristics, family, home circumstances, and aspirations.
Over the life-cycle of the YCS, different organisations have had responsibility for the structure and timings of data collection. Unfortunately, the documentation of older cohorts is poor. Croxford et al. (2007) have recently deposited a harmonised dataset that comprises YCS cohorts from 1984 to 2002 (UK Data Archive Study Number 5765). We consider data from pupils attending comprehensive schools from five YCS cohorts; these pupils reached the end of Year 11 in 1990, 1993, 1995, 1997 and 1999.
We explore relationships between Year 11 educational attainment (the General Certificate of Secondary Education) and key measures of social stratification. The units are pupils and the items are measurements on these pupils, and a nontrivial number of items are partially observed.
Example 1.3 Randomised controlled trial of patients with chronic asthma
We consider data from a 5-arm asthma clinical trial to assess the efficacy and safety of budesonide, a second-generation glucocorticosteroid, on patients with chronic asthma. 473 patients with chronic asthma were enrolled in the 12-week randomised, double-blind, multi-centre parallel-group trial, which compared the effect of a daily dose of 200, 400, 800 or 1600 mcg of budesonide with placebo.
Key outcomes of clinical interest include patients' peak expiratory flow rate (their maximum speed of expiration in litres/minute) and their Forced Expiratory Volume, FEV1, (the volume of air, in litres, the patient with fully inflated lungs can breathe out in one second). In summary, the trial found a statistically significant dose-response effect, at the 5% level, for the mean change from baseline over the study for morning peak expiratory flow, evening peak expiratory flow and FEV1.
Budesonide treated patients also showed reduced asthma symptoms and bronchodilator use compared with placebo, while there were no clinically significant differences in treatment related adverse experiences between the treatment groups. Further details about the conduct of the trial, its conclusions and the variables collected can be found elsewhere (Busse et al., 1998). Here, we focus on FEV1 and confine our attention to the placebo and lowest active dose arms. FEV1 was collected at baseline, then 2, 4, 8 and 12 weeks after randomisation. The intention was to compare FEV1 across treatment arms at 12 weeks. However, excluding 3 patients whose participation in the study was intermittent, only 37 out of 90 patients in the placebo arm, and 71 out of 90 patients in the lowest active dose arm, still remained in the trial at twelve weeks.

1.3 Patterns of missing data

It is very important to investigate the patterns of missing data before embarking on a formal analysis. This can throw up vital information that might otherwise be overlooked, and may even allow the missing data to be traced. For example, when analysing the new wave of a longitudinal survey, a colleague's careful examination of missing data patterns established that many of the missing questionnaires could be traced to a set of cardboard boxes. These turned out to have been left behind in a move. They were recovered and the data entered.

Most statistical software now has tools for describing the pattern of missing data. Key questions concern the extent and patterns of missing values, and whether the pattern is monotone (as described in the next paragraph), since, if it is, this can considerably speed up and simplify the analysis.

Missing data in a set of p variables are said to follow a monotone missingness pattern if the variables can be re-ordered such that, for every unit i and every pair of variables j < j′, if variable j is missing then variable j′ is also missing.

A natural setting for the occurrence of monotone missing data is a longitudinal study, where units are observed either until they are lost to follow-up, or the study concludes. A monotone pattern is thus inconsistent with interim missing data, where units are observed for a period, missing for the subsequent period, but then observed. Questionnaires may also give rise to monotone missing data patterns when individuals systematically answer each question in turn from the beginning till they either stop or complete the questionnaire. In other settings it may be possible to re-order items to achieve a monotone pattern.
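To make this concrete, the sketch below (plain Python with NumPy; the indicator matrices are hypothetical, not taken from any study) checks whether a response-indicator matrix is monotone, and tries the natural re-ordering that puts the most-observed variables first:

```python
import numpy as np

def is_monotone(R):
    """Check whether a 0/1 response-indicator matrix R (units x variables)
    has a monotone missingness pattern: within every unit, once a value
    is missing (0), all subsequent variables are missing too."""
    R = np.asarray(R)
    # np.minimum.accumulate gives, per row, 0 from the first 0 onwards;
    # the pattern is monotone iff each row already equals that running minimum.
    return bool((R == np.minimum.accumulate(R, axis=1)).all())

def try_monotone_order(R):
    """Attempt to re-order columns (most-observed first) to obtain a
    monotone pattern; returns the column order and whether it succeeds."""
    R = np.asarray(R)
    order = np.argsort(-R.sum(axis=0), kind="stable")  # most-observed first
    return order, is_monotone(R[:, order])

# A longitudinal-style dropout pattern: monotone as given.
R_drop = [[1, 1, 1], [1, 1, 0], [1, 0, 0]]
print(is_monotone(R_drop))          # True

# Interim missingness (observed, missing, observed): no ordering repairs it.
R_interim = [[1, 0, 1], [0, 1, 1], [1, 1, 0]]
order, ok = try_monotone_order(R_interim)
print(ok)                           # False
```

If a monotone ordering exists, sorting the variables by how often they are observed will find one, since in any monotone order earlier variables are observed at least as often as later ones; here the interim pattern cannot be repaired by any re-ordering.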

Example 1.2 Youth Cohort Study (ctd)
Table 1.1 shows the covariates we consider from the YCS. There are no missing data in the variables cohort and boy. The missingness pattern for GCSE score and the remaining two variables is shown in Table 1.2. In this example it is not possible to re-order the variables (items) to obtain a monotone pattern, due, for example, to pattern 3 (N = 697).

Table 1.1 YCS variables for exploring the relationship between Year 11 attainment and social stratification

Variable name   Description
cohort          year of data collection: 1990, 93, 95, 97, 99
boy             indicator variable for boys
occupation      parental occupation, categorised as managerial, intermediate or working
ethnicity       categorised as Bangladeshi, Black, Indian, other Asian, Other, Pakistani or White

Table 1.2 Pattern of missing values in the YCS data.

Example 1.3 Asthma study (ctd)
Table 1.3 shows the withdrawal pattern for the placebo and lowest active dose arms (all the patients are receiving their randomised medication). We have removed three patients with unusual interim missing data from Table 1.3 and all our analyses. The remaining missingness pattern is monotone in both treatment arms.

Table 1.3 Asthma study: withdrawal pattern by treatment arm.

1.3.1 Consequences of missing data

Our focus is the practical implications of missing data for both parameter estimation and inference. Unfortunately, the two are often conflated, so that a computational method for parameter estimation when data are missing is said to have ‘solved’ or ‘handled’ the missing data issue. Since, with missing data, computational methods only lead to valid inference under specific assumptions, this attitude is likely to lead to misleading inferences.

In this context, it may be helpful to draw an analogy with the sampling process used to collect the data. If an analyst is presented with a spreadsheet containing columns of numerical data, they can analyse the data (calculate means of variables, regress variables on each other and so forth). However, they cannot draw any inferences unless they are told how and from whom the data were collected. This information is external to the numerical values of the variables.

We may think of the missing data mechanism as a second stage in the sampling process, but one that is not under our control. It acts on the data we intended to collect and leaves us with a partially observed dataset. Once again, the missing data mechanism cannot usually be definitively identified from the observed data, although the observed data may indicate plausible mechanisms (e.g. response may be negatively correlated with age). Thus we will need to make an assumption about the missingness mechanism in order to draw inference. The process of making this assumption is quite separate from the statistical methods we use for parameter estimation etc. Further, to the extent that the missing data mechanism cannot be definitively identified from the data, we will often wish to check the robustness of our inferences to a range of missingness mechanisms that are consistent with the observed data. The reason this book focuses on the statistical method of MI is that it provides a computationally feasible approach to the analysis for a wide range of problems under a range of missingness mechanisms.

We therefore begin with a typology for the mechanisms causing, or generating, the missing data.

Later in this chapter we will see that consideration of these mechanisms in the context of the analysis at hand clarifies the assumptions under which a simple analysis, such as restriction to complete records, will be valid. It also clarifies when more sophisticated computational approaches such as MI will be valid and informs the way they are conducted. We stress again that the mechanism causing the missing data can rarely be definitively established. Thus we will often wish to explore the robustness of our inferences to a range of plausible missingness mechanisms—a process we call sensitivity analysis.

From a general standpoint, missing data may cause two problems: loss of efficiency and bias.

First, loss of efficiency, or information, is an inevitable consequence of missing data. Unfortunately, the extent of information loss is not directly linked to the proportion of incomplete records. Instead it is intrinsically linked to the analysis question. When crossing the road, the rear of the oncoming traffic is hidden from view—the data are missing. However, these missing data do not bear on the question at hand—will I make it across the road safely? While the proportion of missing data about each oncoming vehicle is substantial, information loss is negligible. Conversely, when estimating the prevalence of a rare disease, a small proportion of missing observations could have a disproportionate impact on the resulting estimate.

Faced with an incomplete dataset, most software automatically restricts analysis to complete records. As we illustrate below, the consequence of this for loss of information is not always easy to predict. Nevertheless, in many settings it will be important to include the information from partially complete records. Not least of the reasons for this is the time and money it has taken to collect even the partially complete records. Under certain assumptions about the missingness mechanism, we shall see that MI provides a natural way to do this.

Second, and perhaps more fundamentally, the subset of complete records may not be representative of the population under study. Restricting analysis to complete records may then lead to biased inference. The extent of such bias depends on the statistical behaviour of the missing data. A formal framework to describe this behaviour is thus fundamental. Such a framework was first elucidated in a seminal paper by Rubin (1976). To describe this, we need some definitions.

1.4 Inferential framework and notation

For clarity we take a frequentist approach to inference. This is not essential or necessarily desirable; indeed we will see that MI is essentially a Bayesian method, with good frequentist properties. Often, as Chapter 2 shows, establishing these frequentist properties formally is theoretically demanding.

For each unit i, write Yi = (Yi,1, …, Yi,p) for the p variables we intended to collect, and define the corresponding vector of response indicators

Ri = (Ri,1, …, Ri,p), where Ri,j = 1 if Yi,j is observed and Ri,j = 0 if it is missing.

1.1

The missing value mechanism is then formally defined as

Pr(Ri | Yi),

1.2

that is to say, the probability of observing unit i's data given their potentially unseen values Yi. It is important to note that, in what follows, we assume that unit i's data exist (or at least existed). In other words, had it been possible for us to be in the right place at the right time, we would have been able to observe the complete data. What (1.2) describes, therefore, is the probability that the data collection we were able to undertake on unit i yielded the values Yi,O. Thus (at least until we consider sensitivity analysis for clinical trials in Chapter 10) the missing data are not counter-factual, in the sense of what might have happened if a patient had taken a different drug from the one they actually took, or a child had gone to a different school from the one they actually attended.

Example 1.2 Youth Cohort Study (YCS) (ctd)
Here, underlying values of missing GCSE score, parental occupation and ethnicity exist, and given sufficient time and money we would be able to discover many of them.
Example 1.3 Asthma study (ctd)
Were resources not limited, researchers could have visited each patient in their home at each of the scheduled follow-up times to record their data.

1.4.1 Missing Completely At Random (MCAR)

We say data are Missing Completely At Random (MCAR) if the probability of a value being missing is unrelated to the observed and unobserved data on that unit. Algebraically,

Pr(Ri | Yi) = Pr(Ri).

1.3

Since, when data are MCAR, the chance of the data being missing is unrelated to the values, the observed data are therefore representative of the population. However, relative to the data we intended to collect, information has been lost.

Example 1.1 Mandarin tableau (ctd)
Suppose we wish to summarise facial characteristics of the figurines, e.g. average head circumference. If the missing heads are MCAR, a valid estimate is obtained from the observed heads. Although valid, it is imprecise relative to an estimate based on all the heads.
Before moving on, note that the MCAR assumption is made for a specific analysis. It is not a property of the tableau. It may be plausible to assume that headgear is MCAR, while heads may systematically be missing because of racial characteristics. Further, if we step back from the tableau, we may see that missing heads correspond to missing, or recently replaced, roof tiles. If so, the mechanism causing the missing data is clear; however, the assumption of MCAR is still likely to be appropriate, because the mechanism causing the missing data is unlikely to bear on (i.e., is likely statistically independent of) the analysis question.
Similarly, in certain settings we may find that the variables predictive of missing data are independent of the substantive analysis at hand. This is consistent with the MCAR assumption: analysis of the complete records will be unbiased, but some precision is lost.
Example 1.2 Youth Cohort Study (ctd)
If data are MCAR in the YCS study, valid inference would be obtained from the 55,145 complete records (Table 1.2). However, omitting the 8,110 individuals with partial information means inferences are less precise than they could be.
Example 1.3 Asthma study (ctd)
Assuming data are MCAR, a valid estimate of the overall mean in each group at 12 weeks is obtained by simply averaging the 37 available observations in the placebo group and the 71 available observations in the active group. This gives, respectively, 2.05 l (s.e. 0.09) and 2.23 l (s.e. 0.10).
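For the record, the arithmetic behind such an available-case estimate is simply the mean and standard error of the observed values. A minimal sketch, using made-up FEV1 values rather than the trial data:

```python
import math

def available_mean_se(values):
    """Mean and standard error from the available (non-missing) observations;
    valid as an estimate of the population mean under MCAR."""
    obs = [v for v in values if v is not None]
    n = len(obs)
    mean = sum(obs) / n
    var = sum((v - mean) ** 2 for v in obs) / (n - 1)  # sample variance
    return mean, math.sqrt(var / n)

# Hypothetical FEV1 values (litres) with missing visits marked as None;
# purely illustrative, not the asthma study data.
fev1 = [2.1, 1.9, None, 2.3, None, 2.0, 2.2]
mean, se = available_mean_se(fev1)
print(round(mean, 2), round(se, 3))
```

The same calculation applied separately to each treatment arm gives the figures quoted above.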

1.4.2 Missing At Random (MAR)

We say data are Missing At Random (MAR) if, given (or conditional on) the observed data, the probability distribution of Ri is independent of the unobserved data. Recalling that for individual i we can partition Yi as (Yi,O, Yi,M), we can express this mathematically as

Pr(Ri | Yi,O, Yi,M) = Pr(Ri | Yi,O).

1.4

This does not mean, as is sometimes supposed, that the probability of observing a variable on an individual is independent of the value of that variable. Quite the contrary: under MAR the chance of observing a variable may well depend on its value. Crucially, though, given the observed data this dependence is broken. Consider the following example.

Example 1.4 Income and job type
Suppose we survey 100 employees each of job types A and B for their income, 200 in all. Only 157 reveal their income, as shown in Figure 1.2. The figure shows that employees with higher incomes are less likely to divulge them: the probability of observing a variable depends on its value. However, if within job type A the probability of observing income does not depend on income, and within job type B the probability of observing income does not depend on income, then income is missing at random dependent on job type.

Figure 1.2 Plot of 200 hypothetical incomes against job type.

The immediate consequence of this is that the mean of the observed incomes, marginal to (or aggregating over) job type, is biased downwards. The data were generated with a mean income of £ 60,000 in job type A and £ 30,000 in job type B, so that the true mean income is £ 45,000. The mean of the 157 observed incomes falls well below this, because high earners are under-represented among the respondents.

We note three further points. First, if within job type the probability of observing income does not depend on income, it follows that the distribution of income given job type is the same whether or not income is observed:

f(income | job type, income observed) = f(income | job type).

1.6

Of course, were the missing incomes observed, we could 'test the MAR assumption' in two ways. First, through a logistic regression of the response indicator on job type and income, for example

logit Pr(Ri = 1) = α0 + α1[job type B] + α2[income];

if MAR is true then the hypothesis α2 = 0 is true. Or, we could fit a corresponding regression of income on job type and the response indicator,

E(income) = β0 + β1[job type B] + β2[Ri];

if MAR is true then the hypothesis β2 = 0 is true.
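A small simulation makes these points concrete. The within-type standard deviation of £5,000, the response rates of 0.3 and 0.9, and the sample size are all hypothetical choices for illustration, not figures from the text:

```python
import random
random.seed(1)

# Incomes with means £60,000 (type A) and £30,000 (type B); within each job
# type the chance of responding does not depend on income, so income is MAR
# given job type.
N = 100_000  # large n per type, so averages settle near their expectations
records = []
for job in ("A", "B"):
    for _ in range(N):
        income = random.gauss(60_000 if job == "A" else 30_000, 5_000)
        p_respond = 0.3 if job == "A" else 0.9   # high earners respond less
        records.append((job, income, random.random() < p_respond))

observed = [inc for _, inc, seen in records if seen]
naive = sum(observed) / len(observed)            # biased towards type B

# Estimate within job type, then average over the known type proportions.
def type_mean(j):
    vals = [inc for job, inc, seen in records if seen and job == j]
    return sum(vals) / len(vals)

mar_estimate = 0.5 * type_mean("A") + 0.5 * type_mean("B")
print(round(naive), round(mar_estimate))
```

The marginal mean of the observed incomes sits well below £45,000, while the estimate that conditions on job type recovers the true mean, exactly as the MAR argument predicts.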

This simple example draws out the following general points:

1. statements relating the probability of observing data to the values of data have direct consequences for conditional distributions of the data, and
2. under the MAR assumption, the precise missing data mechanism need not be specified; indeed the precise form can be different for different individuals.

These two points together mean that the MAR mechanism is much more subtle than might at first appear; these subtleties can manifest themselves unexpectedly.

Example 1.4 Income and job type (ctd)
Suppose the mechanism causing the missing income differed for each of the 200 individuals, that is

Pr(Ri = 1 | job typei, incomei) = pi(job typei), i = 1, …, 200,

for 200 different functions pi depending on job type alone. Then missing data are still MAR, and (1.7) is still a valid estimate.

Of course, it may be as contrived to think each individual has their own MAR mechanism as to think that the same mechanism holds for all. In a simple example this is not important, but in real applications a blanket assumption of MAR may be very contrived.

Example 1.5 Subtlety of MAR assumption
Suppose we have three variables, Yi,1, Yi,2, Yi,3, and we are unfortunate, so that our dataset contains nontrivial numbers of all possible missingness patterns, as shown in Table 1.4.

Table 1.4 Three variables: all possible missing value patterns

If the same missingness mechanism applies to all the units, and it is either MAR or MCAR, then it must be MCAR. If we wish to assume data are MAR, we are forced to split the data into groups among which different MAR mechanisms are operating. These groups need not necessarily be defined by the missing data patterns; they could be defined by characteristics of the units. Settings like this are considered by Harel and Schafer (2009). To illustrate, though, we define groups by the missing data patterns.
For a MAR mechanism, we might assume the following:
in patterns (1, 2), Yi,3 is MAR given Yi,1, Yi,2;
in patterns (3, 4, 7), Yi,1 and/or Yi,2 is MAR given Yi,3; and
in patterns (5, 6), data are MCAR.
In practice, often a relatively small number of the possible missingness patterns predominate, and it is assumptions about these that are important for any analysis. The remaining—relatively infrequent—patterns can often be assumed MCAR, with little risk to the final inference if this assumption is in fact wrong.
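Tabulating the observed patterns, as in Tables 1.2 and 1.4, is straightforward. A sketch with a hypothetical three-variable dataset, where None marks a missing value:

```python
from collections import Counter

def pattern_table(rows):
    """Tabulate the missingness patterns in a list of records (None = missing):
    each pattern is a tuple of O (observed) / M (missing) flags, counted over
    units and listed most frequent first."""
    counts = Counter(tuple("O" if v is not None else "M" for v in row)
                     for row in rows)
    return counts.most_common()

# Made-up records, purely to show the tabulation.
data = [
    (1.2, 3.4, 5.6),
    (1.1, None, 5.0),
    (1.2, 3.4, 5.6),
    (None, None, 2.2),
]
for pattern, n in pattern_table(data):
    print(" ".join(pattern), n)
```

Sorting the patterns by frequency shows at a glance which patterns predominate, and hence which assumptions matter for the analysis.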

Faced with complex data, there is a temptation to invoke the MAR assumption too readily, especially as this simplifies any analysis using MI. To guard against this, analysts need to be satisfied that any associations assumed to justify the MAR assumption are at least consistent with the observed data. Since consideration of selection mechanisms may not be as straightforward as might first appear, it can also be worth considering the plausibility of MAR from the point of view of the joint and conditional distribution of the data. As (1.6) illustrates, for MAR we need to be satisfied

1. that conditional distributions of partially observed variables given fully observed variables do not differ depending on whether the data are observed, and
2. in consequence the joint distribution of the data can be validly estimated by piecing together the marginal distributions of the observed patterns.

The above discussion explains why we do not regard the MAR assumption as a panacea, but nevertheless often both a plausible and practical starting point for the analysis of partially observed data. In particular, the points drawn out of Example 1.4 are not specific to either the number or type of variables (categorical or quantitative).

Example 1.1 Mandarin tableau (ctd)
Here the MAR assumption says that the distribution of head characteristics given body characteristics (i.e., dress, height, etc.) does not depend on whether the head is present. Thus, under MAR we can estimate the distribution of characteristics of figurines with missing heads from figurines with similar body characteristics.
Notice the two rightmost figurines in Figure 1.1 share the same necktie. Assuming headdress is MAR given necktie, the missing headdress on the rightmost figurine is similar to that on the second rightmost figurine.
Clearly this assumption cannot be checked from the tableau (data) at hand. However it might be possible to explore it using other tableaux (i.e., other datasets). If MAR is plausible for headdress given necktie, it does not mean it is plausible for skin colour given necktie. In other words MAR is an assumption we make for the analysis, not a characteristic of the dataset. For some analyses of partially observed data it may be plausible; for others not.

1.4.3 Missing Not At Random (MNAR)

If the mechanism causing missing data is neither MCAR nor MAR, we say it is Missing Not At Random (MNAR). Under a MNAR mechanism, the probability of an observation being missing depends on the underlying value, and this dependence remains even given the observed data. Mathematically,

Pr(Ri | Yi,O, Yi,M) ≠ Pr(Ri | Yi,O).

1.8

While in some settings MNAR may be more plausible than MAR, analysis under MNAR is considerably harder. This is because under MAR, equation (1.6) showed that conditional distributions of partially observed variables given fully observed variables are the same in units who do, and do not, have the data observed. However (1.6) does not hold if (1.8) holds.

It follows that inference under MNAR involves an explicit specification of either the selection mechanism, or how conditional distributions of partially observed variables given fully observed variables differ between units who do, and do not, have the data observed.

Formally, we can write the joint distribution of unit i's variables, Yi, and the indicator for observing those variables, Ri as

Pr(Ri | Yi) f(Yi) = f(Yi, Ri) = f(Yi | Ri) Pr(Ri).

1.9

In the centre is the joint distribution, and this can be written either as

1. a selection model—the LHS of (1.9), i.e., a product of (i) the conditional probability of observing the variables, given their values and (ii) the marginal distribution of the data, OR
2. a pattern mixture model—the RHS of (1.9), i.e., a product of (i) the probability distribution of the data within each missingness pattern and (ii) the marginal probability of the missingness pattern.

Thus we can specify a MNAR mechanism either by specifying the selection model (which implies the pattern mixture model) or by specifying a pattern mixture model (which implies a selection model). Depending on the context, both approaches may be helpful. Unfortunately, even in apparently simple settings, explicitly calculating the selection implication of a pattern mixture model, or vice versa, can be awkward. We shall see in Chapter 10 that an advantage of multiple imputation is that, given a pattern mixture model, we can estimate the selection model implications quite easily.
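The equivalence of the two factorisations in (1.9) can be illustrated numerically. The joint probabilities below are made-up values for a single binary variable Y and its response indicator R:

```python
# A toy joint distribution for binary Y (value) and R (observed).
joint = {  # (y, r) -> probability
    (0, 0): 0.10, (0, 1): 0.40,
    (1, 0): 0.20, (1, 1): 0.30,
}

p_y = {y: joint[(y, 0)] + joint[(y, 1)] for y in (0, 1)}  # marginal of Y
p_r = {r: joint[(0, r)] + joint[(1, r)] for r in (0, 1)}  # marginal of R

# Selection model: Pr(R = r | Y = y) * Pr(Y = y)
selection = {(y, r): (joint[(y, r)] / p_y[y]) * p_y[y] for y, r in joint}
# Pattern mixture model: Pr(Y = y | R = r) * Pr(R = r)
mixture = {(y, r): (joint[(y, r)] / p_r[r]) * p_r[r] for y, r in joint}

# Both factorisations reproduce the same joint distribution.
assert all(abs(selection[k] - joint[k]) < 1e-12 for k in joint)
assert all(abs(mixture[k] - joint[k]) < 1e-12 for k in joint)

# Here the selection probabilities depend on y, so the mechanism is MNAR:
print(joint[(0, 1)] / p_y[0], joint[(1, 1)] / p_y[1])  # Pr(R=1|Y=0), Pr(R=1|Y=1)
```

In this toy MNAR setting Pr(R = 1 | Y = 0) = 0.8 but Pr(R = 1 | Y = 1) = 0.6, and correspondingly the distribution of Y differs between the observed and missing patterns: the two views of (1.9) describe the same mechanism.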

Once again, as the example below shows, MNAR is an assumption for the analysis, not a characteristic of the data.

Example 1.1 Mandarin tableau (ctd)
It may be that the figurines with missing heads were wearing a head dress that identified them as a member of a class, or group, that subsequently became very unpopular—causing the heads to be smashed. This MNAR selection mechanism means that we cannot say anything about the typical characteristics of head dress without making untestable assumptions about the characteristics of the missing head dresses. Further, the MNAR assumption implies that the distribution of head dress given body dress is different for figurines with missing and observed heads.
We reiterate, under MNAR any summary statistics, or analyses, require either explicit assumptions about the form of the distribution of the missing data given the observed or explicit specification of the selection mechanism and the marginal distribution of the full (including unobserved) data. Contrast this with analyses assuming MAR, where these assumptions are made implicitly.
We repeat a point from the tableau: if head dress was the trigger for missing heads, but the type of head dress worn is not related to physical characteristics of the heads, analyses concerning their physical characteristics could be validly performed under MAR. Just because the heads are MNAR does not mean all analyses require the MNAR assumption. This underlines that, in applications, it is crucial to think carefully about the selection mechanism, and how it affects the analysis question.

The example above illustrates that when data are MNAR, instead of thinking about the selection mechanism, it is equally appropriate to consider differences between conditional distributions of partially observed given fully observed variables. Under MAR such distributions do not differ depending on whether data are missing or not; under MNAR they do. Considering the conditional distribution of the observed data, and then exploring the robustness of inference as it is allowed to differ in the unobserved data, is therefore a natural way to explore the robustness of inference to an assumption of MAR. From our perspective it has two further advantages: (i) the differences can be expressed simply and pictorially, and (ii) MI provides a natural route for inference. Unfortunately, the selection counterparts, or implications, of pattern mixture models are rarely easy to calculate directly, but again MI can help: after imputing missing data under a pattern mixture model, it is straightforward to explore implications for the implied selection model.

1.4.4 Ignorability

If, under a specific assumption about the missingness mechanism, we can construct a valid analysis that does not require us to explicitly include the model for that missing value mechanism, we term the mechanism, in the context of this analysis, ignorable.

A common example of this is a likelihood based analysis assuming MAR.

However, as we see below there are other settings, where we do not assume MAR, that do not require us to explicitly include the model for the missingness mechanism yet still result in valid inference. For example, as discussed in Section 1.6.2, a complete records regression analysis is valid if data are MNAR dependent only on the covariates.
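A hypothetical simulation illustrates that claim: whole records go missing with probability depending on the covariate's own value (so the covariate is MNAR), yet least squares on the complete records recovers the coefficients, which are 2.0 and 1.0 in this made-up model. The model and observation probabilities are illustrative assumptions:

```python
import random
random.seed(2)

# Records are observed with probability depending on x alone; missingness
# depends on the covariate's value, not on y given x.
n = 200_000
xs, ys = [], []
for _ in range(n):
    x = random.gauss(0, 1)
    y = 1.0 + 2.0 * x + random.gauss(0, 1)
    if random.random() < (0.9 if x > 0 else 0.2):   # observed, given x only
        xs.append(x)
        ys.append(y)

# Ordinary least squares on the complete records.
m = len(xs)
xbar, ybar = sum(xs) / m, sum(ys) / m
slope = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / \
        sum((x - xbar) ** 2 for x in xs)
intercept = ybar - slope * xbar
print(round(slope, 2), round(intercept, 2))   # close to the true (2.0, 1.0)
```

The key is that, conditional on x, the distribution of y is the same among observed and unobserved records, so the complete-records regression of y on x is unbiased even though the records are not a representative sample of x.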

1.5 Using observed data to inform assumptions about the missingness mechanism

We have already noted that, given the observed data, we cannot definitively identify the missingness mechanism. Nevertheless, the observed data can help frame plausible assumptions about this—in other words assumptions which are consistent with the observed data. Exploratory analyses of this nature are important for (i) assessing whether a complete records analysis is likely to be biased and (ii) framing appropriate imputation models. Two key tools for this are summaries (tabular or graphical) of fully observed, or near-fully observed variables by missingness pattern and logistic regression of missingness indicators on observed, or near-fully observed variables.
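The first of these tools can be sketched in a few lines: tabulate the proportion of missing values in a partially observed variable by the levels of a fully observed one. The cohort values and scores below are hypothetical, purely to show the tabulation:

```python
# Records of (cohort, score), with None marking a missing score.
records = [
    (1990, 32.0), (1990, None), (1990, 41.5),
    (1999, None), (1999, None), (1999, 50.0),
]

# Count units and missing scores within each level of the observed variable.
by_level = {}
for cohort, score in records:
    n, miss = by_level.get(cohort, (0, 0))
    by_level[cohort] = (n + 1, miss + (score is None))

for cohort, (n, miss) in sorted(by_level.items()):
    print(cohort, f"{miss}/{n} missing ({100 * miss / n:.0f}%)")
```

A marked difference in missingness rates across levels, as here, is evidence against MCAR and suggests the fully observed variable should condition any MAR assumption (and enter any imputation model).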

Example 1.3 Asthma study (ctd)
Table 1.5 shows the mean FEV1 by dropout pattern. In the placebo arm, patterns 3 and 4 have lower FEV1 at baseline, and for patterns 2–4 FEV1 declines from baseline to last visit. In the active arm, patterns 1 and 2 show a similar increase of about 0.20 l, pattern 3 starts higher and shows little change, and pattern 4 shows a marked decline. Notice also the increase in variance over time in the active arm, which differs from the placebo arm. This is a common feature of such data, and should be reflected in the analysis.

Table 1.5 Asthma study: mean FEV1 (litres) at each visit, by dropout pattern and intervention arm.

MAR mechanisms that are dependent on treatment and response are consistent with these data. However, there is a suspicion that further decline between the last observed and first missing visit triggered withdrawal, probably followed in the placebo arm by switching to an active treatment. Thus it would be useful to explore sensitivity of treatment inferences to MNAR, which we do in Chapter 10.

1.6 Implications of missing data mechanisms for regression analyses

Usually, we will wish to fit some form of regression model to address our substantive questions. Here, we look at the implications, in terms of bias and loss of information, of missing data in the response and/or covariates under different missingness mechanisms. We first focus on linear regression; our findings there hold for most other regression models, including relative risk regression and survival analysis. Logistic regression is more subtle; we discuss this in Section 1.6.4.

1.6.1 Partially observed response

Suppose we wish to fit the model

Yi = β0 + β1xi + ei,  ei ~ N(0, σ2) independently,

1.12

but Y is partially observed. Let Ri indicate whether Yi is observed. For now assume that the xi are known without error; for example, they may be design variables. Then the contribution to the likelihood for (β0, β1, σ2) from unit i, conditional on xi, is

f(Yi, Ri | xi) = Pr(Ri | Yi, xi) f(Yi | xi),

1.13

where Yi is integrated out of the right-hand side when Ri = 0.

Assume, as will typically be the case, that the parameters of f(Yi | xi), namely (β0, β1, σ2), are distinct from the parameters of the missingness mechanism Pr(Ri | Yi, xi).

Equation (1.13) suggests that, provided Y is MAR given the covariates in the model, units with missing response have no information about β. To see this formally, first observe that as Yi