A practical guide to analysing partially observed data.
Collecting, analysing and drawing inferences from data is central to research in the medical and social sciences. Unfortunately, it is rarely possible to collect all the intended data. The literature on inference from the resulting incomplete data is now huge, and continues to grow both as methods are developed for large and complex data structures, and as increasing computer power and suitable software enable researchers to apply these methods.
This book focuses on a particular statistical method for analysing and drawing inferences from incomplete data, called Multiple Imputation (MI). MI is attractive because it is both practical and widely applicable. The authors' aim is to clarify the issues raised by missing data, describing the rationale for MI, the relationship between the various imputation models and their associated algorithms, and its application to increasingly complex data structures.
Multiple Imputation and its Application is aimed at quantitative researchers and students in the medical and social sciences. It clarifies the issues raised by the analysis of incomplete data, outlines the rationale for MI, and describes how to consider and address the issues that arise in its application.
Number of pages: 576
Year of publication: 2012
Table of Contents
Statistics in Practice
Title Page
Copyright
Preface
Data Acknowledgements
Acknowledgements
Glossary
Part I: Foundations
Chapter 1: Introduction
1.1 Reasons for missing data
1.2 Examples
1.3 Patterns of missing data
1.4 Inferential framework and notation
1.5 Using observed data to inform assumptions about the missingness mechanism
1.6 Implications of missing data mechanisms for regression analyses
1.7 Summary
Chapter 2: The Multiple Imputation Procedure and Its Justification
2.1 Introduction
2.2 Intuitive outline of the MI procedure
2.3 The generic MI procedure
2.4 Bayesian justification of MI
2.5 Frequentist inference
2.6 Choosing the number of imputations
2.7 Some simple examples
2.8 MI in more general settings
2.9 Constructing congenial imputation models
2.10 Practical considerations for choosing imputation models
2.11 Discussion
Part II: Multiple Imputation for Cross-Sectional Data
Chapter 3: Multiple Imputation of Quantitative Data
3.1 Regression imputation with a monotone missingness pattern
3.2 Joint modelling
3.3 Full conditional specification
3.4 Full conditional specification versus joint modelling
3.5 Software for multivariate normal imputation
3.6 Discussion
Chapter 4: Multiple Imputation of Binary and Ordinal Data
4.1 Sequential imputation with monotone missingness pattern
4.2 Joint modelling with the multivariate normal distribution
4.3 Modelling binary data using latent normal variables
4.4 General location model
4.5 Full conditional specification
4.6 Issues with over-fitting
4.7 Pros and cons of the various approaches
4.8 Software
4.9 Discussion
Chapter 5: Multiple Imputation of Unordered Categorical Data
5.1 Monotone missing data
5.2 Multivariate normal imputation for categorical data
5.3 Maximum indicant model
5.4 General location model
5.5 FCS with categorical data
5.6 Perfect prediction issues with categorical data
5.7 Software
5.8 Discussion
Chapter 6: Nonlinear Relationships
6.1 Passive imputation
6.2 No missing data in nonlinear relationships
6.3 Missing data in nonlinear relationships
6.4 Discussion
Chapter 7: Interactions
7.1 Interaction variables fully observed
7.2 Interactions of categorical variables
7.3 General nonlinear relationships
7.4 Software
7.5 Discussion
Part III: Advanced Topics
Chapter 8: Survival Data, Skips and Large Datasets
8.1 Time-to-event data
8.2 Nonparametric, or ‘hot deck’ imputation
8.3 Multiple imputation for skips
8.4 Two-stage MI
8.5 Large datasets
8.6 Multiple imputation and record linkage
8.7 Measurement error
8.8 Multiple imputation for aggregated scores
8.9 Discussion
Chapter 9: Multilevel Multiple Imputation
9.1 Multilevel imputation model
9.2 MCMC algorithm for imputation model
9.3 Imputing level-2 covariates using FCS
9.4 Individual patient meta-analysis
9.5 Extensions
9.6 Discussion
Chapter 10: Sensitivity Analysis: MI Unleashed
10.1 Review of MNAR modelling
10.2 Framing sensitivity analysis
10.3 Pattern mixture modelling with MI
10.4 Pattern mixture approach with longitudinal data via MI
10.5 Piecing together post-deviation distributions from other trial arms
10.6 Approximating a selection model by importance weighting
10.7 Discussion
Chapter 11: Including Survey Weights
11.1 Using model based predictions
11.2 Bias in the MI variance estimator
11.3 A multilevel approach
11.4 Further developments
11.5 Discussion
Chapter 12: Robust Multiple Imputation
12.1 Introduction
12.2 Theoretical background
12.3 Robust multiple imputation
12.4 Simulation studies
12.5 The RECORD study
12.6 Discussion
Appendix A: Markov Chain Monte Carlo
Appendix B: Probability Distributions
B.1 Posterior for the multivariate normal distribution
Bibliography
Index of Authors
Index of Examples
Index
Statistics in Practice
Series Advisory Editors
Marian Scott
University of Glasgow, UK
Stephen Senn
CRP-Santé, Luxembourg
Wolfgang Jank
University of Maryland, USA
Founding Editor
Vic Barnett
Nottingham Trent University, UK
Statistics in Practice is an important international series of texts which provide detailed coverage of statistical concepts, methods and worked case studies in specific fields of investigation and study.
With clear explanations and many worked practical examples, the books show in down-to-earth terms how to select and use an appropriate range of statistical techniques in a particular practical field within each title's special topic area.
The books provide statistical support for professionals and research workers across a range of fields of employment and research environments. Subject areas covered include medicine and pharmaceuticals; industry, finance and commerce; public services, and the earth and environmental sciences.
The books also provide support to students studying applied statistics courses in these areas. The demand for applied statistics graduates in these areas has led to such courses becoming increasingly prevalent at universities and colleges.
It is our aim to present judiciously chosen and well-written textbooks to meet everyday practical needs. Feedback from readers will be valuable in monitoring our success.
This edition first published 2013
© 2013 John Wiley & Sons, Ltd
Registered office
John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, United Kingdom
For details of our global editorial offices, for customer services and for information about how to apply for permission to reuse the copyright material in this book please see our website at www.wiley.com.
The right of the author to be identified as the author of this work has been asserted in accordance with the Copyright, Designs and Patents Act 1988.
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by the UK Copyright, Designs and Patents Act 1988, without the prior permission of the publisher.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.
Designations used by companies to distinguish their products are often claimed as trademarks. All brand names and product names used in this book are trade names, service marks, trademarks or registered trademarks of their respective owners. The publisher is not associated with any product or vendor mentioned in this book. This publication is designed to provide accurate and authoritative information in regard to the subject matter covered. It is sold on the understanding that the publisher is not engaged in rendering professional services. If professional advice or other expert assistance is required, the services of a competent professional should be sought.
Library of Congress Cataloging-in-Publication Data
Carpenter, James R.
Multiple imputation and its application / James R. Carpenter, Michael G. Kenward. — 1st ed.
p. ; cm.
Includes bibliographical references and index.
ISBN 978-0-470-74052-1 (hardback)
I. Kenward, Michael G., 1956- II. Title.
[DNLM: 1. Data Interpretation, Statistical. 2. Biomedical Research – methods. WA 950]
610.72′4–dc23
2012028821
A catalogue record for this book is available from the British Library.
ISBN: 978-0-470-74052-1
Cover photograph courtesy of Harvey Goldstein
Preface
No study of any complexity manages to collect all the intended data. Analysis of the resulting partially collected data must therefore address the issues raised by the missing data. Unfortunately, the inferential consequences of missing data are not simply restricted to the proportion of missing observations. Instead, the interplay between the substantive questions and the reasons for the missing data is crucial. Thus, there is no simple, universal, solution.
Suppose, for the substantive question at hand, the inferential consequences of missing data are nontrivial. Then the analyst must make a set of assumptions about the reasons, or mechanisms, causing data to be missing, and perform an inferentially valid analysis under these assumptions. In this regard, analysis of a partially observed dataset is the same as any statistical analysis; the difference is that when data are missing we cannot assess the validity of these assumptions in the way we might do in a regression analysis, for example. Hence, sensitivity analysis, where we explore the robustness of inference to different assumptions about the reasons for missing data, is important.
Given a set of assumptions about the reasons data are missing, there are a number of statistical methods for carrying out the analysis. These include the EM algorithm, inverse probability weighting, a full Bayesian analysis and, depending on the setting, a direct application of maximum likelihood. These methods, and those derived from them, each have their own advantages in particular settings. Nevertheless, we argue that none shares the practical utility, broad applicability and relative simplicity of Rubin's Multiple Imputation (MI).
Following an introductory chapter outlining the issues raised by missing data, the focus of this book is therefore MI. We outline its theoretical basis, and then describe its application to a range of common analyses in the medical and social sciences, reflecting the wide application that MI has seen in recent years. In particular, we describe its application with nonlinear relationships and interactions, with survival data and with multilevel data. The last three chapters consider practical sensitivity analyses, combining MI with inverse probability weighting, and doubly robust MI.
Self-evidently, a key component of an MI analysis is the construction of an appropriate method of imputation. There is no unique, ideal way in which this should be done. In particular, there has been some discussion in the literature about the relative merits of the joint modelling and full conditional specification approaches. We have found that thinking in terms of joint models is both natural and convenient for formulating imputation models, a range of which can then be (approximately) implemented using a full conditional specification approach. Differences in computational speed between joint modelling and full conditional specification are generally due to coding efficiency, rather than the intrinsic superiority of one method over the other.
Throughout the book we illustrate the ideas with several examples. The code used for these examples, in various software packages, is available from the book's home page, which is at http://www.wiley.com/go/multiple_imputation, together with exercises to go with each chapter.
We welcome feedback from readers; any comments and corrections should be e-mailed to [email protected]. Unfortunately, we cannot promise to respond individually to each message.
Data Acknowledgements
We are grateful to the following:
In Chapters 1, 5, 8, 10 and 11 we have analysed data from the Youth Cohort Time Series for England, Wales and Scotland, 1984-2002 First Edition, Colchester, Essex, published by and freely available from the UK Data Archive, Study Number SN 5765. We thank Vernon Gayle for introducing us to these data.
In Chapter 6 we have analysed data from the Alzheimer's Disease Neuro-imaging Initiative (ADNI) database (adni.loni.ucla.edu). As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in the analysis or in the writing of this book. A complete listing of ADNI investigators can be found at: http://adni.loni.ucla.edu/wp-content/uploads/how_to_apply/ADNI_Acknowledgement_List.pdf.
The ADNI was launched in 2003 by the National Institute on Aging (NIA), the National Institute of Biomedical Imaging and Bioengineering (NIBIB), the Food and Drug Administration (FDA), private pharmaceutical companies and nonprofit organisations, as a $60 million, five-year public-private partnership. The primary goal of ADNI has been to test whether serial magnetic resonance imaging (MRI), positron emission tomography (PET), other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of mild cognitive impairment (MCI) and early Alzheimer's disease (AD). Determination of sensitive and specific markers of very early AD progression is intended to aid researchers and clinicians to develop new treatments and monitor their effectiveness, as well as lessen the time and cost of clinical trials.
The Principal Investigator of this initiative is Michael W. Weiner, MD, VA Medical Center and University of California San Francisco. ADNI is the result of the efforts of many co-investigators from a broad range of academic institutions and private corporations, and subjects have been recruited from over 50 sites across the US and Canada. The initial goal of ADNI was to recruit 800 adults, ages 55 to 90, to participate in the research, with approximately 200 cognitively normal older individuals to be followed for 3 years, 400 people with MCI to be followed for 3 years and 200 people with early AD to be followed for 2 years. For up-to-date information, see www.adni-info.org.
Data collection and sharing for this project was funded by the Alzheimer's Disease Neuroimaging Initiative (ADNI) (National Institutes of Health Grant U01 AG024904). ADNI is funded by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, and through generous contributions from the following: Abbott; Alzheimer's Association; Alzheimer's Drug Discovery Foundation; Amorfix Life Sciences Ltd.; AstraZeneca; Bayer HealthCare; BioClinica, Inc.; Biogen Idec, Inc.; Bristol-Myers Squibb Company; Eisai, Inc.; Elan Pharmaceuticals, Inc.; Eli Lilly and Company; F. Hoffmann-La Roche Ltd and its affiliated company Genentech, Inc.; GE Healthcare; Innogenetics, N.V.; IXICO Ltd.; Janssen Alzheimer Immunotherapy Research & Development, LLC; Johnson & Johnson Pharmaceutical Research & Development, LLC; Medpace, Inc.; Merck & Co., Inc.; Meso Scale Diagnostics, LLC; Novartis Pharmaceuticals Corporation; Pfizer, Inc.; Servier; Synarc, Inc.; and Takeda Pharmaceutical Company. The Canadian Institutes of Health Research is providing funds to support ADNI clinical sites in Canada. Private sector contributions are facilitated by the Foundation for the National Institutes of Health (www.fnih.org). The grantee organisation is the Northern California Institute for Research and Education, and the study is coordinated by the Alzheimer's Disease Cooperative Study at the University of California, San Diego. ADNI data are disseminated by the Laboratory for Neuro-imaging at the University of California, Los Angeles. This research was also supported by NIH grants P30 AG010129 and K01 AG030514.
In Chapter 7 we have analysed data from the 1958 National Childhood Development Study. This is published, and freely available from the UK Data Archive, Study Number SN 5565 (waves 0–3) and SN 5566 (wave 4). We thank Ian Plewis for introducing us to these data.
Acknowledgements
No book of this kind is written in a vacuum, and we are grateful to many friends and colleagues for research collaborations, stimulating discussions and comments on draft chapters.
In particular we would like to thank members of the Missing Data Imputation and Analysis (MiDIA) group, including (in alphabetical order) Jonathan Bartlett, John Carlin, Rhian Daniel, Dan Jackson, Shaun Seaman, Jonathan Sterne, Kate Tilling and Ian White.
We would also like to acknowledge many years of collaboration with Geert Molenberghs, James Roger and Harvey Goldstein.
James would like to thank Mike Elliott, Rod Little, Trivellore Raghunathan and Jeremy Taylor for facilitating a visit to the Institute for Social Research and Department of Biostatistics at the University of Michigan, Ann Arbor, in Summer 2011, when the majority of the first draft was written.
Thanks to Tim Collier for the anecdote in §1.3.
We also gratefully acknowledge funding support from the ESRC (3-year fellowship for James Carpenter, RES-063-27-0257, and follow-on funding RES-189-25-0103) and MRC (grants G0900724, G0900701 and G0600599).
We would also like to thank Richard Davies and Kathryn Sharples at Wiley for their encouragement and support.
Lastly, thanks to our families for their forbearance and understanding over the course of this project.
Despite the encouragement and support of those listed above, the text inevitably contains errors and shortcomings, for which we take full responsibility.
James Carpenter and Mike Kenward
London School of Hygiene & Tropical Medicine
Glossary
i: indexes units, often individuals, unless defined otherwise
j: indexes variables in the data set, unless defined otherwise
n: total number of units in the data set, unless defined otherwise
p: depending on context, number of variables in a data set or number of parameters in a statistical model
X, Y, Z: random variables
Y_{i,j}: the i-th observation on the j-th variable, i = 1, …, n, j = 1, …, p
θ: generic parameter
θ (bold): generic parameter column vector, typically p × 1
β, γ, δ: regression coefficients
β (bold): column vector of regression coefficients, typically p × 1
Ω: matrix, typically of dimension p × p
Ω_{i,j}: the (i, j)-th element of Ω
Ω^T: transpose of Ω, so that (Ω^T)_{i,j} = Ω_{j,i}
Y_j = (Y_{1,j}, …, Y_{n,j})^T: n × 1 column vector of observations on variable j
tr(Ω): sum of the diagonal elements of Ω, known as the trace of the matrix
AIPW: Augmented Inverse Probability Weighting
CAR: Censoring At Random
CNAR: Censoring Not At Random
EM: Expectation Maximisation
FCS: Full Conditional Specification
FEV1: Forced Expiratory Volume in 1 second (measured in litres)
FMI: Fraction of Missing Information
IPW: Inverse Probability Weighting
MAR: Missing At Random
MCAR: Missing Completely At Random
MI: Multiple Imputation
MNAR: Missing Not At Random
POD: Partially Observed Data
POM: Probability Of Missingness
S.E.: Standard error
f(.): probability distribution function
F(.): cumulative distribution function
'|': to be verbalised 'given', as in f(Y | X), 'the probability distribution function of Y given X'
Part I
Foundations
Chapter 1
Introduction
Collecting, analysing and drawing inferences from data are central to research in the medical and social sciences. Unfortunately, for any number of reasons, it is rarely possible to collect all the intended data. The ubiquity of missing data, and the problems this poses for both analysis and inference, has spawned a substantial statistical literature dating from the 1950s. At that time, when statistical computing was in its infancy, many analyses were only feasible because of the carefully planned balance in the dataset (for example, the same number of observations on each unit). Missing data meant the available data for analysis were unbalanced, thus complicating the planned analysis and in some instances rendering it infeasible. Early work on the problem was therefore largely computational (e.g. Healy and Westmacott, 1956; Afifi and Elashoff, 1966; Orchard and Woodbury, 1972; Dempster et al., 1977).
The wider question of the consequences of nontrivial proportions of missing data for inference was neglected until a seminal paper by Rubin (1976). This set out a typology for assumptions about the reasons for missing data, and sketched their implications for analysis and inference. It marked the beginning of a broad stream of research about the analysis of partially observed data. The literature is now huge, and continues to grow, both as methods are developed for large and complex data structures, and as increasing computer power and suitable software enable researchers to apply these methods.
For a broad overview of the literature, a good place to start is one of the recent excellent textbooks. Little and Rubin (2002) write for applied statisticians. They give a good overview of likelihood methods, and give an introduction to multiple imputation. Allison (2002) presents a less technical overview. Schafer (1997) is more algorithmic, focusing on the EM algorithm and imputation using the multivariate normal and general location model. Molenberghs and Kenward (2007) focus on clinical studies, while Daniels and Hogan (2008) focus on longitudinal studies with a Bayesian emphasis.
The above books concentrate on parametric approaches. However, there is also a growing literature based around using inverse probability weighting, in the spirit of Horvitz and Thompson (1952), and associated doubly robust methods. In particular, we refer to the work of Robins and colleagues (e.g. Robins et al., 1995; Scharfstein et al., 1999). Vansteelandt et al. (2009) give an accessible introduction to these developments. A comparison with multiple imputation in a simple setting is given by Carpenter et al. (2006). The pros and cons are debated in Kang and Schafer (2007) and the theory is brought together by Tsiatis (2006).
This book is concerned with a particular statistical method for analysing and drawing inferences from incomplete data, called Multiple Imputation (MI). Initially proposed by Rubin (1987) in the context of surveys, increasing awareness among researchers about the possible effects of missing data (e.g. Klebanoff and Cole, 2008) has led to an upsurge of interest (e.g. Sterne et al., 2009; Kenward and Carpenter, 2007; Schafer, 1999a; Rubin, 1996).
Multiple imputation (MI) is attractive because it is both practical and widely applicable. Recently developed statistical software (see, for example, issue 45 of the Journal of Statistical Software) has placed it within the reach of most researchers in the medical and social sciences, whether or not they have undertaken advanced training in statistics. However, the increasing use of MI in a range of settings beyond that originally envisaged has led to a bewildering proliferation of algorithms and software. Further, the implication of the underlying assumptions in the context of the data at hand is often unclear.
We are writing for researchers in the medical and social sciences with the aim of clarifying the issues raised by missing data, outlining the rationale for MI, explaining the motivation and relationship between the various imputation algorithms, and describing and illustrating its application to increasingly complex data structures.
Central to the analysis of partially observed data is an understanding of why the data are missing and the implications of this for the analysis. This is the focus of the remainder of this chapter. Introducing some of the examples that run through the book, we show how Rubin's typology (Rubin, 1976) provides the foundational framework for understanding the implications of missing data.
In this section we consider possible reasons for missing data, illustrate these with examples, and draw some preliminary implications for inference. We use the word ‘possible’ advisedly, since with partially observed data we can rarely be sure of the mechanism giving rise to missing data. Instead, a range of possible mechanisms are consistent with the observed data. In practice, we therefore wish to analyse the data under different mechanisms, to establish the robustness of our inference in the face of uncertainty about the missingness mechanism.
All datasets consist of a series of units each of which provides information on a series of items. For example, in a cross-sectional questionnaire survey, the units would be individuals and the items their answers to the questions. In a household survey, the units would be households, and the items information about the household and members of the household. In longitudinal studies, units would typically be individuals while items would be longitudinal data from those individuals. In this book, units therefore correspond to the highest level in multilevel (i.e., hierarchical) data, and unless stated otherwise data from different units are statistically independent.
Within this framework, it is useful to distinguish between units for which all the information is missing, termed unit nonresponse, and units that contribute only partial information, termed item nonresponse. The statistical issues are the same in both cases, and both can in principle be handled by MI. However, the main focus of this book is the latter.
Figure 1.1 Detail from a senior mandarin's house front in New Territories, Hong Kong. Photograph by H. Goldstein.
We now introduce two key examples, which we return to throughout the book.
It is very important to investigate the patterns of missing data before embarking on a formal analysis. This can throw up vital information that might otherwise be overlooked, and may even allow the missing data to be traced. For example, when analysing the new wave of a longitudinal survey, a colleague's careful examination of missing data patterns established that many of the missing questionnaires could be traced to a set of cardboard boxes. These turned out to have been left behind in a move. They were recovered and the data entered.
Most statistical software now has tools for describing the pattern of missing data. Key questions concern the extent and patterns of missing values, and whether the pattern is monotone (as described in the next paragraph), since if it is, this can considerably speed up and simplify the analysis.
Missing data in a set of p variables are said to follow a monotone missingness pattern if the variables can be re-ordered such that, for every unit i and variable j, whenever Y_{i,j} is missing, Y_{i,j′} is also missing for every j′ > j.
A natural setting for the occurrence of monotone missing data is a longitudinal study, where units are observed either until they are lost to follow-up, or the study concludes. A monotone pattern is thus inconsistent with interim missing data, where units are observed for a period, missing for the subsequent period, but then observed. Questionnaires may also give rise to monotone missing data patterns when individuals systematically answer each question in turn from the beginning till they either stop or complete the questionnaire. In other settings it may be possible to re-order items to achieve a monotone pattern.
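To make this concrete, here is a minimal Python sketch, ours rather than the book's (the book's own code, in several packages, is on its website), which checks whether a dataset's missingness pattern can be re-ordered into a monotone one. It assumes the data sit in a pandas DataFrame with missing values coded as NaN.

```python
import numpy as np
import pandas as pd

def is_monotone(df: pd.DataFrame) -> bool:
    """Return True if the columns can be re-ordered (here, by increasing
    number of missing values) so that, within each row, once a value is
    missing all later values are missing too."""
    ordered = df[df.isna().sum().sort_values().index]
    miss = ordered.isna().to_numpy()
    # A pattern is monotone exactly when each row's missingness indicator
    # equals its own cumulative OR along the re-ordered columns.
    return bool(np.all(miss == np.logical_or.accumulate(miss, axis=1)))

# Toy longitudinal example: units are observed until they drop out
toy = pd.DataFrame({
    "y1": [1.0, 2.0, 3.0, 4.0],
    "y2": [1.1, 2.1, np.nan, 4.1],
    "y3": [1.2, np.nan, np.nan, 4.2],
})
print(is_monotone(toy))    # True

toy.loc[0, "y2"] = np.nan  # introduce an interim missing value
print(is_monotone(toy))    # False: y2 is missing but y3 is observed for unit 0
```

Sorting the variables by their number of missing values is enough here because, in a monotone pattern, later variables can only have at least as many missing values as earlier ones.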
Table 1.1 YCS variables for exploring the relationship between Year 11 attainment and social stratification
Variable name   Description
cohort          year of data collection: 1990, 93, 95, 97, 99
boy             indicator variable for boys
occupation      parental occupation, categorised as managerial, intermediate or working
ethnicity       categorised as Bangladeshi, Black, Indian, other Asian, Other, Pakistani or White

Table 1.2 Pattern of missing values in the YCS data.
Table 1.3 Asthma study: withdrawal pattern by treatment arm.
Our focus is the practical implications of missing data for both parameter estimation and inference. Unfortunately, the two are often conflated, so that a computational method for parameter estimation when data are missing is said to have ‘solved’ or ‘handled’ the missing data issue. Since, with missing data, computational methods only lead to valid inference under specific assumptions, this attitude is likely to lead to misleading inferences.
In this context, it may be helpful to draw an analogy with the sampling process used to collect the data. If an analyst is presented with a spreadsheet containing columns of numerical data, they can analyse the data (calculate means of variables, regress variables on each other and so forth). However, they cannot draw any inferences unless they are told how and from whom the data were collected. This information is external to the numerical values of the variables.
We may think of the missing data mechanism as a second stage in the sampling process, but one that is not under our control. It acts on the data we intended to collect and leaves us with a partially observed dataset. Once again, the missing data mechanism cannot usually be definitively identified from the observed data, although the observed data may indicate plausible mechanisms (e.g. response may be negatively correlated with age). Thus we will need to make an assumption about the missingness mechanism in order to draw inference. The process of making this assumption is quite separate from the statistical methods we use for parameter estimation etc. Further, to the extent that the missing data mechanism cannot be definitively identified from the data, we will often wish to check the robustness of our inferences to a range of missingness mechanisms that are consistent with the observed data. The reason this book focuses on the statistical method of MI is that it provides a computationally feasible approach to the analysis for a wide range of problems under a range of missingness mechanisms.
We therefore begin with a typology for the mechanisms causing, or generating, the missing data.
Later in this chapter we will see that consideration of these mechanisms in the context of the analysis at hand clarifies the assumptions under which a simple analysis, such as restriction to complete records, will be valid. It also clarifies when more sophisticated computational approaches such as MI will be valid and informs the way they are conducted. We stress again that the mechanism causing the missing data can rarely be definitively established. Thus we will often wish to explore the robustness of our inferences to a range of plausible missingness mechanisms—a process we call sensitivity analysis.
From a general standpoint, missing data may cause two problems: loss of efficiency and bias.
First, loss of efficiency, or information, is an inevitable consequence of missing data. Unfortunately, the extent of information loss is not directly linked to the proportion of incomplete records. Instead it is intrinsically linked to the analysis question. When crossing the road, the rear of the oncoming traffic is hidden from view—the data are missing. However, these missing data do not bear on the question at hand—will I make it across the road safely? While the proportion of missing data about each oncoming vehicle is substantial, information loss is negligible. Conversely, when estimating the prevalence of a rare disease, a small proportion of missing observations could have a disproportionate impact on the resulting estimate.
Faced with an incomplete dataset, most software automatically restricts analysis to complete records. As we illustrate below, the consequence of this for loss of information is not always easy to predict. Nevertheless, in many settings it will be important to include the information from partially complete records. Not least of the reasons for this is the time and money it has taken to collect even the partially complete records. Under certain assumptions about the missingness mechanism, we shall see that MI provides a natural way to do this.
Second, and perhaps more fundamentally, the subset of complete records may not be representative of the population under study. Restricting analysis to complete records may then lead to biased inference. The extent of such bias depends on the statistical behaviour of the missing data. A formal framework to describe this behaviour is thus fundamental. Such a framework was first elucidated in a seminal paper by Rubin (1976). To describe this, we need some definitions.
For clarity we take a frequentist approach to inference. This is not essential or necessarily desirable; indeed we will see that MI is essentially a Bayesian method, with good frequentist properties. Often, as Chapter 2 shows, it is formally establishing these frequentist properties that is most difficult theoretically.
1.1
The missing value mechanism is then formally defined as

Pr(R_i | Y_i),     (1.2)

that is to say, the probability of observing unit i's data given their potentially unseen values Y_i. It is important to note that, in what follows, we assume that unit i's data exist (or at least existed). In other words, if it had been possible for us to be in the right place at the right time, we would have been able to observe the complete data. What (1.2) describes, therefore, is the probability that the data collection we were able to undertake on unit i yielded the observed values Y_{i,O}. Thus, (at least until we consider sensitivity analysis for clinical trials in Chapter 10) the missing data are not counter-factual, in the sense of what might have happened if a patient had taken a different drug from the one they actually took, or a child had gone to a different school from the one they actually attended.
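As a purely illustrative aside (our own sketch, not the book's notation made executable), the following Python fragment builds the missingness indicator for a small hypothetical data frame, coding an entry as 1 when the corresponding value is observed and 0 when it is missing, and then tabulates the distinct missingness patterns.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with item nonresponse in two of the three variables
df = pd.DataFrame({
    "age":    [23, 35, 41, 57, 62],
    "income": [32.0, np.nan, 51.0, np.nan, 28.0],
    "educ":   ["A", "B", np.nan, "B", np.nan],
})

# R[i, j] = 1 if the (i, j)-th value is observed, 0 if it is missing
R = df.notna().astype(int)
print(R)

# Frequency of each distinct missingness pattern across units
patterns = R.apply(lambda row: "".join(map(str, row)), axis=1)
print(patterns.value_counts())
```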
We say data are Missing Completely At Random (MCAR) if the probability of a value being missing is unrelated to the observed and unobserved data on that unit. Algebraically,
Pr(R_i | Y_i) = Pr(R_i).     (1.3)
Since, when data are MCAR, the chance of the data being missing is unrelated to the values, the observed data are therefore representative of the population. However, relative to the data we intended to collect, information has been lost.
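A small simulation, entirely our own illustration with made-up parameters, shows both halves of this statement: under MCAR the complete-record mean stays unbiased, but it is noticeably more variable than the mean of the fully observed data would have been.

```python
import numpy as np

rng = np.random.default_rng(2012)
n, n_sim, p_miss = 200, 2000, 0.5

full_means, cr_means = [], []
for _ in range(n_sim):
    y = rng.normal(loc=50.0, scale=10.0, size=n)
    observed = rng.random(n) > p_miss   # MCAR: missingness ignores y entirely
    full_means.append(y.mean())
    cr_means.append(y[observed].mean())

# Both estimators centre on 50, but the complete-record estimator is roughly
# sqrt(1 / (1 - p_miss)) times more variable: information has been lost.
print(np.mean(full_means), np.std(full_means))
print(np.mean(cr_means), np.std(cr_means))
```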
We say data are Missing At Random (MAR) if, given, or conditional on, the observed data, the probability distribution of R_i is independent of the unobserved data. Recalling that for individual i we can partition Y_i as (Y_{i,O}, Y_{i,M}), we can express this mathematically as

Pr(R_i | Y_{i,O}, Y_{i,M}) = Pr(R_i | Y_{i,O}).     (1.4)
This does not mean, as is sometimes supposed, that the probability of observing a variable on an individual is independent of the value of that variable. Quite the contrary: under MAR the chance of observing a variable may well depend on its value. Crucially though, given the observed data, this dependence is broken. Consider the following example.
Figure 1.2 Plot of 200 hypothetical incomes against job type.
The immediate consequence of this is that the mean of the observed incomes, marginal to (or aggregating over) job type, is biased downwards. The data were generated with a mean income of £60,000 in job type A and £30,000 in job type B, so that the true mean income is £45,000. Contrast the observed mean income of
We note three further points. First, if within job type the probability of observing income does not depend on income, it follows that:
Of course, were it observed, we could ‘test the MAR assumption’ in two ways: first a logistic regression, for example:
if MAR is true then the hypothesis is true. Or, we could fit a corresponding regression:
If MAR is true then the hypothesis is true.
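The income example can be mimicked with a short simulation. The sketch below is our own, with assumed parameters chosen to match the text (mean income £60,000 in job type A and £30,000 in job type B, working in thousands of pounds, so a true marginal mean of 45): income is made missing with a probability that depends only on the fully observed job type. The complete-record marginal mean is biased downwards, the within-job-type means are not, and, because the simulation knows the full incomes, we can also fit the kind of 'test' regression described above: given job type, income should not predict the chance of being observed.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 2000
job = rng.choice(["A", "B"], size=n)                            # fully observed
income = np.where(job == "A",
                  rng.normal(60, 5, n), rng.normal(30, 5, n))   # thousands of pounds

# MAR: the probability of observing income depends on job type only
p_obs = np.where(job == "A", 0.3, 0.9)
observed = rng.random(n) < p_obs

df = pd.DataFrame({"job": job,
                   "income": np.where(observed, income, np.nan),
                   "R": observed.astype(int)})

print(np.nanmean(df["income"]))             # about 37.5: biased well below 45
print(df.groupby("job")["income"].mean())   # close to 60 and 30 within job type

# Logistic regression of the observation indicator on job type and (known
# here only because this is a simulation) income: the income coefficient
# should be close to zero once job type is included.
X = sm.add_constant(pd.DataFrame({"jobA": (job == "A").astype(float),
                                  "income": income}))
print(sm.Logit(df["R"], X).fit(disp=0).summary())
```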
This simple example draws out the following general points:
These two points together mean that the MAR mechanism is much more subtle than might at first appear; these subtleties can manifest themselves unexpectedly.
Then missing data are still MAR, and (1.7) is still a valid estimate.
Of course, it may be as contrived to think each individual has their own MAR mechanism as to think that the same mechanism holds for all. In a simple example this is not important, but in real applications a blanket assumption of MAR may be very contrived.
Table 1.4 Three variables: all possible missing value patterns
Faced with complex data, there is a temptation to invoke the MAR assumption too readily, especially as this simplifies any analysis using MI. To guard against this, analysts need to be satisfied that any associations assumed to justify the MAR assumption are at least consistent with the observed data. Since consideration of selection mechanisms may not be as straightforward as might first appear, it can also be worth considering the plausibility of MAR from the point of view of the joint and conditional distribution of the data. As (1.6) illustrates, for MAR we need to be satisfied that the conditional distribution of the partially observed variables, given the fully observed variables, is the same whether or not the partially observed variables are actually observed.
The above discussion explains why we do not regard the MAR assumption as a panacea, but nevertheless often both a plausible and practical starting point for the analysis of partially observed data. In particular, the points drawn out of Example 1.4 are not specific to either the number or type of variables (categorical or quantitative).
If the mechanism causing missing data is neither MCAR nor MAR, we say it is Missing Not At Random (MNAR). Under a MNAR mechanism, the probability of an observation being missing depends on the underlying value, and this dependence remains even given the observed data. Mathematically,
Pr(R_i | Y_{i,O}, Y_{i,M}) ≠ Pr(R_i | Y_{i,O}).     (1.8)
While in some settings MNAR may be more plausible than MAR, analysis under MNAR is considerably harder. This is because under MAR, equation (1.6) showed that conditional distributions of partially observed variables given fully observed variables are the same in units who do, and do not, have the data observed. However (1.6) does not hold if (1.8) holds.
It follows that inference under MNAR involves an explicit specification of either the selection mechanism, or how conditional distributions of partially observed variables given fully observed variables differ between units who do, and do not, have the data observed.
Formally, we can write the joint distribution of unit i's variables, Yi, and the indicator for observing those variables, Ri as
Pr(R_i | Y_i) f(Y_i) = f(Y_i, R_i) = f(Y_i | R_i) Pr(R_i).     (1.9)
In the centre is the joint distribution, and this can be written either as the marginal distribution of Y_i multiplied by the conditional distribution of R_i given Y_i (the left hand side of (1.9), known as the selection model factorisation), or as the marginal distribution of R_i multiplied by the conditional distribution of Y_i given R_i (the right hand side, known as the pattern mixture factorisation).
Thus we can specify a MNAR mechanism either by specifying the selection model (which implies the pattern mixture model) or by specifying a pattern mixture model (which implies a selection model). Depending on the context, both approaches may be helpful. Unfortunately, even in apparently simple settings, explicitly calculating the selection implication of a pattern mixture model, or vice versa, can be awkward. We shall see in Chapter 10 that an advantage of multiple imputation is that, given a pattern mixture model, we can estimate the selection model implications quite easily.
Once again, as the example below shows, MNAR is an assumption for the analysis, not a characteristic of the data.
The example above illustrates that when data are MNAR, instead of thinking about the selection mechanism, it is equally appropriate to consider differences between conditional distributions of partially observed given fully observed variables. Under MAR such distributions do not differ depending on whether data are missing or not; under MNAR they do. Considering the conditional distribution of the observed data, and then exploring the robustness of inference as it is allowed to differ in the unobserved data, is therefore a natural way to explore the robustness of inference to an assumption of MAR. From our perspective it has two further advantages: (i) the differences can be expressed simply and pictorially, and (ii) MI provides a natural route for inference. Unfortunately, the selection counterparts, or implications, of pattern mixture models are rarely easy to calculate directly, but again MI can help: after imputing missing data under a pattern mixture model, it is straightforward to explore implications for the implied selection model.
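As a foretaste of the sensitivity analyses in Chapter 10, the following toy Python sketch, ours and deliberately simplified rather than the book's procedure, illustrates the pattern mixture idea in its crudest form: fit an imputation model for the partially observed variable given the fully observed one using the complete records, impute, and then shift the imputed values by an offset delta so that the conditional distribution is allowed to differ in the unobserved data. For brevity it does not re-draw the imputation model parameters for each imputed dataset, so it is not proper multiple imputation in the sense described in Chapter 2.

```python
import numpy as np

rng = np.random.default_rng(10)
n = 1000
x = rng.normal(0, 1, n)                              # fully observed
y = 2.0 + 1.5 * x + rng.normal(0, 1, n)              # partially observed
observed = rng.random(n) < 1.0 / (1.0 + np.exp(-x))  # MAR given x

# Imputation model for y given x, fitted on the complete records
b1, b0 = np.polyfit(x[observed], y[observed], 1)
sigma = np.std(y[observed] - (b0 + b1 * x[observed]))

for delta in [0.0, -0.5, -1.0]:      # delta = 0 corresponds to MAR
    means = []
    for _ in range(20):              # 20 imputed datasets
        y_imp = y.copy()
        draws = b0 + b1 * x[~observed] + rng.normal(0, sigma, (~observed).sum())
        y_imp[~observed] = draws + delta   # pattern mixture offset
        means.append(y_imp.mean())
    print(delta, round(float(np.mean(means)), 3))
```

Plotting or tabulating the estimate against delta then shows directly how far the assumed departure from MAR would have to go before the substantive conclusion changes.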
If, under a specific assumption about the missingness mechanism, we can construct a valid analysis that does not require us to explicitly include the model for that missing value mechanism, we term the mechanism, in the context of this analysis, ignorable.
A common example of this is a likelihood based analysis assuming MAR.
However, as we see below, there are other settings, where we do not assume MAR, that do not require us to explicitly include the model for the missingness mechanism yet still result in valid inference. For example, as discussed in Section 1.6.2, a complete records regression analysis is valid if data are MNAR dependent only on the covariates.
We have already noted that, given the observed data, we cannot definitively identify the missingness mechanism. Nevertheless, the observed data can help frame plausible assumptions about this, in other words assumptions which are consistent with the observed data. Exploratory analyses of this nature are important for (i) assessing whether a complete records analysis is likely to be biased and (ii) framing appropriate imputation models. Two key tools for this are (a) summaries (tabular or graphical) of fully observed, or nearly fully observed, variables by missingness pattern, and (b) logistic regression of missingness indicators on observed, or nearly fully observed, variables.
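A hedged Python sketch of these two tools, using an entirely hypothetical simulated data frame with a partially observed outcome y and (near-)fully observed age and sex, might look as follows.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical partially observed data: y is missing more often for older units
rng = np.random.default_rng(3)
n = 500
df = pd.DataFrame({"age": rng.normal(40, 10, n),
                   "sex": rng.choice(["F", "M"], n)})
y = 1 + 0.05 * df["age"] + rng.normal(0, 1, n)
df["y"] = np.where(rng.random(n) < 0.8 - 0.005 * df["age"], y, np.nan)

# Tool (a): summarise (near-)fully observed variables by missingness pattern
df["y_missing"] = df["y"].isna().astype(int)
print(df.groupby("y_missing")["age"].mean())
print(pd.crosstab(df["y_missing"], df["sex"], normalize="index"))

# Tool (b): logistic regression of the missingness indicator on observed variables
fit = smf.logit("y_missing ~ age + sex", data=df).fit(disp=0)
print(fit.summary())
```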
Table 1.5 Asthma study: mean FEV1 (litres) at each visit, by dropout pattern and intervention arm.
Usually, we will wish to fit some form of regression model to address our substantive questions. Here, we look at the implications, in terms of bias and loss of information, of missing data in the response and/or covariates under different missingness mechanisms. We first focus on linear regression; our findings there hold for most other regression models, including relative risk regression and survival analysis. Logistic regression is more subtle; we discuss this in Section 1.6.4.
Suppose we wish to fit the model
1.12
but Y is partially observed. Let Ri indicate whether Yi is observed. For now assume that the xi are known without error; for example, they may be design variables. Then the contribution to the likelihood for β from unit i, conditional on xi, is
1.13
Assume, as will typically be the case, that the parameters of the model for Yi given xi, β, are distinct from the parameters of the missingness mechanism.
Figure 1.2 suggests that, provided Y is MAR given the covariates in the model, units with missing response have no information about β. To see this formally, first observe that as Yi