CAUSAL INFERENCE IN STATISTICS

A Primer

Causality is central to the understanding and use of data. Without an understanding of cause–effect relationships, we cannot use data to answer questions as basic as "Does this treatment harm or help patients?" But though hundreds of introductory texts are available on statistical methods of data analysis, until now, no beginner-level book has been written about the exploding arsenal of methods that can tease causal information from data.

Causal Inference in Statistics fills that gap. Using simple examples and plain language, the book lays out how to define causal parameters; the assumptions necessary to estimate causal parameters in a variety of situations; how to express those assumptions mathematically; whether those assumptions have testable implications; how to predict the effects of interventions; and how to reason counterfactually. These are the foundational tools that any student of statistics needs to acquire in order to use statistical methods to answer causal questions of interest.

This book is accessible to anyone with an interest in interpreting data, from undergraduates, professors, and researchers to the interested layperson. Examples are drawn from a wide variety of fields, including medicine, public policy, and law; a brief introduction to probability and statistics is provided for the uninitiated; and each chapter comes with study questions to reinforce the reader's understanding.




Table of Contents

Cover

Title Page

Copyright

Dedication

About the Authors

Preface

Acknowledgments

List of Figures

About the Companion Website

Chapter 1: Preliminaries: Statistical and Causal Models

1.1 Why Study Causation

1.2 Simpson's Paradox

1.3 Probability and Statistics

1.4 Graphs

1.5 Structural Causal Models

Bibliographical Notes for Chapter 1

Chapter 2: Graphical Models and Their Applications

2.1 Connecting Models to Data

2.2 Chains and Forks

2.3 Colliders

2.4 d-separation

2.5 Model Testing and Causal Search

Bibliographical Notes for Chapter 2

Chapter 3: The Effects of Interventions

3.1 Interventions

3.2 The Adjustment Formula

3.3 The Backdoor Criterion

3.4 The Front-Door Criterion

3.5 Conditional Interventions and Covariate-Specific Effects

3.6 Inverse Probability Weighing

3.7 Mediation

3.8 Causal Inference in Linear Systems

Bibliographical Notes for Chapter 3

Chapter 4: Counterfactuals and Their Applications

4.1 Counterfactuals

4.2 Defining and Computing Counterfactuals

4.3 Nondeterministic Counterfactuals

4.4 Practical Uses of Counterfactuals

4.5 Mathematical Tool Kits for Attribution and Mediation

Bibliographical Notes for Chapter 4

References

Index

End User License Agreement



List of Illustrations

Chapter 1: Preliminaries: Statistical and Causal Models

Figure 1.1 Results of the exercise–cholesterol study, segregated by age

Figure 1.2 Results of the exercise–cholesterol study, unsegregated. The data points are identical to those of Figure 1.1, except the boundaries between the various age groups are not shown

Figure 1.3 Scatter plot of the results in Table 1.6, with the value of Die 1 on the x-axis and the sum of the two dice rolls on the y-axis

Figure 1.4 Scatter plot of the results in Table 1.6, with the value of Die 1 on the x-axis and the sum of the two dice rolls on the y-axis. The dotted line represents the line of best fit based on the data. The solid line represents the line of best fit we would expect in the population

Figure 1.5 An undirected graph in which nodes X and Y are adjacent and nodes Y and Z are adjacent but not X and Z

Figure 1.6 A directed graph in which node A is a parent of B and B is a parent of C

Figure 1.7 (a) Showing acyclic graph and (b) cyclic graph

Figure 1.8 A directed graph used in Study question 1.4.1

Figure 1.9 The graphical model of SCM 1.5.1, with X indicating years of schooling, Y indicating years of employment, and Z indicating salary

Figure 1.10 Model showing an unobserved syndrome, Z, affecting both treatment (X) and outcome (Y)

Chapter 2: Graphical Models and Their Applications

Figure 2.1 The graphical model of SCMs 2.1–2.3

Figure 2.2 The graphical model of SCMs 2.6 and 2.7

Figure 2.3 A simple collider

Figure 2.4 A simple collider, Z, with one child, W, representing the scenario from Table 2.3, with X representing one coin flip, Y representing the second coin flip, Z representing a bell that rings if either X or Y is heads, and W representing an unreliable witness who reports on whether or not the bell has rung

Figure 2.5 A directed graph for demonstrating conditional independence (error terms are not shown explicitly)

Figure 2.6 A directed graph in which P is a descendant of a collider

Figure 2.7 A graphical model containing a collider with child and a fork

Figure 2.8 The model from Figure 2.7 with an additional forked path between Z and Y

Figure 2.9 A causal graph used in study question 2.4.1; all U terms (not shown) are assumed independent

Chapter 3: The Effects of Interventions

Figure 3.1 A graphical model representing the relationship between temperature (Z), ice cream sales (X), and crime rates (Y)

Figure 3.2 A graphical model representing an intervention on the model in Figure 3.1 that lowers ice cream sales

Figure 3.3 A graphical model representing the effects of a new drug, with Z representing gender, X standing for drug usage, and Y standing for recovery

Figure 3.4 A modified graphical model representing an intervention on the model in Figure 3.3 that sets drug usage in the population, and results in the manipulated probability

Figure 3.5 A graphical model representing the effects of a new drug, with X representing drug usage, Y representing recovery, and Z representing blood pressure (measured at the end of the study). Exogenous variables are not shown in the graph, implying that they are mutually independent

Figure 3.6 A graphical model representing the relationship between a new drug (X), recovery (Y), weight (W), and an unmeasured variable Z (socioeconomic status)

Figure 3.7 A graphical model in which the backdoor criterion requires that we condition on a collider (Z) in order to ascertain the effect of X on Y

Figure 3.8 Causal graph used to illustrate the backdoor criterion in the following study questions

Figure 3.9 Scatter plot with students' initial weights on the x-axis and final weights on the y-axis. The vertical line indicates students whose initial weights are the same, and whose final weights are higher (on average) for plan B compared with plan A

Figure 3.10 A graphical model representing the relationships between smoking and lung cancer, with an unobserved confounder and a mediating variable Z

Figure 3.11 A graphical model representing the relationship between gender, qualifications, and hiring

Figure 3.12 A graphical model representing the relationship between gender, qualifications, and hiring, with socioeconomic status as a mediator between qualifications and hiring

Figure 3.13 A graphical model illustrating the relationship between path coefficients and total effects

Figure 3.14 A graphical model in which X has no direct effect on Y, but a total effect that is determined by adjusting for T

Figure 3.15 A graphical model in which X has a direct effect on Y

Figure 3.16 By removing the direct edge from X to Y and finding the set of variables that d-separate them, we find the variables we need to adjust for to determine the direct effect of X on Y

Figure 3.17 A graphical model in which we cannot find the direct effect of X on Y via adjustment, because the dashed double-arrow arc represents the presence of a backdoor path between X and Y, consisting of unmeasured variables. In this case, Z is an instrument with regard to the effect of X on Y that enables the identification of

Figure 3.18 Graph corresponding to Model 3.1 in Study question 3.8.1

Chapter 4: Counterfactuals and Their Applications

Figure 4.1 A model depicting the effect of Encouragement on student's score

Figure 4.2 Answering a counterfactual question about a specific student's score, predicated on the assumption that homework would have increased to

Figure 4.3 A model representing Eq. (4.7), illustrating the causal relations between college education (X), skills (Z), and salary (Y)

Figure 4.4 Illustrating the graphical reading of counterfactuals. (a) The original model. (b) The modified model in which the node labeled represents the potential outcome Y predicated on

Figure 4.5 (a) Showing how probabilities of necessity (PN) are bounded, as a function of the excess risk ratio (ERR) and the confounding factor (CF) (Eq. (4.31)); (b) showing how PN is identified when monotonicity is assumed (Theorem 4.5.1)

Figure 4.6 (a) The basic nonparametric mediation model, with no confounding. (b) A confounded mediation model in which dependence exists between and

List of Tables

Chapter 1: Preliminaries: Statistical and Causal Models

Table 1.1 Results of a study into a new drug, with gender being taken into account

Table 1.2 Results of a study into a new drug, with posttreatment blood pressure taken into account

Table 1.3 Age breakdown of voters in 2012 election (all numbers in thousands)

Table 1.4 Age breakdown of voters over the age of 29 in 2012 election (all numbers in thousands)

Table 1.5 The proportion of males and females achieving a given education level

Table 1.6 Results of 12 rolls of two fair dice

Chapter 2: Graphical Models and Their Applications

Table 2.1 Probability distribution for two flips of a fair coin, with X representing flip one, Y representing flip two, and Z representing a bell that rings if either flip results in heads

Table 2.2 Conditional probability distributions for the distribution in Table 2.1. (Top: Distribution conditional on . Bottom: Distribution conditional on )

Table 2.3 Probability distribution for two flips of a fair coin and a bell that rings if either flip results in heads, with X representing flip one, Y representing flip two, and W representing a witness who, with variable reliability, reports whether or not the bell has rung

Chapter 3: The Effects of Interventions

Table 3.1 A hypothetical data set of randomly selected samples showing the percentage of cancer cases for smokers and nonsmokers in each tar category (numbers in thousands)

Table 3.2 Reorganization of the data set of Table 3.1 showing the percentage of cancer cases in each smoking-tar category (numbers in thousands)

Table 3.3 Joint probability distribution for the drug-gender-recovery story of Chapter 1 (Table 1.1)

Table 3.4 Conditional probability distribution for drug users in the population of Table 3.3

Table 3.5 Probability distribution for the population of Table 3.3 under the intervention , determined via the inverse probability method

Chapter 4: Counterfactuals and Their Applications

Table 4.1 The values attained by , and in the linear model of Eqs. (4.3) and (4.4)

Table 4.2 The values attained by , and in the model of Eq. (4.7)

Table 4.3 Potential and observed outcomes predicted by the structural model of Figure 4.1. Units were selected at random, with each uniformly distributed over

Table 4.4 Potential and observed outcomes in a randomized clinical trial with X randomized over and

Table 4.5 Experimental and nonexperimental data used to illustrate the estimation of PN, the probability that drug x was responsible for a person's death

Table 4.6 The expected success (Y) for treated () and untreated () students, as a function of their homework (M)

Table 4.7 The expected homework (M) done by treated () and untreated () students

Causal Inference in Statistics

A Primer

 

Judea Pearl

Computer Science and Statistics, University of California, Los Angeles, USA

 

Madelyn Glymour

Philosophy, Carnegie Mellon University, Pittsburgh, USA

 

Nicholas P. Jewell

Biostatistics and Statistics, University of California, Berkeley, USA

 

 

 

This edition first published 2016

© 2016 John Wiley & Sons Ltd

Registered office

John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, United Kingdom

For details of our global editorial offices, for customer services and for information about how to apply for permission to reuse the copyright material in this book please see our website at www.wiley.com.

The right of the author to be identified as the author of this work has been asserted in accordance with the Copyright, Designs and Patents Act 1988.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by the UK Copyright, Designs and Patents Act 1988, without the prior permission of the publisher.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.

Designations used by companies to distinguish their products are often claimed as trademarks. All brand names and product names used in this book are trade names, service marks, trademarks or registered trademarks of their respective owners. The publisher is not associated with any product or vendor mentioned in this book.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. It is sold on the understanding that the publisher is not engaged in rendering professional services and neither the publisher nor the author shall be liable for damages arising herefrom. If professional advice or other expert assistance is required, the services of a competent professional should be sought.

Library of Congress Cataloging-in-Publication Data applied for

ISBN: 9781119186847

A catalogue record for this book is available from the British Library.

Cover Image: © gmaydos/Getty

 

 

 

To my wife, Ruth, my greatest mentor.

— Judea Pearl

To my parents, who are the causes of me.

— Madelyn Glymour

To Debra and Britta, who inspire me every day.

— Nicholas P. Jewell

About the Authors

Judea Pearl is Professor of Computer Science and Statistics at the University of California, Los Angeles, where he directs the Cognitive Systems Laboratory and conducts research in artificial intelligence, causal inference and philosophy of science. He is a Co-Founder and Editor of the Journal of Causal Inference and the author of three landmark books in inference-related areas. His latest book, Causality: Models, Reasoning and Inference (Cambridge, 2000, 2009), has introduced many of the methods used in modern causal analysis. It won the Lakatos Award from the London School of Economics and is cited by more than 9,000 scientific publications.

Pearl is a member of the National Academy of Sciences, the National Academy of Engineering, and a Founding Fellow of the Association for the Advancement of Artificial Intelligence (AAAI). He is a recipient of numerous prizes and awards, including the Technion's Harvey Prize and the ACM A.M. Turing Award for fundamental contributions to probabilistic and causal reasoning.

Madelyn Glymour is a data analyst at Carnegie Mellon University, and a science writer and editor for the Cognitive Systems Laboratory at UCLA. Her interests lie in causal discovery and in the art of making complex concepts accessible to broad audiences.

Nicholas P. Jewell is Professor of Biostatistics and Statistics at the University of California, Berkeley. He has held various academic and administrative positions at Berkeley since his arrival in 1981, most notably serving as Vice Provost from 1994 to 2000. He has also held academic appointments at the University of Edinburgh, Oxford University, the London School of Hygiene and Tropical Medicine, and at the University of Kyoto. In 2007, he was a Fellow at the Rockefeller Foundation Bellagio Study Center in Italy.

Jewell is a Fellow of the American Statistical Association, the Institute of Mathematical Statistics, and the American Association for the Advancement of Science (AAAS). He is a past winner of the Snedecor Award and the Marvin Zelen Leadership Award in Statistical Science from Harvard University. He is currently the Editor of the Journal of the American Statistical Association – Theory & Methods, and Chair of the Statistics Section of AAAS. His research focuses on the application of statistical methods to infectious and chronic disease epidemiology, the assessment of drug safety, time-to-event analyses, and human rights.

Preface

When attempting to make sense of data, statisticians are invariably motivated by causal questions. For example, “How effective is a given treatment in preventing a disease?”; “Can one estimate obesity-related medical costs?”; “Could government actions have prevented the financial crisis of 2008?”; “Can hiring records prove an employer guilty of sex discrimination?”

The peculiar nature of these questions is that they cannot be answered, or even articulated, in the traditional language of statistics. In fact, only recently has science acquired a mathematical language we can use to express such questions, with accompanying tools to allow us to answer them from data.

The development of these tools has spawned a revolution in the way causality is treated in statistics and in many of its satellite disciplines, especially in the social and biomedical sciences. For example, in the technical program of the 2003 Joint Statistical Meeting in San Francisco, there were only 13 papers presented with the word “cause” or “causal” in their titles; the number of such papers exceeded 100 by the Boston meeting in 2014. These numbers represent a transformative shift of focus in statistics research, accompanied by unprecedented excitement about the new problems and challenges that are opening themselves to statistical analysis. Harvard's political science professor Gary King puts this revolution in historical perspective: “More has been learned about causal inference in the last few decades than the sum total of everything that had been learned about it in all prior recorded history.”

Yet this excitement remains barely seen among statistics educators, and is essentially absent from statistics textbooks, especially at the introductory level. The reasons for this disparity are deeply rooted in the tradition of statistical education and in how most statisticians view the role of statistical inference.

In Ronald Fisher's influential manifesto, he pronounced that "the object of statistical methods is the reduction of data" (Fisher 1922). In keeping with that aim, the traditional task of making sense of data, often referred to generically as "inference," became that of finding a parsimonious mathematical description of the joint distribution of a set of variables of interest, or of specific parameters of such a distribution. This general strategy for inference is extremely familiar not just to statistical researchers and data scientists, but to anyone who has taken a basic course in statistics. In fact, many excellent introductory books describe smart and effective ways to extract the maximum amount of information possible from the available data. These books take the novice reader from experimental design to parameter estimation and hypothesis testing in great detail. Yet the aim of these techniques is invariably the description of data, not of the process responsible for the data. Most statistics books do not even have the word "causal" or "causation" in the index.

Yet the fundamental question at the core of a great deal of statistical inference is causal: do changes in one variable cause changes in another, and if so, how much change do they cause? In avoiding these questions, introductory treatments of statistical inference often fail even to discuss whether the parameters being estimated are the relevant quantities to assess when interest lies in cause and effect.

The best that most introductory textbooks do is this: first, they state the often-quoted aphorism that "association does not imply causation," and give a short explanation of confounding and of how "lurking variables" can lead to a misinterpretation of an apparent relationship between two variables of interest. Further, the boldest of those texts pose the principal question, "How can a causal link between x and y be established?", and answer it with the long-standing "gold standard" approach of resorting to a randomized experiment, an approach that to this day remains the cornerstone of the drug approval process in the United States and elsewhere.

However, given that most causal questions cannot be addressed through random experimentation, students and instructors are left to wonder if there is anything that can be said with any reasonable confidence in the absence of pure randomness.

In short, by avoiding discussion of causal models and causal parameters, introductory textbooks provide readers with no basis for understanding how statistical techniques address scientific questions of causality.

It is the intent of this primer to fill this gnawing gap and to assist teachers and students of elementary statistics in tackling the causal questions that surround almost any nonexperimental study in the natural and social sciences. We focus here on simple and natural methods of defining the causal parameters that we wish to understand, and we show what assumptions are necessary for estimating these parameters in observational studies. We also show that these assumptions can be expressed mathematically and transparently, and that simple mathematical machinery is available for translating them into estimable causal quantities, such as the effects of treatments and policy interventions, and for identifying their testable implications.

Our goal stops there for the moment; we do not address in any detail the optimal parameter estimation procedures that use the data to produce effective statistical estimates and their associated levels of uncertainty. However, those ideas—some of which are relatively advanced—are covered extensively in the growing literature on causal inference. We thus hope that this short text can be used in conjunction with standard introductory statistics textbooks like the ones we have described to show how statistical models and inference can easily go hand in hand with a thorough understanding of causation.

It is our strong belief that if one wants to move beyond mere description, statistical inference cannot be effectively carried out without thinking carefully about causal questions, and without leveraging the simple yet powerful tools that modern analysis has developed to answer such questions. It is also our experience that thinking causally leads to a much more exciting and satisfying approach to both the simplest and most complex statistical data analyses. This is not a new observation. Virgil said it much more succinctly than we in 29 BC:

“Felix, qui potuit rerum cognoscere causas” (Virgil 29 BC) (Lucky is he who has been able to understand the causes of things)

The book is organized in four chapters.

Chapter 1 provides the basic statistical, probabilistic, and graphical concepts that readers will need to understand the rest of the book. It also introduces the fundamental concepts of causality, including the causal model, and explains through examples how the model can convey information that pure data are unable to provide.
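As a taste of what a structural causal model looks like in practice, here is a minimal, hypothetical sketch: each variable is assigned a function of its parents plus an exogenous "noise" term. The variable roles echo SCM 1.5.1 (schooling, employment, salary), but the coefficients and noise distributions below are invented for illustration.

```python
# A hypothetical sketch of a structural causal model (SCM): each endogenous
# variable is computed from its parents and an exogenous term. The
# coefficients here are made up; only the structure matters.
import random

def sample_scm(seed=None):
    rng = random.Random(seed)
    u_x, u_y, u_z = (rng.gauss(0, 1) for _ in range(3))
    x = u_x                      # X: years of schooling (exogenous here)
    y = 0.5 * x + u_y            # Y: years of employment listens to X
    z = 2.0 * x + 1.0 * y + u_z  # Z: salary listens to both X and Y
    return {"X": x, "Y": y, "Z": z}

print(sample_scm(seed=1))
```

Reading the assignments as a recipe for generating data, rather than as algebraic equalities, is what lets the model answer questions that the data alone cannot.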

Chapter 2 explains how causal models are reflected in data, through patterns of statistical dependencies. It explains how to determine whether a data set complies with a given causal model, and briefly discusses how one might search for models that explain a given data set.
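The kind of model testing described above can be sketched in a few lines. In the chain X → Y → Z, the model implies that X and Z are independent given Y; under a linear-Gaussian simulation (an assumption made here only for simplicity), that independence shows up as a vanishing partial correlation:

```python
# Sketch: a causal model implies conditional independencies the data must
# satisfy. For the chain X -> Y -> Z, X should be independent of Z given Y.
# In a linear-Gaussian setting this can be checked via partial correlation.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)   # Y listens to X
z = -1.5 * y + rng.normal(size=n)  # Z listens only to Y

def partial_corr(a, b, given):
    """Correlation of a and b after regressing out `given` (least squares)."""
    g = np.column_stack([given, np.ones_like(given)])
    ra = a - g @ np.linalg.lstsq(g, a, rcond=None)[0]
    rb = b - g @ np.linalg.lstsq(g, b, rcond=None)[0]
    return np.corrcoef(ra, rb)[0, 1]

print(f"corr(X, Z)     = {np.corrcoef(x, z)[0, 1]:+.3f}")  # strongly nonzero
print(f"corr(X, Z | Y) = {partial_corr(x, z, y):+.3f}")    # approximately 0
```

A data set in which the second number were far from zero would refute this chain model, which is the sense in which causal models have testable implications.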

Chapter 3 is concerned with how to make predictions using causal models, with a particular emphasis on predicting the outcome of a policy intervention. Here we introduce techniques of reducing confounding bias using adjustment for covariates, as well as inverse probability weighing. This chapter also covers mediation analysis and contains an in-depth look at how the causal methods discussed thus far work in a linear system. Key to these methods is the fundamental distinction between regression coefficients and structural parameters, and how students should use both to predict causal effects in linear models.
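The adjustment formula mentioned above can be previewed numerically. For a single confounder Z, it reads P(Y = 1 | do(X = x)) = Σ_z P(Y = 1 | X = x, Z = z) P(Z = z); the counts in this sketch are made up for illustration, in the spirit of the drug-and-gender example:

```python
# A minimal numeric sketch of the adjustment formula:
#   P(Y=1 | do(X=x)) = sum_z P(Y=1 | X=x, Z=z) * P(Z=z)
# Z is a confounder (e.g., gender); the counts below are illustrative only.
counts = {
    # (z, x): (recovered, total)
    ("male",   "drug"):    (81, 87),
    ("male",   "no drug"): (234, 270),
    ("female", "drug"):    (192, 263),
    ("female", "no drug"): (55, 80),
}

def p_recover(z, x):
    recovered, total = counts[(z, x)]
    return recovered / total

total_n = sum(t for _, t in counts.values())
p_z = {z: sum(counts[(z, x)][1] for x in ("drug", "no drug")) / total_n
       for z in ("male", "female")}

# Effect of the intervention, obtained by adjusting for Z:
p_do = {x: sum(p_recover(z, x) * p_z[z] for z in ("male", "female"))
        for x in ("drug", "no drug")}

ace = p_do["drug"] - p_do["no drug"]
print(f"P(recovery | do(drug))    = {p_do['drug']:.3f}")
print(f"P(recovery | do(no drug)) = {p_do['no drug']:.3f}")
print(f"Average causal effect     = {ace:+.3f}")
```

Note that the averaging is over P(Z = z), not over P(Z = z | X = x); that substitution is exactly what distinguishes the interventional quantity from the ordinary conditional probability.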

Chapter 4 introduces the concept of counterfactuals—what would have happened, had we chosen differently at a point in the past—and discusses how we can compute them, estimate their probabilities, and what practical questions we can answer using them. This chapter is somewhat advanced, compared to its predecessors, primarily due to the novelty of the notation and the hypothetical nature of the questions asked. However, the fact that we read and compute counterfactuals using the same scientific models that we used in previous chapters should make their analysis an easy journey for students and instructors. Those wishing to understand counterfactuals on a friendly mathematical level should find this chapter a good starting point, and a solid basis for bridging the model-based approach taken in this book with the potential outcome framework that some experimentalists are pursuing in statistics.
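How counterfactuals are computed from a structural model can be previewed with a toy example of the three-step procedure (abduction, action, prediction). In a made-up linear SCM with Y = 2X + U_Y, observing X = 1 and Y = 3 pins down U_Y, after which the model can be replayed with X set to a different value:

```python
# Toy three-step counterfactual computation on a made-up linear SCM:
#   X = U_x
#   Y = 2 * X + U_y
# Observed: X = 1, Y = 3. Question: what would Y have been had X been 2?

x_obs, y_obs = 1.0, 3.0

# Step 1 (abduction): infer the exogenous term consistent with the evidence.
u_y = y_obs - 2.0 * x_obs  # U_y = 1

# Step 2 (action): modify the model, setting X = 2 regardless of U_x.
x_cf = 2.0

# Step 3 (prediction): recompute Y in the modified model with the same U_y.
y_cf = 2.0 * x_cf + u_y
print(f"Y would have been {y_cf}")  # 5.0: same unit, different treatment
```

Keeping U_y fixed across steps is what makes this a statement about the same individual, rather than about a fresh draw from the population.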

Acknowledgments

This book is an outgrowth of a graduate course on causal inference that the first author has been teaching at UCLA for the past 20 years. It owes many of its tools and examples to former members of the Cognitive Systems Laboratory who participated in the development of this material, both as researchers and as teaching assistants. These include Alex Balke, David Chickering, David Galles, Dan Geiger, Moises Goldszmidt, Jin Kim, George Rebane, Ilya Shpitser, Jin Tian, and Thomas Verma.

We are indebted to many colleagues from whom we have learned much about causal problems, their solutions, and how to present them to general audiences. These include Clark and Maria Glymour, for providing patient ears and sound advice on matters of both causation and writing, Felix Elwert and Tyler VanderWeele for insightful comments on an earlier version of the manuscript, and the many visitors and discussants to the UCLA Causality blog who kept the discussion lively, occasionally controversial, but never boring (causality.cs.ucla.edu/blog).

Elias Bareinboim, Bryant Chen, Andrew Forney, Ang Li, and Karthika Mohan reviewed the text for accuracy and transparency. Ang and Andrew also wrote solutions to the study questions, which will be available on the book's website.

The manuscript was most diligently typed, processed, illustrated, and proofed by Kaoru Mulvihill at UCLA. Debbie Jupe and Heather Kay at Wiley deserve much credit for recognizing and convincing us that a book of this scope is badly needed in the field, and for encouraging us throughout the production process.

Finally, the National Science Foundation and the Office of Naval Research deserve acknowledgment for faithfully and consistently sponsoring the research that led to these results, with special thanks to Behzad Kamgar-Parsi.


A model depicting the effect of Encouragement on student's score

Figure 4.2

Answering a counterfactual question about a specific student's score, predicated on the assumption that homework would have increased to

Figure 4.3

A model representing Eq. (4.7), illustrating the causal relations between college education (

X

), skills (

Z

), and salary (

Y

)

Figure 4.4

Illustrating the graphical reading of counterfactuals. (a) The original model. (b) The modified model in which the node labeled represents the potential outcome

Y

predicated on

Figure 4.5

(a) Showing how probabilities of necessity (PN) are bounded, as a function of the excess risk ratio (ERR) and the confounding factor (CF) (Eq. (4.31)); (b) showing how PN is identified when monotonicity is assumed (Theorem 4.5.1)

Figure 4.6

(a) The basic nonparametric mediation model, with no confounding. (b) A confounded mediation model in which dependence exists between and

About the Companion Website

This book is accompanied by a companion website:

www.wiley.com/go/Pearl/Causality

Chapter 1 Preliminaries: Statistical and Causal Models

1.1 Why Study Causation

The answer to the question “why study causation?” is almost as immediate as the answer to “why study statistics?” We study causation because we need to make sense of data, to guide actions and policies, and to learn from our successes and failures. We need to estimate the effect of smoking on lung cancer, of education on salaries, of carbon emissions on the climate. Most ambitiously, we also need to understand how and why causes influence their effects, which is no less valuable. For example, knowing whether malaria is transmitted by mosquitoes or “mal-air,” as many believed in the past, tells us whether we should pack mosquito nets or breathing masks on our next trip to the swamps.

Less obvious is the answer to the question, “why study causation as a separate topic, distinct from the traditional statistical curriculum?” What can the concept of “causation,” considered on its own, tell us about the world that tried-and-true statistical methods can't?

Quite a lot, as it turns out. When approached rigorously, causation is not merely an aspect of statistics; it is an addition to statistics, an enrichment that allows statistics to uncover workings of the world that traditional methods alone cannot. For example, and this might come as a surprise to many, none of the problems mentioned above can be articulated in the standard language of statistics.

To understand the special role of causation in statistics, let's examine one of the most intriguing puzzles in the statistical literature, one that illustrates vividly why the traditional language of statistics must be enriched with new ingredients in order to cope with cause–effect relationships, such as the ones we mentioned above.

1.2 Simpson's Paradox

Named after Edward Simpson (born 1922), the statistician who first popularized it, the paradox refers to the existence of data in which a statistical association that holds for an entire population is reversed in every subpopulation. For instance, we might discover that students who smoke get higher grades, on average, than nonsmokers get. But when we take into account the students' age, we might find that, in every age group, smokers get lower grades than nonsmokers get. Then, if we take into account both age and income, we might discover that smokers once again get higher grades than nonsmokers of the same age and income. The reversals may continue indefinitely, switching back and forth as we consider more and more attributes. In this context, we want to decide whether smoking causes grade increases, and in which direction and by how much, yet it seems hopeless to obtain the answers from the data.

In the classical example used by Simpson (1951), a group of sick patients are given the option to try a new drug. Among those who took the drug, a lower percentage recovered than among those who did not. However, when we partition by gender, we see that more men taking the drug recover than men who are not taking the drug, and more women taking the drug recover than women who are not taking the drug! In other words, the drug appears to help men and women, but to hurt the general population. It seems nonsensical, or even impossible—which is why, of course, it is considered a paradox. Some people find it hard to believe that numbers could even be combined in such a way. To make it believable, then, consider the following example:

Example 1.2.1

We record the recovery rates of 700 patients who were given access to the drug. A total of 350 patients chose to take the drug and 350 patients did not. The results of the study are shown in Table 1.1.

Table 1.1 Results of a study into a new drug, with gender being taken into account

                Drug                             No drug
Men             81 out of 87 recovered (93%)     234 out of 270 recovered (87%)
Women           192 out of 263 recovered (73%)   55 out of 80 recovered (69%)
Combined data   273 out of 350 recovered (78%)   289 out of 350 recovered (83%)

The first row shows the outcome for male patients; the second row shows the outcome for female patients; and the third row shows the outcome for all patients, regardless of gender. In male patients, drug takers had a better recovery rate than those who went without the drug (93% vs 87%). In female patients, again, those who took the drug had a better recovery rate than nontakers (73% vs 69%). However, in the combined population, those who did not take the drug had a better recovery rate than those who did (83% vs 78%).
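To see that the numbers in Table 1.1 really do combine this way, the short Python sketch below recomputes the subgroup and combined recovery rates from the raw counts (the dictionary layout and variable names are illustrative, not part of the book's notation):

```python
# Recovery counts from Table 1.1: (recovered, total) for each gender/treatment cell.
data = {
    ("men", "drug"): (81, 87),
    ("men", "no drug"): (234, 270),
    ("women", "drug"): (192, 263),
    ("women", "no drug"): (55, 80),
}

def rate(recovered, total):
    """Recovery rate as a fraction."""
    return recovered / total

# Within each gender, drug takers recover at a higher rate...
for gender in ("men", "women"):
    took = rate(*data[(gender, "drug")])
    did_not = rate(*data[(gender, "no drug")])
    print(gender, round(took, 2), round(did_not, 2), took > did_not)

# ...but pooling the two genders reverses the comparison.
drug = [sum(x) for x in zip(data[("men", "drug")], data[("women", "drug")])]
no_drug = [sum(x) for x in zip(data[("men", "no drug")], data[("women", "no drug")])]
print("combined", round(rate(*drug), 2), round(rate(*no_drug), 2),
      rate(*drug) > rate(*no_drug))
```

Running it reproduces the paradox: the drug is ahead in each gender subgroup (93% vs 87% for men, 73% vs 69% for women) yet behind in the pooled population (78% vs 83%). The reversal is pure arithmetic; deciding what to do about it is the causal question the chapter takes up.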