Understanding Biostatistics looks at the fundamentals of biostatistics, using elementary statistics to explore the nature of statistical tests.
This book is intended to complement first-year statistics and biostatistics textbooks. The main focus here is on ideas, rather than on methodological details. Basic concepts are illustrated with representations from history, followed by technical discussions on what different statistical methods really mean. Graphics are used extensively throughout the book in order to introduce mathematical formulae in an accessible way.
This book will be useful for biostatisticians with little mathematical background as well as for those who want to understand the connections between biostatistics and its underlying mathematical issues.
Contents
Cover
Statistics in Practice
Title Page
Copyright
Preface
Chapter 1: Statistics and medical science
1.1 Introduction
1.2 On the Nature of Science
1.3 How the Scientific Method Uses Statistics
1.4 Finding an Outcome Variable to Assess Your Hypothesis
1.5 How we Draw Medical Conclusions from Statistical Results
1.6 A Few Words about Probabilities
1.7 The Need for Honesty: The Multiplicity Issue
1.8 Prespecification and p-Value History
1.9 Adaptive Designs: Controlling the Risks in an Experiment
1.10 The Elusive Concept of Probability
1.11 Comments and Further Reading
References
Chapter 2: Observational studies and the need for clinical trials
2.1 Introduction
2.2 Investigations of Medical Interventions and Risk Factors
2.3 Observational Studies and Confounders
2.4 The Experimental Study
2.5 Population Risks and Individual Risks
2.6 Confounders, Simpson's Paradox and Stratification
2.7 On Incidence and Prevalence in Epidemiology
2.8 Comments and Further Reading
References
Chapter 3: Study design and the bias issue
3.1 Introduction
3.2 What Bias is All About
3.3 The Need for a Representative Sample: On Selection Bias
3.4 Group Comparability and Randomization
3.5 Information Bias in a Cohort Study
3.6 The Study, or Placebo, Effect
3.7 The Curse of Missing Values
3.8 Approaches to Data Analysis: Avoiding Self-Inflicted Bias
3.9 On Meta-Analysis and Publication Bias
3.10 Comments and Further Reading
References
Chapter 4: The anatomy of a statistical test
4.1 Introduction
4.2 Statistical Tests, Medical Diagnosis and Roman Law
4.3 The Risks with Medical Diagnosis
4.4 The Law: a Non-quantitative Analogue
4.5 Risks in Statistical Testing
4.6 Making Statements about a Binomial Parameter
4.7 The Bell-Shaped Error Distribution
4.8 Comments and Further Reading
References
4.A Appendix: The Evolution of the Central Limit Theorem
Chapter 5: Learning about parameters, and some notes on planning
5.1 Introduction
5.2 Test Statistics Described by Parameters
5.4 Statistical Analysis of Two Proportions
5.5 Adjusting for Confounders in the Analysis
5.6 The Power Curve of an Experiment
5.7 Some Confusing Aspects of Power Calculations
5.8 Comments and Further Reading
References
5.A Appendix: Some Technical Comments
Chapter 6: Empirical distribution functions
6.1 Introduction
6.2 How to Describe the Distribution of a Sample
6.3 Describing the Sample: Descriptive Statistics
6.4 Population Distribution Parameters
6.5 Confidence in the CDF and Its Parameters
6.6 Analysis of Paired Data
6.7 Bootstrapping
6.8 Meta-Analysis and Heterogeneity
6.9 Comments and Further Reading
References
6.A Appendix: Some Technical Comments
Chapter 7: Correlation and regression in bivariate distributions
7.1 Introduction
7.2 Bivariate Distributions and Correlation
7.3 On Baseline Corrections and Other Covariates
7.4 Bivariate Gaussian Distributions
7.5 Regression to the Mean
7.6 Statistical Analysis of Bivariate Gaussian Data
7.7 Simultaneous Analysis of Two Binomial Proportions
7.8 Comments and Further Reading
References
7.A Appendix: Some Technical Comments
Chapter 8: How to compare the outcome in two groups
8.1 Introduction
8.2 Simple Models that Compare Two Distributions
8.3 Comparison Done the Horizontal Way
8.4 Analysis Done the Vertical Way
8.5 Some Ways to Compute p-Values
8.6 The Discrete Wilcoxon Test
8.7 The Two-Period Crossover Trial
8.8 Multivariate Analysis and Analysis of Covariance
8.9 Comments and Further Reading
References
8.A Appendix: About U-statistics
Chapter 9: Least squares, linear models and beyond
9.1 Introduction
9.2 The Purpose of Mathematical Models
9.3 Different Ways to do Least Squares
9.4 Logistic Regression, with Variations
9.5 The Two-Step Modeling Approach
9.6 The Effect of Missing Covariates
9.7 The Exponential Family of Distributions
9.8 Generalized Linear Models
9.9 Comments and Further Reading
References
Chapter 10: Analysis of dose response
10.1 Introduction
10.2 Dose–Response Relationship
10.3 Relative Dose Potency and Therapeutic Ratio
10.4 Subject-Specific and Population Averaged Dose Response
10.5 Estimation of the Population Averaged Dose–Response Relationship
10.6 Estimating Subject-Specific Dose Responses
10.7 Comments and Further Reading
References
Chapter 11: Hazards and censored data
11.1 Introduction
11.2 Censored Observations: Incomplete Knowledge
11.3 Hazard Models from a Population Perspective
11.4 The Impact of Competing Risks
11.5 Heterogeneity in Survival Analysis
11.6 Recurrent Events and Frailty
11.7 The Principles Behind the Analysis of Censored Data
11.8 The Kaplan–Meier Estimator of the CDF
11.9 Comments and Further Reading
References
11.A Appendix: On the Large-Sample Approximations of Counting Processes
Chapter 12: From the log-rank test to the Cox proportional hazards model
12.1 Introduction
12.2 Comparing Hazards Between Two Groups
12.3 Nonparametric Tests for Hazards
12.4 Parameter Estimation in Hazard Models
12.5 The Accelerated Failure Time Model
12.6 The Cox Proportional Hazards Model
12.7 On Omitted Covariates and Stratification in the Log-Rank Test
12.8 Comments and Further Reading
References
12.A Appendix: Comments on Interval-Censored Data
Chapter 13: Remarks on some estimation methods
13.1 Introduction
13.2 Estimating Equations and the Robust Variance Estimate
13.3 From Maximum Likelihood Theory to Generalized Estimating Equations
13.4 The Analysis of Recurrent Events
13.5 Defining and Estimating Mixed Effects Models
13.6 Comments and Further Reading
References
13.A Appendix: Formulas for First-Order Bias
Index
Statistics in Practice
Series Advisors
Human and Biological Sciences
Stephen Senn
University of Glasgow, UK
Earth and Environmental Sciences
Marian Scott
University of Glasgow, UK
Industry, Commerce and Finance
Wolfgang Jank
University of Maryland, USA
Statistics in Practice is an important international series of texts which provide detailed coverage of statistical concepts, methods and worked case studies in specific fields of investigation and study.
With sound motivation and many worked practical examples, the books show in down-to-earth terms how to select and use an appropriate range of statistical techniques in a particular practical field within each title's special topic area.
The books provide statistical support for professionals and research workers across a range of employment fields and research environments. Subject areas covered include medicine and pharmaceutics; industry, finance and commerce; public services; the earth and environmental sciences, and so on.
The books also provide support to students studying statistical courses applied to the above areas. The demand for graduates to be equipped for the work environment has led to such courses becoming increasingly prevalent at universities and colleges.
It is our aim to present judiciously chosen and well-written workbooks to meet everyday practical needs. Feedback of views from readers will be most valuable to monitor the success of this aim.
A complete list of titles in this series appears at the end of the volume.
This edition first published 2011
© 2011 John Wiley & Sons, Ltd
Registered office
John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, United Kingdom
For details of our global editorial offices, for customer services and for information about how to apply for permission to reuse the copyright material in this book please see our website at www.wiley.com.
The right of the author to be identified as the author of this work has been asserted in accordance with the Copyright, Designs and Patents Act 1988.
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by the UK Copyright, Designs and Patents Act 1988, without the prior permission of the publisher.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.
Designations used by companies to distinguish their products are often claimed as trademarks. All brand names and product names used in this book are trade names, service marks, trademarks or registered trademarks of their respective owners. The publisher is not associated with any product or vendor mentioned in this book. This publication is designed to provide accurate and authoritative information in regard to the subject matter covered. It is sold on the understanding that the publisher is not engaged in rendering professional services. If professional advice or other expert assistance is required, the services of a competent professional should be sought.
Library of Congress Cataloging-in-Publication Data
Källén, Anders, author.
Understanding biostatistics / Anders Källén, Department of Biostatistics, AstraZeneca, Sweden.
p.; cm.
Includes bibliographical references and index.
ISBN 978-0-470-66636-4 (print) – ISBN 978-1-119-99268-4 (epdf) – ISBN 978-1-119-99267-7 (obook) – ISBN 978-1-119-99350-6 (epub) – ISBN 978-1-119-99351-3 (mobi)
1. Biometry. I. Title.
[DNLM: 1. Biostatistics. 2. Models, Statistical. WA 950]
QH323.5.K35 2011
570.1'5195–dc22
2010051070
A catalogue record for this book is available from the British Library.
Print ISBN: 978-0-470-66636-4
ePDF ISBN: 978-1-119-99268-4
oBook ISBN: 978-1-119-99267-7
ePub ISBN: 978-1-119-99350-6
Mobi ISBN: 978-1-119-99351-3
Preface
The fact that you use biostatistics in your work does not say much about who you are. You may be a physician who has collected some data and is trying to write up a publication, or you may be a theoretical statistician who has been consulted by a physician, who has in turn collected some data and is trying to write up a publication. Whichever you are, or if you are something in between, such as a biostatistician working in a pharmaceutical company, the chances are that your perception of statistics to a large extent is driven by what a particular statistical software package can do. In fact, many books on biostatistics today seem to be more or less extended manuals for some particular statistical software. Often there is only one software package available to you, and the analysis you do on your data is governed by your understanding of that software. This is particularly apparent in the pharmaceutical industry.
However, doing biostatistics is not a technical task in which the ability to run software defines excellence. In fact, using a piece of software without the proper understanding of why you want to employ statistical methods at all, and what these methods actually provide, is bad statistics, however well versed you are in your software manual and code writing. The fundamental ingredient of biostatistics is not a software package, but an understanding of (1) whatever biological/medical aspect the data describe and (2) what it is that statistics actually contributes. Statistics as a science is a subdiscipline of mathematics, and a proper description of it requires mathematical formulas. Hiding this mathematical content within the inner workings of a particular software package must lead to an insufficient understanding of the true nature of the results, and is not beneficial to anyone.
Despite its title, this book is not an introduction to biostatistics aimed at laymen. This book is about the concepts, including the mathematical ones, of the more elementary aspects of biostatistics, as applied to medical problems. There are many excellent texts on medical statistics, but no single book can cover everything, and many of them emphasize the technical aspects of producing an analysis at the expense of the mathematical understanding of how the result is obtained. In this book the emphasis is reversed. These other books have a more systematic treatment of different types of problems and how you obtain the statistical results on different types of data. The present volume differs from others in that it is more concerned with ideas, both the particular aspects concerned with the role of statistics in the scientific process of obtaining evidence, and the mathematical ideas that constitute the basis of the subject. It is not a textbook, but should be seen as complementary to more traditional textbooks; it looks at the subject from a different angle, without being in conflict with them. It uses non-conventional and alternative approaches to some statistical concepts, without changing their meaning in any way. One such difference is that key computational aspects are often replaced by graphs, to illustrate what you are doing instead of how.
The ambition to discuss a wide range of concepts in one book is a challenge. Some concepts are philosophical in nature, others are mathematical, and we try to cover both. Broadly speaking, the book is divided into three major parts. The first part, Chapters 1–5, is concerned with what statistics contributes to medical research, and discusses not only the underlying philosophy but also various issues that are related to the art of drawing conclusions from statistical output. For this we introduce the concept of the confidence function, which helps us obtain both p-values and confidence intervals from graphics alone. In this part of the book we mostly discuss only the simplest of statistical data, in the form of proportions. We need a background against which to hold this discussion, and this simple case contains almost all of the conceptual problems in statistics.
The second part consists of Chapters 6–8, and is about generalizing frequency data to more general data. We emphasize the difference between the observed data and the infinite truth, and how population distributions are estimated by empirical (observed) distributions. We also introduce bivariate distributions, correlation and the important law of nature called ‘regression to the mean’. These chapters show how we can extend the way we compare proportions for two groups to more general data, and in the process emphasize that in order to analyze data, you need to understand what kind of group difference you want to describe. Is it a horizontal shift (like the t-test) or a vertical difference (non-parametric tests)? A general theme here, and elsewhere, is that model parameters are mostly estimated from a natural condition, expressed as an estimating equation, and not really from a probability model. There are intimate connections between these, but this view represents a change to how estimation is discussed in most textbooks on statistics.
The third part, the next four chapters, is more mathematical and consists of two subparts: the first discusses how and why we adjust for explanatory variables in regression models and the other is about what it is that is particular about survival data. There are a few common themes in these chapters, some of which are build-ups from the previous chapters. One such theme is heterogeneity and its impact on what we are doing in our statistical analysis. In biology, patients differ. With some of the most important models, based on Gaussian data, this does not matter much, whereas it may be very important for non-linear models (including the much used logistic model), because there may be a difference between what we think we are doing and what we actually are doing; we may think we are estimating individual risks, when in fact we are estimating population risks, which is something different. In the particular case of survival data we show how understanding the relationship between the population risk and the individual risks leads to the famous Cox proportional hazards model.
The final chapter, Chapter 13, is devoted to a general tie-up of a collection of mathematical ideas spread out over the previous chapters. The theme is estimation, which is discussed from the perspective of estimating equations instead of the more traditional likelihood methods. You can have an estimating equation for a parameter that makes sense, even though it cannot be derived from any appropriate statistical model, and we will discuss how we can still make some meaningful inference.
As the book develops, the type of data discussed grows more and more complicated, and with it the mathematics that is involved. We start with simple data for proportions, progress to general complete univariate data (one data point per individual), move on to consider censored data and end up with repeated measurements. The methods described are developed by analogy and we see, for example, the Wilcoxon test appear in different disguises.
The mathematical complexity increases, more or less monotonically, with chapter number, but also within chapters. On most occasions, if the math becomes too complicated for you to understand the idea, you should move on to the next chapter, which in most cases starts out simpler. The mathematical theory is not described in a coherent and logical way, but as it applies locally to what is primarily a statistical discussion, and it is described in a variety of different ways: to some extent in running text, with more complex matters isolated in stand-alone text boxes, while even more complex aspects are summarized in appendices. These appendices are more like isolated overviews of some piece of mathematics relevant to the chapter in question. All mathematical notation is explained, albeit sometimes rather intuitively, and for some readers it may be wise to ‘hum’ their way through some of the more complicated formulas. In that way it should be possible to read at least half the book with only minor mathematical skills, as long as one is not put off by the equations one comes across. (If you are put off by formulas, you need to get another book.) As already mentioned, at least some of the repetitive (and boring) calculations in statistics have been replaced by an extensive use of graphs. In this way the book attempts to do something that is probably considered almost impossible by most: to simultaneously speak to peasants in peasant language and to the learned in Latin (this is a free translation of an old Swedish saying). But there is a price to pay for targeting a wide audience: we cannot give each individual reader the explanation that he or she would find the most helpful. No one will find every single page useful. Some parts will be only too trivial to some, whereas some parts will be incomprehensible to others. There are therefore different levels at which this book can be read.
If you are medically trained and have worked with statistics, in particular p-values, to some extent, your main hurdle will probably be the mathematics. Your priority should be to understand what things intuitively mean, not only the statistical philosophy but also the different statistical tests. There is no specific predefined level of mathematics, above basic high-school math, that you need to master for most parts of the book. You only need some basic understanding of what it is a formula tries to say in order to grasp the story, and you do not need to understand the details of the different formulas. To understand a mathematical formula can mean different things, and not all formulas need to be understood by everyone. The non-trivial mathematics is essentially only that of differentiation and integration, in particular the latter, which most people in the target readership are expected to have encountered at least to some degree. An integral is essentially only a sum, albeit made up of a vast number of very, very small pieces. If you see an integral, it may well suffice to look upon it as a simple sum and, instead of getting agitated, leave such higher-calculus formulas to be read by those with more mathematical interest and skill.
On a second level, you may be a reader who has had basic training in statistics and is working with biostatistics. Being a professional statistician nowadays does not necessarily mean that you have much mathematical training. Hopefully you can make sense of most of the equations, but you may need to consult a standard textbook or other references for further details.
The third level is when you are well versed in reading mathematical textbooks, deriving formulas and proving theorems. For you, the main reason for reading this book may be to get an introduction to biostatistics in order to see whether you want to learn more about the subject. For you, the lack of mathematical details should not be a problem; most left-out steps are probably easily filled in. At this point I beg the indulgence of any mathematician who has ventured into this book and who sees that the mathematical derivations are not completely rigorous, rigour having been sacrificed for the sake of more intuitive ‘explanation’. It must also be noted that this book is not an introduction to what to consider when you work as a biostatistician. It may be helpful in some respects, but there is most often an initial hurdle to such work, not addressed in this book, which is about being able to translate biological or medical insight and assumptions into the proper statistical question.
These three levels represent a continuum of mathematical skills. But remember that this book is not a textbook. We use mathematics as a tool for description, an essential tool, but we do not transform biostatistics into a mathematical subdiscipline. One aspect of mathematics is notation. Proper and consistent use of mathematical notation is fundamental to mathematics. In this book we do not have such aspirations, and are therefore occasionally slack in our use of notation. Our notation is not consistent between chapters, and sometimes not even within chapters. Notation is always local, optimized for the present discussion, sacrificing consistency throughout. On most occasions we use capital letters to denote stochastic variables and lower case letters to denote observations, but occasionally we let lower case letters denote stochastic variables. Sometimes we are not even explicit about the change from the observation to the corresponding stochastic variable. Another example is that it is not always well defined whether a vector is a column vector or a row vector; it may change state almost within a sentence. If you know the importance of this distinction, you can probably identify which it is from the context. This sacrifice is made because I believe it increases readability.
All chapters end with some suggestions on further reading. These are unpretentious and incomplete listings, and are there to acknowledge some material from which I have derived some inspiration when writing this book.
I am deeply grateful to Professor Stephen Senn for the strong support he has given to the project of finalizing this book and for the invaluable advice he has given in the course of so doing. It has been a long-standing wish of mine to write this book, but without his support it is very doubtful that it would ever have happened. I also want to give credit to all those (to me unknown) providers of information on the internet from which I have borrowed, or stolen, a phrase now and then, because it sounded much better than the Swenglish way I would have written it myself. In addition, I want to thank a number of present or past colleagues at the AstraZeneca site where I have worked for the past 25 years, but which the company has decided to close down at more or less the same time as this book is published, in particular Tore Persson and Tobias Rydén, who, despite conflicting priorities, provided helpful comments. Finally, I also want to thank my father and Olivier Guilbaud for input at earlier stages of this project.
This book was written in LaTeX, and the software used for computations and graphics was the high level matrix programming language GAUSS, distributed by Aptech Systems of Maple Valley, Washington. Graphs were produced using the free software Asymptote.
The Cochrane Collaboration logo in Chapter 3 is reproduced by permission of Cochrane Library.
Anders Källén
Lund, October 2010
Chapter 1
Statistics and Medical Science
1.1 Introduction
Many medical researchers have an ambiguous relationship with statistics. They know they need it to be able to publish their results in prestigious academic journals, as opposed to general public tabloids, but they also think that it unnecessarily complicates what should otherwise be straightforward interpretations. The most frustrated medical researchers can probably be found among those who actually do consult biostatisticians; they only too often experience criticism of the design of the experiment they want to do or, worse, have done – as if the design were any business of the statistician at all.
On the other hand, if you ask biostatisticians, they often consider medical science a contradiction in terms. Tradition, subjectivity and intuitive thinking seem to be such an integral part of the medical way of thinking, they say, that it cannot be called science. And biostatisticians feel fully vindicated by the hype that surrounded the term ‘evidence-based medicine’ during the 1990s. Evidence? Isn't that what research should be all about? Isn't it a bit late to realize that now?
This chapter attempts to explain what statistics actually contributes in clinical research. We will describe, from a bird's-eye perspective, the structure within which statistics operates, and the nature of its results. We will use most of the space to describe the true nature of one particular summary statistic, the p-value. Not because it necessarily is the right thing to compute, but because all workers in biostatistics have encountered it. How it is computed will be discussed in later chapters (though more emphasis will be put on its relative, the confidence interval).
Medicine is not a science per se. It is an engineering application of biology to human disease. Medicine is about diagnosing and treating individual patients in accordance with tradition and established knowledge. It is a highly subjective activity in which the physician uses his own and others' experiences to find a diagnostic fit to the signs and symptoms of a particular patient, in order to identify the appropriate treatment.
For most of its history, medicine has been about individual patients, and about inductive reasoning. Inductive reasoning is when you go from the particular to the general, as in ‘all crows I have seen have been black, therefore all crows are black’. It is the way we, as individuals, learn about reality when we grow up. However, as a foundation of science, induction has in most cases been replaced by the method of falsification, as discussed in Box 1.1. (It is of course not the case that medicine is exclusively about inductive reasoning: a diagnostic fit may well be put to the test in a process of falsification.)
Box 1.1 The Philosophy of Science
What is knowledge about reality and how is it acquired? The first great scholar of nature, Aristotle, divided knowledge into two categories, the original facts (axioms) and the deduced facts. Deduction is done by (deductive) logic in which propositions are derived from one or more premises, following certain rules. It often takes the shape of mathematics. When applied to natural phenomena, the problem lies with the premises. In a deductive science like mathematics there is a process to identify them, but in empirical sciences their nature is less obvious. So how do we identify them?
Early thinkers promoted the idea of induction. When repeated observations of nature fall into some pattern in the mind of the observer, they are said to induce a suggestion of a more general fact. This idea of induction was raised to an alternative form of logic, inductive logic, which forced a fact from multiple observations, a view which was vigorously criticized by David Hume in the mid-eighteenth century.
Hume's argument started with an analysis of causal relations, which he claimed are established exclusively by induction, never by deduction, and rest on an implicit assumption that unobserved objects resemble observed ones. The justification of the inductive process thereby becomes a circular argument, Hume argued. This was referred to as ‘Hume's dilemma’, something that upset Immanuel Kant so much that he referred to the problem of induction as the ‘scandal of philosophy’. This does not mean that if we have always observed something in a particular situation, we should not expect the same to happen next time. It means that it cannot be an absolute fact, and instead we are making a prediction, with some degree of confidence.
Two centuries later Karl Popper introduced refutationism. According to this there are no empirical, absolute facts and science does not rely on induction, but exclusively on deduction. We state working hypotheses about nature, the validity of which we test in experiments. Once refuted, a modified hypothesis is formulated and put to the test. And so on. This infinite cycle of conjecture and refutation is the true nature of science, according to Popper.
As an example, used by Hume, ‘No amount of observations of white swans can allow the inference that all swans are white, but the observation of a single black swan is sufficient to refute that conclusion'. It was a long-held belief in Europe that all swans were white, until Australia was discovered, and with it Cygnus atratus, the black swan.
Inductionism and refutationism both have their counterparts in the philosophy of statistics. In the Bayesian approach to statistics, which is inductive, we start with a summary of what we believe and update that according to experimental results. The frequentist approach, on the other hand, is one of refuting hypotheses. Each case is unique and the data of the particular experiment settle that case alone.
Another peculiarity of medicine is ethics. Medical researchers are very careful not to put any patients at risk in obtaining the information they seek. This is often a complicating factor in clinical research when it interferes with the research objective of a clinical trial. For example, in drug development, at one important stage we need to show that a particular drug is effective. The scientific way to do this is by carrying out a clinical trial in which the response to the drug is compared to the response when no treatment is given. Everything else should be the same. However, in the presence of other effective drugs, it may not at all be ethical to withhold a useful drug for the sole reason that you want to demonstrate that a new drug is also effective.
Finally, there is the general problem of why it appears to be so hard for many physicians to understand basic statistical reasoning: what conclusions one may draw and why. To be honest, part of the reason why statistics is so hard to understand for non-statisticians is probably that statisticians have not figured it out for themselves. There is not one statistical philosophy that forms the basis for statistical reasoning, there are a number of them: frequentists versus Bayesians, Fisher's approach versus the Neyman–Pearson view. If statisticians cannot figure it out, how can they expect their customers to be able to do so?
These are some properties of medical researchers that statisticians should be aware of. Of course, they are not true statements about individual medics. They are statements about the group of medics, and statements about groups are what statistics is all about. This will be our starting point in Chapter 2 when we initiate a more serious discussion about the design of clinical trials. But before we do that we need to get a basic understanding of what it is statistics is trying to do. This journey will start with an attempt to describe the role of statistics within science.
1.2 On the Nature of Science
For almost all of the history of mankind the approach to health has been governed by faith, superstition and magic, often expressed as witchcraft. This has gradually changed since the period of the Enlightenment in the eighteenth century, so that doctors can no longer make empty assertions and quacks can no longer sell useless cures with impunity. The factor that has changed this is what we call science.
But what is science? We know what it does: it helps us understand and make sense of the world around us. But that does not define science; religion has served much the same purpose for most of mankind's history. Science is often divided into three subsets: natural sciences (the study of natural phenomena), social sciences (the study of human behavior and society), and mathematics (including statistics). The first two of these are empirical sciences, in which knowledge is based on observable phenomena, whereas mathematics is a deductive science in which new knowledge is deduced from previous knowledge. There is also applied science, engineering, which is the application of scientific research to specific human needs. The use of statistics in medical research is an example, as is medicine itself.
The science of mathematics has a specific structure. Starting from a basic set of definitions and assumptions (usually called axioms), theorems are formulated and proved. A theorem constitutes a mathematical statement, and its proof is a logical chain of applications of previously proved theorems. A collection of interlinked, proved, mathematical theorems makes up a mathematical theory of something. The empirical sciences are similar to this in many respects, but differ fundamentally in others. Corresponding to an unproved mathematical theorem is a hypothesis about nature. The mathematical proof corresponds to an experiment that tests the hypothesis. A theory, in the context of empirical science, consists of a number of not yet refuted hypotheses which are bound together by some common theme.
What we think we know about the world is very much the result of an inductive process, derived from experiences and learning. The difference between science and religion is not about content, but about the way knowledge is obtained. A statement can only be a scientific statement if it can be tested, and science is qualified by the extent to which its predictions are borne out; when a model fails a test it has to be modified. Science is therefore not static, it is dynamic. Old ‘truths’ are replaced by new ‘truths’. It is like an enormous jigsaw puzzle in which pieces are constantly replaced and added. Sometimes replacement is with a set of new pieces that give a clearer picture of the overall puzzle, sometimes a piece turns out to be wrong and needs to be replaced by a new, fundamentally different, one. Sometimes we need to tear up an entire part of the jigsaw puzzle and rebuild it. The basic requirement of the individual pieces in this jigsaw puzzle is that each one addresses a question that can be tested for validity. Science is a humble practice; it tells us that we know nothing unless we have evidence and that our state of knowledge must always be open to scrutiny and challenge.
The fundamental difference between empirical sciences and mathematics is that a mathematical proof proves the hypothesis (i.e., theorem), whereas in empirical sciences experiments are designed to disprove the hypothesis. A particular hypothesis can be refuted by an observation that is inconsistent with the hypothesis. But the hypothesis cannot be proved by experiment – all we can say is that the outcome of the experiment is consistent with it.
Example 1.1
Like most people before modern times, the Greeks thought that the earth was the center of everything. They identified seven moving objects in heaven – five planets, the sun and the moon – and Ptolemy worked out a very elaborate model for how they move, using only circles and circles moving on circles (epicycles). The result was an explanation of the heavens (planets, at least) that fulfilled all the criteria of science. They made predictions that could be tested, and these never failed. When the idea of putting the sun at the center of this system emerged, it was not found to work better in any way; it did not produce better predictions than the Greek model. It was not until Johannes Kepler managed to identify his famous three laws that astronomers actually got a sun-centered description of the heavens that even matched the Greek version. This meant that there were two competing models with no one really ahead.
However, this changed with Isaac Newton. With his law of gravitation the science of the heavens took a gigantic leap forward. In one go, he reduced the complex behavior of the planets to a few fundamental and universal laws. When these laws were applied to the planets they not only predicted their movements to any precision measurable, they also allowed a new planet to be discovered (Neptune, in 1846). So many experiments were conducted over hundreds of years with outcomes consistent with Newton's theory, that it was very tempting to consider it a true fact. However, during the twentieth century some astronomical observations were made that were inconsistent with the mathematical predictions of the theory, and it is today superseded by Albert Einstein's theory of general relativity in cosmology. As a theory though, Newton's theory of gravitation is still good enough to be used for all everyday activities involving gravitation, such as sending people to the moon.
This example illustrates an important point about science which must be kept in mind, namely that ‘all models are wrong, but some are useful’, a quotation often attributed to the English statistician George Box. Much of the success of Newton's physics was due to the fact that it was expressed in mathematical terms. As a general rule scientific theory seems to be least controversial when it can be expressed in the form of mathematical relationships. This is partly because this requires a rather well-defined logical foundation to build on, and partly because mathematics provides the logical tool to derive the correct predictions.
That one theory replaces another, sometimes with fundamental effects, is common in biology, not least in medicine. (On my bookshelf there are three books on immunology, published in 1976, 1994 and 2006, respectively. It is hard to see that they are about the same science. On the other hand, there is also a course in basic physics from 1950, which could serve well as present-day teaching material – in terms of content, if not style.) We must always consider a theory to be no more than a set of hypotheses that have not yet been falsified. In fact, mathematics also has an element of this, since a theorem that has been proved has been so only to the extent that no one has yet found a fault in the proof. There are quite a few examples of mathematical theorems that have been held to be true for a period of time until someone found a mistake in their proofs.
1.3 How the Scientific Method Uses Statistics
To produce objective knowledge is difficult, since our intuition has a tendency to see patterns where there is only random noise and to see causal relationships where there are none. When looking for evidence we also have a tendency, as a species, to overvalue information that confirms our hypothesis, and we seek out such confirmatory information. When we encounter new evidence, the quality of it is often assessed against the background of our working assumption, or prior belief, leading to bias in interpretation (and scientific disputes).
To overcome these human shortcomings the so-called scientific method evolved. This is a method which helps us obtain and assess knowledge from data in an objective way. The scientific method seeks to explain nature in a reproducible way, and to use these explanations to make useful predictions. It can be crudely described in the following steps:
1. Formulate a hypothesis.
2. Design and execute an experiment which tests the hypothesis.
3. Based on the outcome of the experiment, determine if we should reject the hypothesis.
To gain acceptance for one's conclusion it is critical that all the details of the research are made available for others to judge their validity, so-called peer review. Not only the results, but also the experimental setup and the data that drive the experimenter to his conclusions. If such details are not provided, others cannot judge to what extent they would agree with the conclusions, and it is not possible to independently repeat the experiment. As the physicist Richard Feynman wrote in a famous essay, condemning what he called ‘cargo cult science’,
if you are doing an experiment, you should report everything that you think might make it invalid – not only what you think is right about it: other causes that could possibly explain your results; and things you thought of that you've eliminated by some other experiment, and how they worked – to make sure that the other fellow can tell if they have been eliminated.
A key part of the scientific method is the design, execution and analysis of an experiment that tests the hypothesis. This may employ mathematical modeling in some way, as when one uses statistical methods. The first step in making a mathematical model related to the hypothesis is to quantify some entities that make it possible to do calculations on numbers. These quantities must reflect the hypothesis under investigation, because it is the analysis of them that will provide us with a conclusion. We call a quantity that is to be analyzed in an experiment an outcome measure, because it is a quantitative measure of the outcome of the experiment. After having decided on the outcome measure, we design our experiment so that we obtain appropriate data. The statistical analysis subsequently performed provides us with what is essentially only a summary presentation of the data, in a form that is appropriate to draw conclusions from.
So, for a hypothesis that is going to be tested by invoking statistics, the scientific method can be expanded into the following steps:
1. Formulate a hypothesis.
2. Define an outcome measure and reformulate the hypothesis in terms of it. This involves defining a statistical model for the data. This version of the hypothesis is called the null hypothesis and is formulated so that it describes what we want to reject.
3. Design and perform an experiment which collects data on this outcome measure.
4. Compute statistical summaries of the data.
5. Draw the appropriate conclusion from the statistical summaries.
When the results are written up as a publication, this should contain an appropriate description of the statistical methods used. Otherwise it may be impossible for peers to judge the validity of the conclusions reached.
The statistical part of the experiment starts with the data and a model for what those data represent. From there onwards it is like a machine that produces a set of summaries of the data that should be helpful in interpreting the outcome of the experiment. For confirmatory purposes, rightly or wrongly, the summary statistic most used is the p-value. It is one particular transformation of the data, with a particular interpretation under the model assumption and the null hypothesis. It measures the probability of the result we observed, or a more extreme one, given that the null hypothesis is true. Thus a p-value is an indirect measure of evidence against the null hypothesis, such that the smaller the value, the greater the evidence. (Often more than one model can be applied to any given set of data so we can derive different p-values for a given hypothesis and set of data – as in the case of parametric versus non-parametric tests.)
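To make that definition concrete, here is a minimal sketch, not taken from the book, that computes such a p-value from first principles. The setting and all numbers are illustrative assumptions: 36 responders out of 50 patients, tested one-sidedly against a null hypothesis response probability of 0.5.

```python
# Hypothetical example: observed 36 responders out of 50 patients.
# Null hypothesis: the true response probability is p0 = 0.5.
# The p-value is the probability, under the null, of the observed result or a more extreme one.
from math import comb

def binomial_upper_tail_p(successes, n, p0):
    """One-sided p-value: P(X >= successes) when X ~ Binomial(n, p0)."""
    return sum(comb(n, k) * p0**k * (1 - p0)**(n - k) for k in range(successes, n + 1))

p_value = binomial_upper_tail_p(36, 50, 0.5)
print(f"one-sided p-value = {p_value:.4f}")  # a small value, well below 0.05: strong evidence against p0 = 0.5
```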
Note that, as a consequence of the discussion above, the conclusion from the experiment is either that we consider ourselves to have proved the null hypothesis wrong, or that we have failed to prove it wrong. Never is the null hypothesis proved to be true. To understand why, look at the hypothesis ‘there are no fish in this lake’ which we may want to test by going fishing. There are two possible outcomes of this test: either you get a fish or you do not. If you catch a fish you know there is (or was) fish in the lake and have disproved the hypothesis. If you do not get any fish, this does not prove anything: it may be because there were no fish in the lake, or it may be because you were unlucky. If you had fished for longer, you might have had a catch and therefore rejected the null hypothesis. There is a saying that captures this and is worth keeping in mind: ‘Absence of proof is not proof of absence.’ Failure to reject a hypothesis does not prove anything, but it may, depending on the nature and quality of the experiment, increase one's confidence in the validity of the null hypothesis – that it to some degree reflects the truth. As such it may be part of a theory of nature, which is held true until data emerge that disprove it.
Failure to understand the difference between not being able to provide enough evidence to reject the null hypothesis and providing evidence for the null hypothesis is at the root of the most important misuse of statistics in medical research.
Example 1.2
In the report of a study on depression with three treatments – no treatment (placebo), a standard treatment, B, and a new treatment, A – the authors made the following claim: ‘A is efficacious in depression and the effect occurs earlier than for B.’ The data underlying the second part of this claim refer to comparisons of A and B individually versus placebo, using data obtained after one week. For A, the corresponding p-value was 0.023, whereas for B it was 0.16. Thus, the argument went, A was ‘statistically significant’, whereas B was not, so A must be better than B.
This is, however, a flawed argument. To make claims about the relative merits of A and B, these must be directly compared. In this case a crude analysis of the data tells us what the result should be. In fact, the first p-value was a result of a mean difference (versus placebo) of 1.27 with a standard error of 0.56, whereas the second p-value comes from a mean difference of 0.79 with the same standard error. The mean difference between A and B is therefore 0.48, and since we should probably have about the same standard error as above, this gives a p-value of about 0.40, which is far from evidence for a difference.
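The crude analysis described above can be reproduced in a few lines. This is only an illustrative sketch, assuming a normal approximation in which each mean difference is divided by its standard error; the mean differences and the standard error are the ones quoted in the example.

```python
# Reconstruction of the crude calculation in Example 1.2 (normal approximation assumed).
from math import erfc, sqrt

def two_sided_p(mean_diff, se):
    """Two-sided p-value for z = mean_diff / se against a standard normal reference."""
    z = abs(mean_diff) / se
    return erfc(z / sqrt(2))  # equals 2 * P(Z > |z|) for a standard normal Z

print(round(two_sided_p(1.27, 0.56), 3))  # A vs placebo: about 0.023
print(round(two_sided_p(0.79, 0.56), 3))  # B vs placebo: about 0.16
print(round(two_sided_p(0.48, 0.56), 3))  # A vs B, same SE assumed: about 0.39 - no evidence of a difference
```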
The mistake made in this example is a recurrent one in medical research. It occurs when a statistical test, accompanied by its declaration of ‘significant’ or ‘not significant’, is used to force a decision on the truth or not of the null hypothesis.
1.4 Finding an Outcome Variable to Assess Your Hypothesis
The first step in the expanded version of the scientific method, to reformulate the hypothesis in terms of a specific outcome variable, may be simple, but need not be. It is simple if your hypothesis is already formulated in terms of it, as when we want to claim that women on average are shorter than men. The outcome variable then is individual height. It is more difficult if we want to prove that a certain drug improves asthma in patients with that disease. What do we mean by improvement in asthma? Improvement in the lung function? Fewer asthma symptoms? There are many ways we can assess improvement in asthma, and we need to be more specific so that we know what data to collect for the analysis. Assume that we want to focus on lung function. There are also many ways in which we can measure lung function: the simplest would be to ask the patients for a subjective assessment of their lung function, though usually more objective measures are used.
Suppose that we settle for one particular objective lung function measurement, the forced expiratory volume in one second, FEV1. We may want to prove that a new drug improves the patient's asthma by formulating the null hypothesis to read that the drug does not affect FEV1. If we subsequently carry out an experiment and from the analysis of it conclude that there is an improvement in lung function as measured by FEV1, we have disproved the null hypothesis.
The question is what we have proved. The statistical result relates to FEV1. How much can we generalize from this and actually claim that the asthma has been improved? This is a non-trivial issue and one which must be addressed when we decide on which outcome measure to use to reflect our original hypothesis.
Quality of life is measured by having patients fill in a particular questionnaire with a list of questions. The end result we want from the analysis of such a questionnaire is a simple statement: ‘The quality of life of the patients is improved'. In order to achieve that, the scores on individual questions in the questionnaire are typically reduced to a summary number, which is the outcome variable for the statistical analysis. The result may be that there is an increase in this outcome variable when the treatment is given. However, the term ‘quality of life’ has a meaning to most people, and the question is whether an increase in the summary variable corresponds to an increase in the quality of life of the patients, as perceived by the patients. This question necessitates an independent process, in which it is shown that an increase in the derived outcome variable can in fact be interpreted as an improvement of quality of life – a validation of the questionnaire.
The IQ test constitutes a well-known example. IQ is measured as the result of specific IQ tests. If we show that two groups have different outcomes on IQ tests, can we then deduce that one group is more intelligent than the other group? It depends on what we mean by intelligence. If we mean precisely what the IQ test measures, the answer is yes. If we have an independent opinion of what intelligence should mean, we first have to validate that this is captured correctly by the IQ test.
Returning to the measurement of FEV1, for a claim of improvement in asthma, lung function is such an important aspect of asthma that it is reasonable to say that improved lung function means that the asthma has improved (though many would require additional support from data that measure asthma symptoms). However, if we fail to show an effect on FEV1 it does not follow by logical necessity that no other aspect of the asthma has improved. So we deliberately choose one aspect of the disease to gamble on, and if we win we have succeeded. If we fail, we may not be any wiser.
1.5 How we Draw Medical Conclusions from Statistical Results
Before we actually come to the subject of this section we need to consider the ultimate purpose of science, which is to make predictions about the future. What we see in a particular study is an observation. What we want from the study is more than that: we want statements that are helpful when we need to make decisions in the future. We want to use the study to predict what will be seen in a new, similar study. It is an observation that in a particular study 60% of males, but only 40% of females, responded to a treatment. Unless your sample is very large it is not reasonable to generalize this to a claim that 60% of males and 40% of females will respond to the drug in the target population. It may be the best predictor we have at this point in time, but that is not the same thing. What we actually can claim depends on the statistical summary of the data. A more cautious claim may be that in general males respond better to the treatment than females. To substantiate this claim we analyze the data under the null hypothesis that there is no difference in the response rates for males and females.
Suppose next that we want to show that some intervention prolongs life after a cancer diagnosis. Our null hypothesis is that it does not. We assume that we have conducted an appropriate experiment (clinical trial) and that the statistical analysis provides us with p = 0.015. This means that, if there is no effect at all of the intervention, a result as extreme as that found in the experiment is so unlikely that it should occur in only 1.5% of all such clinical trials. This is our confidence in the null hypothesis (not to be confused with the probability of the null hypothesis) after we have performed the experiment.
That does not prove that the intervention is effective. No statistical analysis proves that something is effective. The proper question is: does this p-value provide sufficient support to justify our starting to act as if it is effective? The answer to that question depends on what confidence is required from this particular experiment for a particular action. What are the consequences if I decide that it is effective? A few possibilities are:
• I get a license for a new drug, and can earn a lot of money;
• I get a paper published;
• I want to take this drug myself, since I have been diagnosed with the cancer in question.

In the first case it is really not for me to decide what confidence level is required. It is the licensing authority that needs to be assured. Their problem is on the one hand that they want new, effective drugs on the market, but on the other hand that they do not want useless drugs there. Since all statistics come with an uncertainty, their problem is one of error control. They must make a decision that safeguards the general public from useless drugs, but at the same time they must not make it impossible to get new drugs licensed. This is a balancing act, and they do it by setting a significance level α such that if your p-value is smaller than α, they agree that the drug is proved to be effective. The significance level defines the proportion of truly useless drugs that will accidentally be approved and therefore the level of risk the licensing agency is prepared to take (if we include almost useless drugs as well, the proportion is higher). Presently one may infer what significance level the US licensing authority, the Food and Drug Administration (FDA), has in effect set when it comes to proving efficacy for their market, for reasons we will come back to.
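The error-control idea can be illustrated with a small simulation, not from the book: a long series of truly useless drugs is tested at significance level α = 0.05, and roughly a proportion α of them nevertheless come out ‘proved effective’. The group size, the known unit variance and the simple normal test are all illustrative assumptions.

```python
# Simulated licensing decisions for drugs that have no effect at all.
import random
from math import erfc, sqrt

random.seed(1)
alpha, n_per_group, n_trials = 0.05, 100, 2000

false_approvals = 0
for _ in range(n_trials):
    # A truly useless drug: both groups are drawn from the same distribution.
    placebo = [random.gauss(0.0, 1.0) for _ in range(n_per_group)]
    drug = [random.gauss(0.0, 1.0) for _ in range(n_per_group)]
    mean_diff = sum(drug) / n_per_group - sum(placebo) / n_per_group
    se = sqrt(2.0 / n_per_group)               # standard error of the mean difference (unit variance assumed known)
    p = erfc(abs(mean_diff) / (se * sqrt(2)))  # two-sided p-value from a normal test
    if p < alpha:
        false_approvals += 1

print(false_approvals / n_trials)  # close to alpha = 0.05: the proportion of useless drugs accidentally approved
```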
The picture is similar if you want to publish a paper. In general there is an agreed significance level of 5% (two-sided) for that process. If your p-value is less than 5% you can publish a paper and claim that the intervention works. But that does not prove that the intervention works, only that you can get a paper published that claims so. The significance level used by a particular journal is typically not explicitly spelt out, since a remark by the eminent statistician R.A. Fisher led to the introduction of the golden threshold at 5% a long time ago (see Box 1.2), making it unnecessary to argue about it. That is really its only virtue – there is no scientific reason why it should not be 6% or 0.1%. In relation to this particular 5% significance level there has also been introduced a particular jargon, the term 'statistical significance', which is discussed in some detail in Box 1.3.
Box 1.2 The Origin of the 5% Rule
The 5% significance rule seems to be a consequence of the following passage in the book Statistical Methods for Research Workers by the inventor of the p-value, Ronald Aylmer Fisher:
in practice we do not always want to know the exact value of P for any observed χ², but, in the first place, whether or not the observed value is open to suspicion. If P is between .1 and .9 there is certainly no reason to suspect the hypothesis tested. If it is below .02 it is strongly indicated that the hypothesis fails to account for the whole of the facts. . . . A value of χ² exceeding the 5 per cent. point is seldom to be disregarded.
It is important that in Fisher's view a small p-value does not force a decision, it only warrants a further investigation. Larger p-values are not worth investigating (note that he does not actually say anything about the values in between). On another occasion he wrote:
This is an arbitrary, but convenient, level of significance for the practical investigator, but it does not mean that he allows himself to be deceived once in every twenty experiments. The test of significance only tells him what to ignore, namely all experiments in which significant results are not obtained.
Nowadays we use the 5% rule in a different way. We use it to force decisions in single studies, referring to an error-rate control mechanism on the ensemble of studies, following a philosophy introduced by Jerzy Neyman and Egon Pearson (see Box 1.3).
Box 1.3 The Meaning of the Term ‘Statistical Significance’
There are two alternative ways of looking at p-values and significance levels which are related to the philosophy of science. Here is a brief outline of these positions.
The p-value builds confidence. R.A. Fisher originally used p-values purely as a measure of inductive evidence against the null hypothesis. Once the experiment is done there is only one hypothesis, the null, and the p-value measures our confidence in it. There is no need for the significance level; all we need to do is to use the p-value as a measure of our confidence that it is correct to reject the null hypothesis. By presenting the p-value we allow any readers of our results to judge for themselves whether the test has provided enough confidence in the conclusion.
The significance level defines a decision rule. The Neyman–Pearson school instead emphasizes statistical hypothesis testing as a mechanism for making decisions and guiding behavior. To work properly this setup requires two hypotheses to choose between, so the Neyman–Pearson school introduces an alternative hypothesis, in addition to the null hypothesis. A decision between these is then forced, using the test and a predefined significance level α. The alternative is accepted if p < α, otherwise the null hypothesis is accepted. Neyman–Pearson statistical testing is aimed at error minimization, and is not concerned with gathering evidence. Furthermore, this error minimization is of the long-run variety, which means that, unlike Fisher's approach, Neyman–Pearson theory does not apply to an individual study.
In a pure Neyman–Pearson decision approach the exact p-value is irrelevant, and should not be reported at all. When formulated as ‘reject the null hypothesis when p < α, accept it otherwise', only the Neyman–Pearson claim of 100α% false rejections of the null hypothesis with ongoing sampling is valid. This is because α is the probability of a set of potential outcomes that may fall anywhere in the tail area of the distribution under the null hypothesis, and we cannot know ahead of time which of these particular outcomes will occur. That is not the same as the tail area that defines the p-value, which is known only after the outcome is observed.
This dualism between Fisher's inductive approach to p-values and the error control of Neyman and Pearson is really about what p-values imply, not what they are. For Fisher it is about inductive learning, for Neyman and Pearson it is about decision making. For Fisher, the Neyman–Pearson view is not relevant to science, since one does not repeat the same experiment over and over again. What researchers actually do is one experiment, from which they should communicate information, not force a yes–no decision.
In the last situation in the bullet list above, the case where you had that particular cancer yourself, you really decide your own significance level. It may be very high, depending on how desperate you are. A significance level of 20% may be good enough for you. It may depend on side-effects and alternative options.
A situation where the interpretation of the p-value as a measure of confidence, and its relation to what to do next, becomes apparent is in drug development. Clinical drug development is a staged process in which we sequentially try to answer more and more complex questions such as:
• Is the drug effective at all?
• What is the appropriate dose for this drug?
• Is the appropriate dose effective enough to get the drug licensed?

The monetary investment that needs to be made in order to answer these questions is usually very different. Moreover, the more confidence we want to have in the answer to a particular question, the more money it costs to get that confidence, because larger studies need to be performed. The decision on what confidence we need that a drug is effective at all before conducting a dose-finding study could then depend on the cost of the latter. Or, rather, on a balance between that cost and the loss in time to market, which in itself is a cost. The bottom line is that it may be strategically right for a pharmaceutical company to do a small study which can only produce limited confidence in efficacy, say a one-sided p-value at 10%, before gambling with a larger dose-range study, in order to save time.
In view of the present avalanche of statistical p-values pouring over us – by one estimate some 15 million medical articles have been published to date, with 5000 journals around the world constantly adding to that number – a strict adherence to a rule such as 'if p < 0.05 I can say I have an effect, otherwise not' is a bit primitive, to say the least. Assume (probably incorrectly) that all statistical analyses are done in a correct manner. Then 5% of all investigated cases where there is no true effect or association are out there as false effect or relationship claims. We cannot, using statistics, guarantee that there are no false ‘truths’ in circulation, and this level may be appropriate. But most hypotheses tested are part of a bigger context, a theory. If the result we present is a trivial modification of, or an add-on to, what is already known, we may need less assurance than if the result may set an earthquake in motion and have a major impact on society. Ultimately the judgement about the correctness of the null hypothesis will depend on the existence of other data and the relative plausibility of the alternatives.
