Now in its fourth edition, Medical Statistics at a Glance is a concise and accessible introduction to this complex subject. It provides clear instruction on how to apply commonly used statistical procedures in an easy-to-read, comprehensive and relevant volume. This new edition continues to be the ideal introductory manual and reference guide to medical statistics, an invaluable companion for statistics lectures and a very useful revision aid.
This new edition of Medical Statistics at a Glance:
Medical Statistics at a Glance is a must-have text for undergraduate and post-graduate medical students, medical researchers and biomedical and pharmaceutical professionals.
Number of pages: 636
Year of publication: 2019
This title is also available as an e-book. For more details, please see www.wiley.com/buy/9781119167815
Medical Statistics at a Glance Workbook
A comprehensive workbook containing a variety of examples and exercises, complete with model answers, designed to support your learning and revision.
Fully cross-referenced to Medical Statistics at a Glance, this workbook includes:
Over 80 MCQs, each testing knowledge of a single statistical concept or aspect of study interpretation
29 structured questions to explore in greater depth several statistical techniques or principles
Full appraisals of two published papers to demonstrate the use of templates for clinical trials and observational studies
Detailed step-by-step analyses of two substantial data sets (also available at www.medstatsaag.com) to demonstrate the application of statistical procedures to real-life research
Medical Statistics at a Glance Workbook is the ideal resource to improve statistical knowledge together with your analytical and interpretational skills.
Aviva Petrie
Honorary Associate Professor of Biostatistics
UCL Eastman Dental Institute
London, UK
Caroline Sabin
Professor of Medical Statistics and Epidemiology
Institute for Global Health
UCL
London, UK
This edition first published 2020
© 2020 Aviva Petrie and Caroline Sabin
Edition History
Aviva Petrie and Caroline Sabin (1e, 2000; 2e, 2005; 3e, 2009).
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.
The right of Aviva Petrie and Caroline Sabin to be identified as the authors of this work has been asserted in accordance with law.
Registered Office(s)
John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA
John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK
Editorial Office
9600 Garsington Road, Oxford, OX4 2DQ, UK
For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.
Wiley also publishes its books in a variety of electronic formats and by print-on-demand. Some content that appears in standard print versions of this book may not be available in other formats.
Limit of Liability/Disclaimer of Warranty
The contents of this work are intended to further general scientific research, understanding, and discussion only and are not intended and should not be relied upon as recommending or promoting scientific method, diagnosis, or treatment by physicians for any particular patient. In view of ongoing research, equipment modifications, changes in governmental regulations, and the constant flow of information relating to the use of medicines, equipment, and devices, the reader is urged to review and evaluate the information provided in the package insert or instructions for each medicine, equipment, or device for, among other things, any changes in the instructions or indication of usage and for added warnings and precautions. While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials or promotional statements for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. 
Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
Library of Congress Cataloging-in-Publication Data
Names: Petrie, Aviva, author. | Sabin, Caroline, author.
Title: Medical statistics at a glance / Aviva Petrie, Caroline Sabin.
Description: Fourth edition. | Hoboken, NJ : Wiley-Blackwell, 2020. |
Includes bibliographical references and index. |
Identifiers: LCCN 2019008181 (print) | LCCN 2019008704 (ebook) | ISBN
9781119167822 (Adobe PDF) | ISBN 9781119167839 (ePub) | ISBN 9781119167815
(pbk.)
Subjects: | MESH: Statistics as Topic | Research Design
Classification: LCC R853.S7 (ebook) | LCC R853.S7 (print) | NLM WA 950 | DDC
610.72/7—dc23
LC record available at https://lccn.loc.gov/2019008181
Cover Design: Wiley
Cover Image: © Somyot Techapuwapat/EyeEm/Getty Images
Cover
Also available to buy!
Preface
Part 1 Handling data
1 Types of data
Data and statistics
Categorical (qualitative) data
Numerical (quantitative) data
Distinguishing between data types
Derived data
Censored data
2 Data entry
Formats for data entry
Planning data entry
Categorical data
Numerical data
Multiple forms per patient
Problems with dates and times
Coding missing values
3 Error checking and outliers
Typing errors
Error checking
Handling missing data
Outliers
References
4 Displaying data diagrammatically
One variable
Two variables
Identifying outliers using graphical methods
The use of connecting lines in diagrams
5 Describing data: the ‘average’
Summarizing data
The arithmetic mean
The median
The mode
The geometric mean
The weighted mean
6 Describing data: the ‘spread’
Summarizing data
The range
Ranges derived from percentiles
The standard deviation
Variation within- and between-subjects
7 Theoretical distributions: the Normal distribution
Understanding probability
The rules of probability
Probability distributions: the theory
The Normal (Gaussian) distribution
The Standard Normal distribution
8 Theoretical distributions: other distributions
Some words of comfort
More continuous probability distributions
Discrete probability distributions
9 Transformations
Why transform?
How do we transform?
Typical transformations
Part 2 Sampling and estimation
10 Sampling and sampling distributions
Why do we sample?
Obtaining a representative sample
Point estimates
Sampling variation
Sampling distribution of the mean
Interpreting standard errors
SD or SEM?
Sampling distribution of the proportion
11 Confidence intervals
Confidence interval for the mean
Confidence interval for the proportion
Interpretation of confidence intervals
Degrees of freedom
Bootstrapping and jackknifing
Reference
Part 3 Study design
12 Study design I
Experimental or observational studies
Defining the unit of observation
Multicentre studies
Assessing causality
Cross-sectional or longitudinal studies
Controls
Bias
Reference
13 Study design II
Variation
Replication
Sample size
Particular study designs
Choosing an appropriate study endpoint
References
14 Clinical trials
Treatment comparisons
Primary and secondary endpoints
Subgroup analyses
Treatment allocation
Sequential trials
Blinding or masking
Patient issues
The protocol
References
15 Cohort studies
Selection of cohorts
Follow-up of individuals
Information on outcomes and exposures
Analysis of cohort studies
Advantages of cohort studies
Disadvantages of cohort studies
Study management
Clinical cohorts
16 Case–control studies
Selection of cases
Selection of controls
Identification of risk factors
Matching
Analysis of unmatched or group-matched case–control studies
Analysis of individually matched case–control studies
Advantages of case–control studies
Disadvantages of case–control studies
References
Part 4 Hypothesis testing
17 Hypothesis testing
Defining the null and alternative hypotheses
Obtaining the test statistic
Obtaining the P-value
Using the P-value
Non-parametric tests
Which test?
Hypothesis tests versus confidence intervals
Equivalence and non-inferiority trials
References
18 Errors in hypothesis testing
Making a decision
Making the wrong decision
Power and related factors
Multiple hypothesis testing
References
Part 5 Basic techniques for analysing data
19 Numerical data: a single group
The problem
The one-sample t-test
The sign test
20 Numerical data: two related groups
The problem
The paired t-test
The Wilcoxon signed ranks test
Reference
21 Numerical data: two unrelated groups
The problem
The unpaired (two-sample) t-test
The Wilcoxon rank sum (two-sample) test
Reference
22 Numerical data: more than two groups
The problem
One-way analysis of variance
The Kruskal–Wallis test
References
23 Categorical data: a single proportion
The problem
The test of a single proportion
The sign test applied to a proportion
24 Categorical data: two proportions
The problems
Independent groups: the Chi-squared test
Related groups: McNemar’s test
Reference
25 Categorical data: more than two categories
Chi-squared test: large contingency tables
Chi-squared test for trend
26 Correlation
Pearson correlation coefficient
Spearman’s rank correlation coefficient
27 The theory of linear regression
What is linear regression?
The regression line
Method of least squares
Assumptions
Analysis of variance table
Regression to the mean
28 Performing a linear regression analysis
The linear regression line
Drawing the line
Checking the assumptions
Failure to satisfy the assumptions
Outliers and influential points
Assessing goodness of fit
Investigating the slope
Using the line for prediction
Improving the interpretation of the model
29 Multiple linear regression
What is it?
Why do it?
Assumptions
Categorical explanatory variables
Analysis of covariance
Choice of explanatory variables
Analysis
Outliers and influential points
Reference
30 Binary outcomes and logistic regression
Reasoning
The logistic regression equation
The explanatory variables
Assessing the adequacy of the model
Comparing the odds ratio and the relative risk
Multinomial and ordinal logistic regression
Conditional logistic regression
References
31 Rates and Poisson regression
Rates
Poisson regression
32 Generalized linear models
Which type of model do we choose?
Likelihood and maximum likelihood estimation
Assessing adequacy of fit
Regression diagnostics
33 Explanatory variables in statistical models
Nominal explanatory variables
Ordinal explanatory variables
Numerical explanatory variables
Selecting explanatory variables
Interaction
Collinearity
Confounding
34 Bias and confounding
Bias
Confounding
References
35 Checking assumptions
Why bother?
Are the data Normally distributed?
Are two or more variances equal?
Are variables linearly related?
What if the assumptions are not satisfied?
Sensitivity analysis
References
36 Sample size calculations
The importance of sample size
Requirements
Methodology
Altman’s nomogram
Quick formulae
Power statement
Adjustments
Increasing the power for a fixed sample size
References
37 Presenting results
Numerical results
Tables
Diagrams
Presenting results in a paper
References
Part 6 Additional chapters
38 Diagnostic tools
Reference intervals
Diagnostic tests
39 Assessing agreement
Measurement variability and error
Reliability
Categorical variables
Numerical variables
Reporting guidelines
References
40 Evidence-based medicine
1 Formulate the clinical question (PICO)
2 Locate the relevant information (e.g. on diagnosis, prognosis or therapy)
3 Critically appraise the methods in order to assess the validity (closeness to the truth) of the evidence
4 Extract the most useful results and determine whether they are important
5 Apply the results in clinical practice
6 Evaluate your performance
References
41 Methods for clustered data
Displaying the data
Comparing groups: inappropriate analyses
Comparing groups: appropriate analyses
Reference
42 Regression methods for clustered data
Aggregate level analysis
Robust standard errors
Random effects models
Generalized estimating equations (GEE)
References
43 Systematic reviews and meta-analysis
The systematic review
Meta-analysis
References
44 Survival analysis
Censored data
Displaying survival data
Summarizing survival
Comparing survival
Problems encountered in survival analysis
Reference
45 Bayesian methods
The frequentist approach
The Bayesian approach
Diagnostic tests in a Bayesian framework
Disadvantages of Bayesian methods
Reference
46 Developing prognostic scores
Why do we do it?
Assessing the performance of a prognostic score
Developing prognostic indices and risk scores for other types of data
Reporting guidelines
Appendices
Appendix A: Statistical tables
Reference
Appendix B: Altman’s nomogram for sample size calculations (Chapter 36)
Appendix C: Typical computer output
Appendix D: Checklists and trial profile from the EQUATOR network and critical appraisal templates
Equator Network Statements
Critical Appraisal Templates
Reference
Appendix E: Glossary of terms
Appendix F: Chapter numbers with relevant multiple-choice questions and structured questions from Medical Statistics at a Glance Workbook
Index
End User License Agreement
Medical Statistics at a Glance is directed at undergraduate medical students, medical researchers, postgraduates in the biomedical disciplines and at pharmaceutical industry personnel. All of these individuals will, at some time in their professional lives, be faced with quantitative results (their own or those of others) which will need to be critically evaluated and interpreted, and some, of course, will have to pass that dreaded statistics exam! A proper understanding of statistical concepts and methodology is invaluable for these needs. Much as we should like to fire the reader with an enthusiasm for the subject of statistics, we are pragmatic. Our aim in this new edition, as it was in the earlier editions, is to provide the student and the researcher, as well as the clinician encountering statistical concepts in the medical literature, with a book which is sound, easy to read, comprehensive, relevant, and of useful practical application.
We believe Medical Statistics at a Glance will be particularly helpful as an adjunct to statistics lectures and as a reference guide. The structure of this fourth edition is the same as that of the first three editions. In line with other books in the At a Glance series, we lead the reader through a number of self-contained two-, three- or occasionally four-page chapters, each covering a different aspect of medical statistics. There is extensive cross-referencing throughout the text to help the reader link the various procedures. We have learned from our own teaching experiences and have taken account of the difficulties that our students have encountered when studying medical statistics. For this reason, we have chosen to limit the theoretical content of the book to a level that is sufficient for understanding the procedures involved, yet which does not overshadow the practicalities of their execution.
Medical statistics is a wide-ranging subject covering a large number of topics. We have provided a basic introduction to the underlying concepts of medical statistics and a guide to the most commonly used statistical procedures. Epidemiology, concerned with the distribution and determinants of disease in specified populations, is closely allied to medical statistics. Hence some of the main issues in epidemiology, relating to study design and interpretation, are discussed. Also included are chapters that the reader may find useful only occasionally, but which are, nevertheless, fundamental to many areas of medical research; for example, evidence-based medicine, systematic reviews and meta-analysis, survival analysis, Bayesian methods and the development of prognostic scores. We have explained the principles underlying these topics so that the reader will be able to understand and interpret the results from them when they are presented in the literature.
A basic set of statistical tables is contained in Appendix A. Neave, H.R. (1995) Elementary Statistical Tables, Routledge: London, and Diem, K. Lenter, C. and Seldrup (1981) Geigy Scientific Tables, 8th rev. and enl. edition, Basle: Ciba-Geigy, amongst others, provide fuller versions if the reader requires more precise results for hand calculations. We have included a new appendix, Appendix D, in this fourth edition. This appendix contains guidelines for randomized controlled trials (the CONSORT checklist and flow chart) and observational studies (the STROBE checklist). The CONSORT and STROBE checklists are produced by the EQUATOR Network, initiated with the objectives of providing resources and training for the reporting of health research. Guidelines for the presentation of study results are now available for many other types of study and we provide website addresses in a table in Appendix D for some of these designs. Appendix D also contains templates that we hope you will find useful when you critically appraise or evaluate the evidence in randomized controlled trials and observational studies. The use of these templates to critically appraise two published papers is demonstrated in our Medical Statistics at a Glance Workbook. Due to the inclusion of the new Appendix D, the labeling of the final two appendices differs from that of the third edition: Appendix E now contains the Glossary of terms with readily accessible explanations of commonly used terminology, and Appendix F provides cross-referencing of multiple choice and structured questions from Medical Statistics at a Glance Workbook.
The chapter titles of this fourth edition are identical to those of the third edition. Some of the first 46 chapters remain unaltered in this new edition and some have relatively minor changes which accommodate recent advances, cross-referencing or re-organization of the new material. In particular, where appropriate, we have provided references to the relevant EQUATOR guidelines.
As in the third edition, we provide a set of learning objectives for each chapter. Each set provides a framework for evaluating understanding and progress. If you are able to complete all the bulleted tasks in a chapter satisfactorily, you will have mastered the concepts in that chapter.
Most of the statistical techniques described in the book are accompanied by examples illustrating their use. We have replaced many of the older examples that were in previous editions by those that are commensurate with current clinical research. We have generally obtained the data for our examples from collaborative studies in which we or colleagues have been involved; in some instances, we have used real data from published papers. Where possible, we have used the same data set in more than one chapter to reflect the reality of data analysis, which is rarely restricted to a single technique or approach. Although we believe that formulae should be provided and the logic of the approach explained as an aid to understanding, we have avoided showing the details of complex calculations – most readers will have access to computers and are unlikely to perform any but the simplest calculations by hand.
We consider that it is particularly important for the reader to be able to interpret output from a computer package. We have therefore chosen, where applicable, to show results using extracts from computer output. In some instances, where we believe individuals may have difficulty with its interpretation, we have included (Appendix C) and annotated the complete computer output from an analysis of a data set. There are many statistical packages in common use; to give the reader an indication of how output can vary, we have not restricted the output to a particular package and have, instead, used four well-known ones – SAS, SPSS, Stata and R.
We know that one of the greatest difficulties facing non-statisticians is choosing the appropriate technique. We have therefore produced two flow charts which can be used both to aid the decision as to what method to use in a given situation and to locate a particular technique in the book easily. These flow charts are displayed prominently on the inside back cover for easy access.
The reader may find it helpful to assess his/her progress in self-directed learning by attempting the interactive exercises on our website (www.medstatsaag.com) or the multiple choice and structured questions, all with model answers, in our Medical Statistics at a Glance Workbook. The website also contains a full set of references (some of which are linked directly to Medline) to supplement the references quoted in the text and provide useful background information for the examples. For those readers who wish to gain a greater insight into particular areas of medical statistics, we can recommend the following books:
Altman, D.G. (1991) Practical Statistics for Medical Research. London: Chapman and Hall/CRC.
Armitage, P., Berry, G. and Matthews, J.F.N. (2001) Statistical Methods in Medical Research. 4th edition. Oxford: Blackwell Science.
Kirkwood, B.R. and Sterne, J.A.C. (2003) Essential Medical Statistics. 2nd edition. Oxford: Blackwell Publishing.
Pocock, S.J. (1983) Clinical Trials: A Practical Approach. Chichester: Wiley.
We are extremely grateful to Mark Gilthorpe and Jonathan Sterne who made invaluable comments and suggestions on aspects of the second edition, and to Richard Morris, Fiona Lampe, Shak Hajat and Abul Basar for their counsel on the first edition. We wish to thank everyone who has helped us by providing data for the examples. Naturally, we take full responsibility for any errors that remain in the text or examples. We should also like to thank Mike, Gerald, Nina, Andrew and Karen who tolerated, with equanimity, our preoccupation with the first three editions and for their unconditional support, patience and encouragement as we laboured to produce this fourth edition.
Aviva Petrie
Caroline Sabin
London
1 Types of data
2 Data entry
3 Error checking and outliers
4 Displaying data diagrammatically
5 Describing data: the ‘average’
6 Describing data: the ‘spread’
7 Theoretical distributions: the Normal distribution
8 Theoretical distributions: other distributions
9 Transformations
Learning objectives
By the end of this chapter, you should be able to:
Distinguish between a sample and a population
Distinguish between categorical and numerical data
Describe different types of categorical and numerical data
Explain the meaning of the terms: variable, percentage, ratio, quotient, rate, score
Explain what is meant by censored data
Relevant Workbook questions: MCQs 1, 2 and 16; and SQ 1 available online
The purpose of most studies is to collect data to obtain information about a particular area of research. Our data comprise observations on one or more variables; any quantity that varies is termed a variable. For example, we may collect basic clinical and demographic information on patients with a particular illness. The variables of interest may include the sex, age and height of the patients.
Our data are usually obtained from a sample of individuals that represents the population of interest. Our aim is to condense these data in a meaningful way and extract useful information from them. Statistics encompasses the methods of collecting, summarizing, analysing and drawing conclusions from the data: we use statistical techniques to achieve our aim.
Data may take many different forms. We need to know what form every variable takes before we can make a decision regarding the most appropriate statistical methods to use. Each variable and the resulting data will be one of two types: categorical or numerical (Fig. 1.1).
Categorical (qualitative) data occur when each individual can only belong to one of a number of distinct categories of the variable.
Nominal data – the categories are not ordered but simply have names. Examples include blood group (A, B, AB and O) and marital status (married/widowed/single, etc.). In this case, there is no reason to suspect that being married is any better (or worse) than being single!
Ordinal data – the categories are ordered in some way. Examples include disease staging systems (advanced, moderate, mild, none) and degree of pain (severe, moderate, mild, none).
A categorical variable is binary or dichotomous when there are only two possible categories. Examples include ‘Yes/No’, ‘Dead/Alive’ or ‘Patient has disease/Patient does not have disease’.
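The practical consequence of these distinctions can be sketched in a few lines of Python. The observations below are invented purely for illustration, and the variable names are our own assumptions rather than anything used elsewhere in the book. The sketch shows that nominal data support only counting, whereas ordinal data can also be ranked.

```python
from collections import Counter

# Nominal: the categories have names but no order, so we can only count them.
blood_groups = ["A", "O", "B", "O", "AB", "O", "A"]  # invented observations
counts = Counter(blood_groups)
most_common_group = counts.most_common(1)[0][0]

# Ordinal: the categories are ordered, so observations can be ranked
# and a middle (median) category identified.
PAIN_ORDER = ["none", "mild", "moderate", "severe"]  # lowest to highest
pain = ["mild", "severe", "none", "moderate", "mild"]  # invented observations
ranked = sorted(pain, key=PAIN_ORDER.index)
median_pain = ranked[len(ranked) // 2]

# Binary: a categorical variable with exactly two possible categories.
has_disease = [True, False, False, True]  # invented observations
is_binary = len(set(has_disease)) == 2

print(most_common_group, median_pain, is_binary)
```

Sorting the blood groups alphabetically would be possible, but the resulting order would carry no clinical meaning, which is why no ranking is attempted for the nominal variable.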
Numerical (quantitative) data occur when the variable takes some numerical value. We can subdivide numerical data into two types.
Discrete data – occur when the variable can only take certain whole numerical values. These are often counts of numbers of events, such as the number of visits to a GP in a particular year or the number of episodes of illness in an individual over the last 5 years.
Continuous data – occur when there is no limitation on the values that the variable can take, e.g. weight or height, other than that which restricts us when we make the measurement.
We often use very different statistical methods depending on whether the data are categorical or numerical. Although the distinction between categorical and numerical data is usually clear, in some situations it may become blurred. For example, when we have a variable with a large number of ordered categories (e.g. a pain scale with seven categories), it may be difficult to distinguish it from a discrete numerical variable. The distinction between discrete and continuous numerical data may be even less clear, although in general this will have little impact on the results of most analyses. Age is an example of a variable that is often treated as discrete even though it is truly continuous. We usually refer to ‘age at last birthday’ rather than ‘age’, and therefore, a woman who reports being 30 may have just had her 30th birthday, or may be just about to have her 31st birthday.
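The 'age at last birthday' example can be made concrete with a short Python sketch. The function name and the dates are hypothetical, chosen only to illustrate how two women of almost a year's difference in true age are both recorded as 30.

```python
from datetime import date

def age_at_last_birthday(born: date, on: date) -> int:
    """Number of whole years completed between `born` and `on`."""
    years = on.year - born.year
    # Subtract one if this year's birthday has not yet been reached.
    if (on.month, on.day) < (born.month, born.day):
        years -= 1
    return years

interview_day = date(2020, 6, 15)  # hypothetical date of data collection

# One woman turns 30 on the day of the interview; another turns 31 the next day.
just_turned_30 = age_at_last_birthday(date(1990, 6, 15), interview_day)
almost_31 = age_at_last_birthday(date(1989, 6, 16), interview_day)

print(just_turned_30, almost_31)
```

Recording the date of birth and the date of measurement, rather than the reported age alone, allows the continuous age to be recovered later if it is needed.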
Do not be tempted to record numerical data as categorical at the outset (e.g. by recording only the range within which each patient’s age falls rather than his/her actual age) as important information is often lost. It is simple to convert numerical data to categorical data once they have been collected.
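A minimal sketch of this one-way conversion is shown below. The cut-points and labels are arbitrary assumptions made for illustration; any clinically sensible bands could be substituted.

```python
from bisect import bisect_right

# Hypothetical age bands: each cut-point is the lower bound of the next band.
AGE_CUTS = [20, 40, 60]
AGE_LABELS = ["<20", "20-39", "40-59", "60+"]

def age_band(age: float) -> str:
    """Return the label of the band that contains `age`."""
    return AGE_LABELS[bisect_right(AGE_CUTS, age)]

ages = [18, 20, 39, 40, 73]  # invented actual ages, recorded at the outset
bands = [age_band(a) for a in ages]

print(bands)
```

Going from `ages` to `bands` takes one line, but nothing can recover `ages` from `bands`: a patient in the '20-39' band could be any age from 20 to 39.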
We may encounter a number of other types of data in the medical field. These include:
Percentages – these may arise when considering improvements in patients following treatment, e.g. a patient’s lung function (forced expiratory volume in 1 second, FEV1) may increase by 24% following treatment with a new drug. In this case, it is the level of improvement, rather than the absolute value, which is of interest.
Ratios or quotients – occasionally you may encounter the ratio or quotient of two variables. For example, body mass index (BMI), calculated as an individual’s weight (kg) divided by her/his height squared (m²), is often used to assess whether s/he is over- or underweight.
Rates – disease rates, in which the number of disease events occurring among individuals in a study is divided by the total number of years of follow-up of all individuals in that study (Chapter 31), are common in epidemiological studies (Chapter 12).
Scores – we sometimes use an arbitrary value, such as a score, when we cannot measure a quantity. For example, a series of responses to questions on quality of life may be summed to give some overall quality of life score on each individual.
All these variables can be treated as numerical variables for most analyses. Where the variable is derived using more than one value (e.g. the numerator and denominator of a percentage), it is important to record all of the values used. For example, a 10% improvement in a marker following treatment may have different clinical relevance depending on the level of the marker before treatment.
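The derived quantities described above can be sketched as small Python functions. The function names, the invented patient values and the scaling of the rate to 1000 person-years are our assumptions for illustration only; the text itself defines the rate simply as events divided by total years of follow-up.

```python
def percentage_change(before: float, after: float) -> float:
    """Change from `before` to `after`, as a percentage of `before`."""
    return 100.0 * (after - before) / before

def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight (kg) divided by height squared (m^2)."""
    return weight_kg / height_m ** 2

def rate_per_1000_person_years(events: int, person_years: float) -> float:
    """Events divided by total follow-up, scaled to 1000 person-years (assumed scale)."""
    return 1000.0 * events / person_years

# Two invented patients with the same 10% improvement in a marker
# but very different starting levels and absolute changes.
patient_a = {"before": 2.0, "after": 2.2}
patient_b = {"before": 4.0, "after": 4.4}

change_a = percentage_change(patient_a["before"], patient_a["after"])
change_b = percentage_change(patient_b["before"], patient_b["after"])
absolute_a = patient_a["after"] - patient_a["before"]
absolute_b = patient_b["after"] - patient_b["before"]

print(round(change_a, 1), round(change_b, 1))
print(round(absolute_a, 1), round(absolute_b, 1))
print(round(bmi(70.0, 1.75), 1), rate_per_1000_person_years(12, 480.0))
```

Keeping `before` and `after` for each patient, rather than the percentage alone, is what allows the two identical percentages to be told apart.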
We may come across censored data in situations illustrated by the following examples.
If we measure laboratory values using a tool that can only detect levels above a certain cut-off value, then any values below this cut-off will not be detected, i.e. they are censored. For example, when measuring virus levels, those below the limit of detectability will often be reported as ‘undetectable’ or ‘unquantifiable’ even though there may be some virus in the sample. In this situation, if the lower cut-off of a tool is x, say, the results may be reported as ‘<x’. Similarly, some tools may only be able to reliably quantify levels below a certain cut-off value, say y; any measurements above that value will also be censored and the test result may be reported as ‘>y’.
We may encounter censored data when following patients in a trial in which, for example, some patients withdraw from the trial before the trial has ended. This type of data is discussed in more detail in Chapter 44.
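One possible way to hold laboratory results reported as '<x' or '>y' is sketched below in Python. The class, the function and the reported values are hypothetical. The labels 'left' and 'right' follow common statistical usage for values censored below and above a cut-off respectively; they are not terms introduced in this chapter.

```python
from typing import NamedTuple

class Measurement(NamedTuple):
    value: float    # the reported number (the cut-off itself if censored)
    censoring: str  # "none", "left" (true value below cut-off) or "right" (above)

def parse_result(text: str) -> Measurement:
    """Turn a reported result such as '<50' or '1200' into a value plus a censoring flag."""
    text = text.strip()
    if text.startswith("<"):
        return Measurement(float(text[1:]), "left")
    if text.startswith(">"):
        return Measurement(float(text[1:]), "right")
    return Measurement(float(text), "none")

reported = ["<50", "1200", ">100000", " 75 "]  # invented virus-level results
parsed = [parse_result(r) for r in reported]

for m in parsed:
    print(m.value, m.censoring)
```

Storing the flag separately keeps a censored '<50' from being mistaken for a measured value of exactly 50 when the data are later analysed.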
Learning objectives
By the end of this chapter, you should be able to:
Describe different formats for entering data on to a computer
Outline the principles of questionnaire design
Distinguish between single-coded and multi-coded variables
Describe how to code missing values
Relevant Workbook questions: MCQs 1, 3 and 4; and SQ 1 available online
When you carry out any study you will almost always need to enter the data into a computer package. Computers are invaluable for improving the accuracy and speed of data collection and analysis, making it easy to check for errors, produce graphical summaries of the data and generate new variables. It is worth spending some time planning data entry – this may save considerable effort at later stages.
There are a number of ways in which data can be entered and stored on a computer. Most statistical packages allow you to enter data directly. However, the limitation of this approach is that often you cannot move the data to another package. A simple alternative is to store the data in either a spreadsheet or database package. Unfortunately, their statistical procedures are often limited, and it will usually be necessary to output the data into a specialist statistical package to carry out analyses.
A more flexible approach is to have your data available as an ASCII or text file. Once in an ASCII format, the data can be read by most packages. ASCII format simply consists of rows of text that you can view on a computer screen. Usually, each variable in the file is separated from the next by some delimiter, often a space or a comma. This is known as free format.
The simplest way of entering data in ASCII format is to type the data directly in this format using either a word processing or editing package. Alternatively, data stored in spreadsheet packages can be saved in ASCII format. Using either approach, it is customary for each row of data to correspond to a different individual in the study, and each column to correspond to a different variable, although it may be necessary to go on to subsequent rows if data from a large number of variables are collected on each individual.
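To make the free-format layout concrete, the sketch below reads a small comma-delimited text in which each row is one individual and each column one variable. It uses only Python's standard csv module; the variable names and values are invented for illustration.

```python
import csv
import io

# Hypothetical file contents: a header row, then one row per individual
raw_text = """id,age,weight_kg
1,34,61.5
2,29,70.2
3,41,58.0
"""


def read_free_format(text, delimiter=","):
    """Read delimited text into a list of dictionaries, one per individual."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return [dict(row) for row in reader]


rows = read_free_format(raw_text)
```

All values are read as text, so numerical variables still need to be converted before analysis. Passing delimiter=" " would read a space-delimited file in the same way.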
When collecting data in a study you will often need to use a form or questionnaire for recording the data. If these forms are designed carefully, they can reduce the amount of work that has to be done when entering the data. Generally, these forms/questionnaires include a series of boxes in which the data are recorded – it is usual to have a separate box for each possible digit of the response.
Some statistical packages have problems dealing with non-numerical data. Therefore, you may need to assign numerical codes to categorical data before entering the data into the computer. For example, you may choose to assign the codes of 1, 2, 3 and 4 to categories of ‘no pain’, ‘mild pain’, ‘moderate pain’ and ‘severe pain’, respectively. These codes can be added to the forms when collecting the data. For binary data, e.g. yes/no answers, it is often convenient to assign the codes 1 (e.g. for ‘yes’) and 0 (for ‘no’).
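The coding scheme described above can be sketched in Python as a simple lookup table. The function name encode is an assumption made for this illustration; the codes themselves are those given in the text.

```python
# Codes from the text: ordinal pain categories and a binary yes/no answer
PAIN_CODES = {"no pain": 1, "mild pain": 2, "moderate pain": 3, "severe pain": 4}
YES_NO_CODES = {"yes": 1, "no": 0}


def encode(responses, codes):
    """Replace each text response by its numerical code.

    A response that is not in the coding scheme raises KeyError, so an
    unexpected category is noticed rather than silently coded.
    """
    return [codes[r.strip().lower()] for r in responses]


pain = encode(["Mild pain", "no pain", "severe pain"], PAIN_CODES)
answers = encode(["Yes", "no"], YES_NO_CODES)
```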
Single-coded variables – there is only one possible answer to a question, e.g. ‘Is the patient dead?’ It is not possible to answer both ‘yes’ and ‘no’ to this question.
Multi-coded variables – more than one answer is possible for each respondent. For example, ‘What symptoms has this patient experienced?’ In this case, an individual may have experienced any of a number of symptoms. There are two ways to deal with this type of data depending upon which of the two following situations applies.
There are only a few possible symptoms, and individuals may have experienced many of them.
A number of different binary variables can be created that correspond to whether the patient has answered yes or no to the presence of each possible symptom. For example, ‘Did the patient have a cough?’, ‘Did the patient have a sore throat?’
There are a very large number of possible symptoms but each patient is expected to suffer from only a few of them.
A number of different nominal variables can be created; each successive variable allows you to name a symptom suffered by the patient. For example, ‘What was the first symptom the patient suffered?’, ‘What was the second symptom?’ You will need to decide in advance the maximum number of symptoms you think a patient is likely to have suffered.
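The two ways of storing a multi-coded variable can be sketched as follows. This is only an illustration: the symptom list, the function names and the maximum of three symptoms are all assumptions.

```python
SYMPTOMS = ["cough", "sore throat", "fever"]  # hypothetical short list of symptoms


def to_binary_variables(reported, symptoms=SYMPTOMS):
    """Few possible symptoms: one yes/no (1/0) variable for each symptom."""
    reported = set(reported)
    return {s: int(s in reported) for s in symptoms}


def to_ordered_variables(reported, max_symptoms=3):
    """Many possible symptoms: 'first symptom', 'second symptom', and so on,
    up to a maximum number decided in advance."""
    if len(reported) > max_symptoms:
        raise ValueError("more symptoms reported than the maximum allowed for")
    padded = list(reported) + [None] * (max_symptoms - len(reported))
    return {f"symptom_{i + 1}": s for i, s in enumerate(padded)}


binary = to_binary_variables(["fever", "cough"])
ordered = to_ordered_variables(["rash", "headache"])
```

The first layout has one column per possible symptom, whereas the second has one column per symptom position, which keeps the data set narrow when the list of possible symptoms is very long.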
Numerical data should be entered with the same precision as they are measured, and the unit of measurement should be consistent for all observations on a variable. For example, weight should be recorded in kilograms or in pounds, but not both interchangeably.
Sometimes, information is collected on the same patient on more than one occasion. It is important that there is some unique identifier (e.g. a serial number) relating to the individual that will enable you to link all of the data from an individual in the study.
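A short sketch of how a unique identifier allows repeated records to be linked is given below. The field name patient_id and the visit data are hypothetical.

```python
from collections import defaultdict


def link_by_identifier(records, id_field="patient_id"):
    """Group all records that share the same unique identifier."""
    linked = defaultdict(list)
    for record in records:
        linked[record[id_field]].append(record)
    return dict(linked)


# Hypothetical visits: patient 17 was seen on two occasions
visits = [
    {"patient_id": 17, "visit": 1, "weight_kg": 61.5},
    {"patient_id": 23, "visit": 1, "weight_kg": 70.2},
    {"patient_id": 17, "visit": 2, "weight_kg": 62.0},
]
by_patient = link_by_identifier(visits)
```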
Dates and times should be entered in a consistent manner, e.g. either as day/month/year or month/day/year, but not interchangeably. It is important to find out what format the statistical package can read.
You should consider what you will do with missing values before you enter the data. In most cases you will need to use some symbol to represent a missing value. Statistical packages deal with missing values in different ways. Some use special characters (e.g. a full stop or asterisk) to indicate missing values, whereas others require you to define your own code for a missing value (commonly used values are 9, 999 or −99). The value that is chosen should be one that is not possible for that variable. For example, when entering a categorical variable with four categories (coded 1, 2, 3 and 4), you may choose the value 9 to represent missing values. However, if the variable is ‘age of child’ then a different code should be chosen. Missing data are discussed in more detail in Chapter 3.
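The following sketch shows one way of handling a user-defined missing-value code in Python: the code is replaced by None so that it cannot be mistaken for a real observation. The function name and the example data are assumptions.

```python
def decode_missing(values, missing_code):
    """Replace the chosen missing-value code by None."""
    return [None if v == missing_code else v for v in values]


# Pain score coded 1 to 4, so 9 is an impossible value and can mean 'missing'
pain_scores = decode_missing([1, 4, 9, 2], missing_code=9)

# For 'age of child', 9 is a genuine age, so a different code (here -99) is needed
ages = decode_missing([7, 9, -99, 12], missing_code=-99)
```

In the second example the child aged 9 is retained, which would not be the case had 9 been used as the missing-value code for age.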
As part of a study on the effect of inherited bleeding disorders on pregnancy and childbirth, data were collected on a sample of 64 women registered at a single haemophilia centre in London. The women were asked questions relating to their bleeding disorder and their first pregnancy (or their current pregnancy if they were pregnant for the first time on the date of interview). Figure 2.1 shows the data from a small selection of the women after the data have been entered onto a spreadsheet, but before they have been checked for errors. The coding schemes for the categorical variables are shown at the bottom of Fig. 2.1. Each row of the spreadsheet represents a separate individual in the study; each column represents a different variable. Where the woman is still pregnant, the age of the woman at the time of birth has been calculated from the estimated date of the baby’s delivery. Data relating to the live births are summarized in Table 37.1 in Chapter 37.
Data kindly provided by Dr R.A. Kadir, University Department of Obstetrics and Gynaecology, and Professor C.A. Lee, Haemophilia Centre and Haemostasis Unit, Royal Free Hospital, London.
Learning objectives
By the end of this chapter, you should be able to:
Describe how to check for errors in data
Distinguish between data that are missing completely at random, missing at random and missing not at random
Outline the methods of dealing with missing data, distinguishing between single and multiple imputation
Define an outlier
Explain how to check for and handle outliers
Relevant Workbook questions: MCQs 5 and 6; and SQs 1 and 28 available online
In any study there is always the potential for errors to occur in a data set, either at the outset when taking measurements, or when collecting, transcribing and entering the data into a computer. It is hard to eliminate all of these errors. However, you can reduce the number of typing and transcribing errors by checking the data carefully once they have been entered. Simply scanning the data by eye will often identify values that are obviously wrong. In this chapter we suggest a number of other approaches that you can use when checking data.
Typing mistakes are the most frequent source of errors when entering data. If the amount of data is small, then you can check the typed data set against the original forms/questionnaires to see whether there are any typing mistakes. However, this is time-consuming if the amount of data is large. It is possible to type the data in twice and compare the two data sets using a computer program. Any differences between the two data sets will reveal typing mistakes. Although this approach does not rule out the possibility that the same error has been made on both occasions, or that the value on the form/questionnaire is incorrect, it does at least minimize the number of errors. The disadvantage of this method is that it takes twice as long to enter the data, which may have major cost or time implications.
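A comparison of two independently typed data sets might be sketched as follows, assuming that both have been read in as lists of rows in the same order. The function name and the data are hypothetical.

```python
def compare_entries(first, second):
    """Return (row, column, first value, second value) for every cell that differs."""
    if len(first) != len(second):
        raise ValueError("the two data sets have different numbers of rows")
    differences = []
    for i, (row_a, row_b) in enumerate(zip(first, second)):
        for column in row_a:
            if row_a[column] != row_b.get(column):
                differences.append((i, column, row_a[column], row_b.get(column)))
    return differences


# Hypothetical double entry: the digits of one weight were transposed the second time
entry_1 = [{"id": 1, "weight": 3.12}, {"id": 2, "weight": 3.64}]
entry_2 = [{"id": 1, "weight": 3.12}, {"id": 2, "weight": 3.46}]
diffs = compare_entries(entry_1, entry_2)
```

Each difference only shows that at least one of the two entries is wrong; the original form still has to be consulted to decide which.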
Categorical data – it is relatively easy to check categorical data, as the responses for each variable can only take one of a number of limited values. Therefore, values that are not allowable must be errors.
Numerical data – numerical data are often difficult to check but are prone to errors. For example, it is simple to transpose digits or to misplace a decimal point when entering numerical data. Numerical data can be range checked – that is, upper and lower limits can be specified for each variable. If a value lies outside this range then it is flagged up for further investigation.
Dates – it is often difficult to check the accuracy of dates, although sometimes you may know that dates must fall within certain time periods. Dates can be checked to make sure that they are valid. For example, 30th February must be incorrect, as must any day of the month greater than 31, and any month greater than 12. Certain logical checks can also be applied. For example, a patient’s date of birth should correspond to his/her age, and patients should usually have been born before entering the study (at least in most studies). In addition, patients who have died should not appear for subsequent follow-up visits!
With all error checks, a value should only be corrected if there is evidence that a mistake has been made. You should not change values simply because they look unusual.
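The range and date checks described above can be sketched in Python as follows. The limits of 0.5 and 6.0 kg are hypothetical and would need to be chosen for each variable. Note that the functions only flag values for investigation; they do not change them.

```python
from datetime import date


def out_of_range(values, lower, upper):
    """Return the positions of values outside [lower, upper]; missing values are skipped."""
    return [i for i, v in enumerate(values)
            if v is not None and not (lower <= v <= upper)]


def is_valid_date(day, month, year):
    """True if the day/month/year combination exists in the calendar."""
    try:
        date(year, month, day)
        return True
    except ValueError:
        return False


def born_before_entry(date_of_birth, date_of_entry):
    """Logical check: a patient should have been born before entering the study."""
    return date_of_birth < date_of_entry


weights_kg = [3.12, 11.19, 2.95, None]
flagged = out_of_range(weights_kg, lower=0.5, upper=6.0)
```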
There is always a chance that some data will be missing. If a large proportion of the data is missing, then the results are unlikely to be reliable. The reasons why data are missing should always be investigated – if missing data tend to cluster on a particular variable and/or in a particular subgroup of individuals, then it may indicate that the variable is not applicable or has never been measured for that group of individuals. If this is the case, it may be necessary to exclude that variable or group of individuals from the analysis. There are different types of missing data [1]:
Missing completely at random (MCAR) – the missing values are truly randomly distributed in the data set and the fact that they are missing is unrelated to any study variable. The resulting parameter estimates are unlikely to be biased (Chapter 34). An example is when a patient fails to attend a hospital appointment because he is in a car accident.
Missing at random (MAR) – the missing values of a variable do not depend on that variable but can be completely explained by non-missing values of one or more of the other variables. For example, suppose that individuals are asked to keep a diet diary if their BMI is above 30 kg/m²: the missing diet diary data are MAR because missingness is completely determined by BMI (those with a BMI below the cut-off do not complete the diet diary).
Missing not at random (MNAR) – the chance that data on a particular variable are missing is strongly related to that variable. In this situation, our results may be severely biased. For example, suppose we are interested in a measurement that reflects the health status of patients and this information is missing for some patients because they were not well enough to attend their clinic appointments: we are likely to get an overly optimistic overall view of the patients’ health if we take no account of the missing data in the analysis.
Provided the missing data are not MNAR, we may be able to estimate (impute [1]) the missing data [2]. A simple approach is to replace a missing observation by the mean of the existing observations for that variable or, if the data are longitudinal, by the last observed value. These are examples of single imputation. In multiple imputation, we create a number (generally up to five) of imputed data sets from the original data set, with the missing values replaced by imputed values which are derived from an appropriate model that incorporates random variation. We then use standard statistical procedures on each complete imputed data set and finally combine the results from these analyses. Alternative statistical approaches to dealing with missing data are available [2], but the best option is to minimize the amount of missing data at the outset.
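The two single imputation approaches mentioned above can be sketched in a few lines of Python, with missing values represented by None. The function names are assumptions, and the sketch does not attempt multiple imputation.

```python
from statistics import mean


def impute_mean(values):
    """Replace each missing value by the mean of the observed values."""
    observed = [v for v in values if v is not None]
    m = mean(observed)
    return [m if v is None else v for v in values]


def impute_locf(values):
    """Longitudinal data: carry the last observed value forward.

    A missing value that comes before any observation stays missing.
    """
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled


mean_filled = impute_mean([2.0, None, 4.0])
locf_filled = impute_locf([None, 5, None, None, 7])
```

Both functions put a single fixed value in each gap, with no random variation; this is what distinguishes them from multiple imputation.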
Outliers are observations that are distinct from the main body of the data, and are incompatible with the rest of the data. These values may be genuine observations from individuals with very extreme levels of the variable. However, they may also result from typing errors or the incorrect choice of units, and so any suspicious values should be checked. It is important to detect whether there are outliers in the data set, as they may have a considerable impact on the results from some types of analyses (Chapter 29).
For example, a woman who is 7 feet tall would probably appear as an outlier in most data sets. However, although this value is clearly very high, compared with the usual heights of women, it may be genuine and the woman may simply be very tall. In this case, you should investigate this value further, possibly checking other variables such as her age and weight, before making any decisions about the validity of the result. The value should only be changed if there really is evidence that it is incorrect.
A simple approach is to print the data and check them by eye. This is suitable if the number of observations is not too large and if the potential outlier is much lower or higher than the rest of the data. Range checking should also identify possible outliers. Alternatively, the data can be plotted in some way (Chapter 4) – outliers can be clearly identified on histograms and scatter plots (see also Chapter 29 for a discussion of outliers in regression analysis).
It is important not to remove an individual from an analysis simply because his/her values are higher or lower than might be expected. However, the inclusion of outliers may affect the results when some statistical techniques are used. A simple approach is to repeat the analysis both including and excluding the value – this is a type of sensitivity analysis (Chapter 35). If the results are similar, then the outlier does not have a great influence on the result. However, if the results change drastically, it is important to use appropriate methods that are not affected by outliers to analyse the data. These include the use of transformations (Chapter 9) and non-parametric tests (Chapter 17).
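The idea of repeating an analysis with and without a suspect value can be sketched as follows, here using the mean as the ‘analysis’. The function name and the weights are hypothetical.

```python
from statistics import mean


def sensitivity_to_value(values, suspect_index, statistic=mean):
    """Compute a statistic including and excluding one suspect observation."""
    without = values[:suspect_index] + values[suspect_index + 1:]
    return statistic(values), statistic(without)


# Hypothetical baby weights (kg); the last value is a possible outlier
weights_kg = [3.0, 3.5, 2.5, 3.0, 11.0]
with_outlier, without_outlier = sensitivity_to_value(weights_kg, suspect_index=4)
```

Here the mean falls from 4.6 kg to 3.0 kg when the suspect value is excluded, so this single observation has a large influence on the result.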
After entering the data described in Chapter 2, the data set is checked for errors (Fig. 3.1). Some of the inconsistencies highlighted are simple data entry errors. For example, the code of ‘41’ in the ‘Sex of baby’ column is incorrect as a result of the sex information being missing for patient 20; the rest of the data for patient 20 had been entered in the incorrect columns. Others (e.g. unusual values in the gestational age and weight columns) are likely to be errors, but the notes should be checked before any decision is made, as these may reflect genuine outliers. In this case, the gestational age of patient number 27 was 41 weeks, and it was decided that a weight of 11.19 kg was incorrect. As it was not possible to find the correct weight for this baby, the value was entered as missing.
1. Bland, M. (2015) An Introduction to Medical Statistics. 4th edition. Oxford University Press.
2. Horton, N.J. and Kleinman, K.P. (2007) Much ado about nothing: a comparison of missing data methods and software to fit incomplete data regression models. American Statistician, 61(1), 71–90.
Learning objectives
By the end of this chapter, you should be able to:
Explain what is meant by a frequency distribution
Describe the shape of a frequency distribution
Describe the following diagrams: (segmented) bar or column chart, pie chart, histogram, dot plot, stem-and-leaf plot, box-and-whisker plot, scatter diagram
Explain how to identify outliers from a diagram in various circumstances
Describe the situations when it is appropriate to use connecting lines in a diagram
Relevant Workbook questions: MCQs 7, 8, 9, 37 and 50; and SQs 1 and 9 available online
One of the first things that you may wish to do when you have entered your data into a computer is to summarize them in some way so that you can get a ‘feel’ for the data. This can be done by producing diagrams, tables or summary statistics (Chapters 5 and 6). Diagrams are often powerful tools for conveying information about the data, for providing simple summary pictures, and for spotting outliers and trends before any formal analyses are performed.
An empirical frequency distribution of a variable relates each possible observation, class of observations (i.e. range of values) or category, as appropriate, to its observed frequency of occurrence. If we replace each frequency by a relative frequency (the percentage of the total frequency), we can compare frequency distributions in two or more groups of individuals.
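As a simple illustration, an empirical frequency distribution with relative frequencies can be obtained in Python using the standard Counter class. The function name and the categories are assumptions.

```python
from collections import Counter


def frequency_distribution(observations):
    """Map each category to (frequency, relative frequency as a percentage)."""
    counts = Counter(observations)
    total = sum(counts.values())
    return {category: (n, 100 * n / total) for category, n in counts.items()}


dist = frequency_distribution(["mild", "none", "mild", "severe"])
```

Because the relative frequencies always add up to 100%, they can be compared between groups of different sizes.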
Once the frequencies (or relative frequencies) have been obtained for categorical or some discrete numerical data, these can be displayed visually.
Bar or column chart – a separate horizontal or vertical bar is drawn for each category, its length being proportional to the frequency in that category. The bars are separated by small gaps to indicate that the data are categorical or discrete (Fig. 4.1a).
Pie chart – a circular ‘pie’ is split into sectors, one for each category, so that the area of each sector is proportional to the frequency in that category (Fig. 4.1b).
It is often more difficult to display continuous numerical data, as the data may need to be summarized before being drawn. Commonly used diagrams include the following:
Histogram – this is similar to a bar chart, but there should be no gaps between the bars as the data are continuous (Fig. 4.1d). The width of each bar of the histogram relates to a range of values for the variable. For example, the baby’s weight (Fig. 4.1d) may be categorized into 1.75–1.99 kg, 2.00–2.24 kg, …, 4.25–4.49 kg. The area of the bar is proportional to the frequency in that range. Therefore, if one of the groups covers a wider range than the others, its base will be wider and height shorter to compensate. Usually, between five and 20 groups are chosen; the ranges should be narrow enough to illustrate patterns in the data, but should not be so narrow that they are the raw data. The histogram should be labelled carefully to make it clear where the boundaries lie.
Dot plot – each observation is represented by one dot on a horizontal (or vertical) line (Fig. 4.1e). This type of plot is very simple to draw, but can be cumbersome with large data sets. Often a summary measure of the data, such as the mean or median (Chapter 5), is shown on the diagram. This plot may also be used for discrete data.
Stem-and-leaf plot – this is a mixture of a diagram and a table; it looks similar to a histogram turned on its side, and is effectively the data values written in increasing order of size. It is usually drawn with a vertical stem, consisting of the first few digits of the values, arranged in order. Protruding from this stem are the leaves – i.e. the final digit of each of the ordered values, which are written horizontally (Fig. 4.2) in increasing numerical order.
Box plot (often called a box-and-whisker plot) – this is a vertical or horizontal rectangle, with the ends of the rectangle corresponding to the upper and lower quartiles of the data values (Chapter 6). A line drawn through the rectangle corresponds to the median value (Chapter 5). Whiskers, starting at the ends of the rectangle, usually indicate minimum and maximum values but sometimes relate to particular percentiles, e.g. the 5th and 95th percentiles (Fig. 6.1). Outliers may be marked.
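Two of the diagrams described above lend themselves to a short sketch. The first function calculates bar heights for a histogram so that the area of each bar equals the frequency in its range, even when the ranges have unequal widths; the second builds a simple text stem-and-leaf plot. The function names are assumptions, and the stem-and-leaf sketch is simplified: it expects non-negative whole numbers and omits stems that have no leaves.

```python
def histogram_heights(values, edges):
    """Bar heights such that height x width equals the frequency in each class.

    Each class is [lower, upper); the final class also includes its upper boundary.
    """
    heights = []
    last = len(edges) - 2
    for k, (lo, hi) in enumerate(zip(edges, edges[1:])):
        n = sum(1 for v in values if lo <= v < hi or (k == last and v == hi))
        heights.append(n / (hi - lo))
    return heights


def stem_and_leaf(values):
    """Stem = all digits except the last; leaf = the final digit, in increasing order."""
    stems = {}
    for v in sorted(values):
        stems.setdefault(v // 10, []).append(v % 10)
    return [f"{stem} | {''.join(str(leaf) for leaf in leaves)}"
            for stem, leaves in stems.items()]


# Two classes of unequal width, each containing two observations
heights = histogram_heights([1, 1, 3, 5], edges=[0, 2, 10])
plot = stem_and_leaf([23, 21, 35, 30, 21])
```

In the histogram example both classes have the same frequency, but the second class is four times as wide and so its bar is one quarter of the height.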
The choice of the most appropriate statistical method will often depend on the shape of the distribution. The distribution of the data is usually unimodal in that it has a single ‘peak’. Sometimes the distribution is bimodal (two peaks) or uniform (each value is equally likely and there are no peaks). When the distribution is unimodal, the main aim is to see where the majority of the data values lie, relative to the maximum and minimum values. In particular, it is important to assess whether the distribution is:
symmetrical – centred around some mid-point, with one side being a mirror-image of the other (Fig. 5.1);
skewed to the right (positively skewed) – a long tail to the right with one or a few high values. Such data are common in medical research (Fig. 5.2);
skewed to the left (negatively skewed) – a long tail to the left with one or a few low values (Fig. 4.1d).
If one variable is categorical, then separate diagrams showing the distribution of the second variable can be drawn for each of the categories. Other plots suitable for such data include clustered or segmented bar or column charts (Fig. 4.1c).
If both of the variables are numerical or ordinal, then the relationship between the two can be illustrated using a scatter diagram (Fig. 4.1f). This plots one variable against the other in a two-way diagram. One variable is usually termed the x variable and is represented on the horizontal axis. The second variable, known as the y variable, is plotted on the vertical axis.
We can often use single-variable data displays to identify outliers. For example, a very long tail on one side of a histogram may indicate an outlying value. However, outliers may sometimes only become apparent when considering the relationship between two variables. For example, a weight of 55 kg would not be unusual for a woman who was 1.6 m tall, but would be unusually low if the woman’s height was 1.9 m.
The use of connecting lines in scatter diagrams may be misleading. Connecting lines suggest that the values on the x-axis are ordered in some way – this might be the case if, for example, the x-axis reflects some measure of time or dose. Where this is not the case, the points should not be joined with a line. Conversely, if there is a dependency between different points (e.g. because they relate to results from the same individual at two different time points, such as before and after treatment), it is helpful to connect the relevant points by a straight line (Fig. 20.1) and important information may be lost if these lines are omitted.
Learning objectives
By the end of this chapter, you should be able to:
Explain what is meant by an average
Describe the appropriate use of each of the following types of average: arithmetic mean, mode, median, geometric mean, weighted mean
Explain how to calculate each type of average
List the advantages and disadvantages of each type of average
Relevant Workbook questions: MCQs 1, 10, 11, 12, 13, 19 and 39; and SQs 2, 3, 4 and 9 available online
It is very difficult to have any ‘feeling’ for a set of numerical measurements unless we can summarize the data in a meaningful way. A diagram (Chapter 4) is often a useful starting point. We can also condense the information by providing measures that describe the important characteristics of the data. In particular, if we have some perception of what constitutes a representative value, and if we know how widely scattered the observations are around it, then we can formulate an image of the data. The average is a general term for a measure of location; it describes a typical measurement. We devote this chapter to averages, the most common being the mean and median (Table 5.1). We introduce measures that describe the scatter or spread of the observations in Chapter 6.
Table 5.1 Advantages and disadvantages of averages.
| Type of average | Advantages | Disadvantages |
| --- | --- | --- |
| Mean | Uses all the data values; algebraically defined and so mathematically manageable; known sampling distribution (Chapter 9) | Distorted by outliers; distorted by skewed data |
| Median | Not distorted by outliers; not distorted by skewed data | Ignores most of the information; not algebraically defined; complicated sampling distribution |
| Mode | Easily determined for categorical data | Ignores most of the information; not algebraically defined; unknown sampling distribution |
| Geometric mean | Before back-transformation, it has the same advantages as the mean; appropriate for right-skewed data | Only appropriate if the log transformation produces a symmetrical distribution |
| Weighted mean | Same advantages as the mean; ascribes relative importance to each observation; algebraically defined | Weights must be known or estimated |
The arithmetic mean, often simply called the mean, of a set of values is calculated by adding up all the values and dividing this sum by the number of values in the set.
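The verbal description of the mean translates directly into code. The sketch below is a minimal Python version; the function name and the heights are hypothetical.

```python
def arithmetic_mean(values):
    """Add up all the values and divide the sum by the number of values."""
    if not values:
        raise ValueError("the mean of an empty set of values is undefined")
    return sum(values) / len(values)


heights_cm = [160.0, 170.0, 165.0, 175.0]
mean_height = arithmetic_mean(heights_cm)
```

For these four heights the sum is 670 cm, so the mean is 670/4 = 167.5 cm.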
It is useful to be able to summarize this verbal description by an algebraic formula. Using mathematical notation, we write our set of n observations of a variable, x, as x1, x2, x3, …, xn. For example, x might represent an individual’s height (cm), so that x1 represents the height of the first individual, and xi the height of the i
