Medical Statistics at a Glance - Aviva Petrie - E-Book

Medical Statistics at a Glance E-Book

Aviva Petrie

Description

Now in its fourth edition, Medical Statistics at a Glance is a concise and accessible introduction to this complex subject. It provides clear instruction on how to apply commonly used statistical procedures in an easy-to-read, comprehensive and relevant volume. This new edition continues to be the ideal introductory manual and reference guide to medical statistics, an invaluable companion for statistics lectures and a very useful revision aid.

This new edition of Medical Statistics at a Glance:

  • Offers guidance on the practical application of statistical methods in conducting research and presenting results
  • Explains the underlying concepts of medical statistics and presents the key facts without being unduly mathematical
  • Contains succinct self-contained chapters, each with one or more examples, many of them new, to illustrate the use of the methodology described in the chapter.
  • Now provides templates for critical appraisal, checklists for the reporting of randomized controlled trials and observational studies and references to the EQUATOR guidelines for the presentation of study results for many other types of study
  • Includes extensive cross-referencing, flowcharts to aid the choice of appropriate tests, learning objectives for each chapter, a glossary of terms and a glossary of annotated full computer output relevant to the examples in the text
  • Provides cross-referencing to the multiple choice and structured questions in the companion Medical Statistics at a Glance Workbook

Medical Statistics at a Glance is a must-have text for undergraduate and post-graduate medical students, medical researchers and biomedical and pharmaceutical professionals.


Page count: 636

Year of publication: 2019


This title is also available as an e-book.

For more details, please see www.wiley.com/buy/9781119167815

Also available to buy!

Medical Statistics at a Glance Workbook

A comprehensive workbook containing a variety of examples and exercises, complete with model answers, designed to support your learning and revision.

Fully cross-referenced to Medical Statistics at a Glance, this workbook includes:

  • Over 80 MCQs, each testing knowledge of a single statistical concept or aspect of study interpretation
  • 29 structured questions to explore in greater depth several statistical techniques or principles
  • Full appraisals of two published papers to demonstrate the use of templates for clinical trials and observational studies
  • Detailed step-by-step analyses of two substantial data sets (also available at www.medstatsaag.com) to demonstrate the application of statistical procedures to real-life research

Medical Statistics at a Glance Workbook is the ideal resource to improve statistical knowledge together with your analytical and interpretational skills.

Medical Statistics at a Glance

Fourth Edition

Aviva Petrie

Honorary Associate Professor of Biostatistics

UCL Eastman Dental Institute

London, UK

Caroline Sabin

Professor of Medical Statistics and Epidemiology

Institute for Global Health

UCL

London, UK

This edition first published 2020

© 2020 Aviva Petrie and Caroline Sabin

Edition History

Aviva Petrie and Caroline Sabin (1e, 2000; 2e, 2005; 3e, 2009).

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.

The right of Aviva Petrie and Caroline Sabin to be identified as the authors of this work has been asserted in accordance with law.

Registered Office(s)

John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA

John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK

Editorial Office

9600 Garsington Road, Oxford, OX4 2DQ, UK

For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.

Wiley also publishes its books in a variety of electronic formats and by print-on-demand. Some content that appears in standard print versions of this book may not be available in other formats.

Limit of Liability/Disclaimer of Warranty

The contents of this work are intended to further general scientific research, understanding, and discussion only and are not intended and should not be relied upon as recommending or promoting scientific method, diagnosis, or treatment by physicians for any particular patient. In view of ongoing research, equipment modifications, changes in governmental regulations, and the constant flow of information relating to the use of medicines, equipment, and devices, the reader is urged to review and evaluate the information provided in the package insert or instructions for each medicine, equipment, or device for, among other things, any changes in the instructions or indication of usage and for added warnings and precautions. While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials or promotional statements for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. 
Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

Library of Congress Cataloging-in-Publication Data

Names: Petrie, Aviva, author. | Sabin, Caroline, author.

Title: Medical statistics at a glance / Aviva Petrie, Caroline Sabin.

Description: Fourth edition. | Hoboken, NJ : Wiley-Blackwell, 2020. | Includes bibliographical references and index. |

Identifiers: LCCN 2019008181 (print) | LCCN 2019008704 (ebook) | ISBN 9781119167822 (Adobe PDF) | ISBN 9781119167839 (ePub) | ISBN 9781119167815 (pbk.)

Subjects: | MESH: Statistics as Topic | Research Design

Classification: LCC R853.S7 (ebook) | LCC R853.S7 (print) | NLM WA 950 | DDC

   610.72/7—dc23

LC record available at https://lccn.loc.gov/2019008181

Cover Design: Wiley

Cover Image: © Somyot Techapuwapat/EyeEm/Getty Images

CONTENTS

Cover

Also available to buy!

Preface

Part 1 Handling data

1 Types of data

Data and statistics

Categorical (qualitative) data

Numerical (quantitative) data

Distinguishing between data types

Derived data

Censored data

2 Data entry

Formats for data entry

Planning data entry

Categorical data

Numerical data

Multiple forms per patient

Problems with dates and times

Coding missing values

3 Error checking and outliers

Typing errors

Error checking

Handling missing data

Outliers

References

4 Displaying data diagrammatically

One variable

Two variables

Identifying outliers using graphical methods

The use of connecting lines in diagrams

5 Describing data: the ‘average’

Summarizing data

The arithmetic mean

The median

The mode

The geometric mean

The weighted mean

6 Describing data: the ‘spread’

Summarizing data

The range

Ranges derived from percentiles

The standard deviation

Variation within- and between-subjects

7 Theoretical distributions: the Normal distribution

Understanding probability

The rules of probability

Probability distributions: the theory

The Normal (Gaussian) distribution

The Standard Normal distribution

8 Theoretical distributions: other distributions

Some words of comfort

More continuous probability distributions

Discrete probability distributions

9 Transformations

Why transform?

How do we transform?

Typical transformations

Part 2 Sampling and estimation

10 Sampling and sampling distributions

Why do we sample?

Obtaining a representative sample

Point estimates

Sampling variation

Sampling distribution of the mean

Interpreting standard errors

SD or SEM?

Sampling distribution of the proportion

11 Confidence intervals

Confidence interval for the mean

Confidence interval for the proportion

Interpretation of confidence intervals

Degrees of freedom

Bootstrapping and jackknifing

Reference

Part 3 Study design

12 Study design I

Experimental or observational studies

Defining the unit of observation

Multicentre studies

Assessing causality

Cross-sectional or longitudinal studies

Controls

Bias

Reference

13 Study design II

Variation

Replication

Sample size

Particular study designs

Choosing an appropriate study endpoint

References

14 Clinical trials

Treatment comparisons

Primary and secondary endpoints

Subgroup analyses

Treatment allocation

Sequential trials

Blinding or masking

Patient issues

The protocol

References

15 Cohort studies

Selection of cohorts

Follow-up of individuals

Information on outcomes and exposures

Analysis of cohort studies

Advantages of cohort studies

Disadvantages of cohort studies

Study management

Clinical cohorts

16 Case–control studies

Selection of cases

Selection of controls

Identification of risk factors

Matching

Analysis of unmatched or group-matched case–control studies

Analysis of individually matched case–control studies

Advantages of case–control studies

Disadvantages of case–control studies

References

Part 4 Hypothesis testing

17 Hypothesis testing

Defining the null and alternative hypotheses

Obtaining the test statistic

Obtaining the P-value

Using the P-value

Non-parametric tests

Which test?

Hypothesis tests versus confidence intervals

Equivalence and non-inferiority trials

References

18 Errors in hypothesis testing

Making a decision

Making the wrong decision

Power and related factors

Multiple hypothesis testing

References

Part 5 Basic techniques for analysing data

19 Numerical data: a single group

The problem

The one-sample t-test

The sign test

20 Numerical data: two related groups

The problem

The paired t-test

The Wilcoxon signed ranks test

Reference

21 Numerical data: two unrelated groups

The problem

The unpaired (two-sample) t-test

The Wilcoxon rank sum (two-sample) test

Reference

22 Numerical data: more than two groups

The problem

One-way analysis of variance

The Kruskal–Wallis test

References

23 Categorical data: a single proportion

The problem

The test of a single proportion

The sign test applied to a proportion

24 Categorical data: two proportions

The problems

Independent groups: the Chi-squared test

Related groups: McNemar’s test

Reference

25 Categorical data: more than two categories

Chi-squared test: large contingency tables

Chi-squared test for trend

26 Correlation

Pearson correlation coefficient

Spearman’s rank correlation coefficient

27 The theory of linear regression

What is linear regression?

The regression line

Method of least squares

Assumptions

Analysis of variance table

Regression to the mean

28 Performing a linear regression analysis

The linear regression line

Drawing the line

Checking the assumptions

Failure to satisfy the assumptions

Outliers and influential points

Assessing goodness of fit

Investigating the slope

Using the line for prediction

Improving the interpretation of the model

29 Multiple linear regression

What is it?

Why do it?

Assumptions

Categorical explanatory variables

Analysis of covariance

Choice of explanatory variables

Analysis

Outliers and influential points

Reference

30 Binary outcomes and logistic regression

Reasoning

The logistic regression equation

The explanatory variables

Assessing the adequacy of the model

Comparing the odds ratio and the relative risk

Multinomial and ordinal logistic regression

Conditional logistic regression

References

31 Rates and Poisson regression

Rates

Poisson regression

32 Generalized linear models

Which type of model do we choose?

Likelihood and maximum likelihood estimation

Assessing adequacy of fit

Regression diagnostics

33 Explanatory variables in statistical models

Nominal explanatory variables

Ordinal explanatory variables

Numerical explanatory variables

Selecting explanatory variables

Interaction

Collinearity

Confounding

34 Bias and confounding

Bias

Confounding

References

35 Checking assumptions

Why bother?

Are the data Normally distributed?

Are two or more variances equal?

Are variables linearly related?

What if the assumptions are not satisfied?

Sensitivity analysis

References

36 Sample size calculations

The importance of sample size

Requirements

Methodology

Altman’s nomogram

Quick formulae

Power statement

Adjustments

Increasing the power for a fixed sample size

References

37 Presenting results

Numerical results

Tables

Diagrams

Presenting results in a paper

References

Part 6 Additional chapters

38 Diagnostic tools

Reference intervals

Diagnostic tests

39 Assessing agreement

Measurement variability and error

Reliability

Categorical variables

Numerical variables

Reporting guidelines

References

40 Evidence-based medicine

1 Formulate the clinical question (PICO)

2 Locate the relevant information (e.g. on diagnosis, prognosis or therapy)

3 Critically appraise the methods in order to assess the validity (closeness to the truth) of the evidence

4 Extract the most useful results and determine whether they are important

5 Apply the results in clinical practice

6 Evaluate your performance

References

41 Methods for clustered data

Displaying the data

Comparing groups: inappropriate analyses

Comparing groups: appropriate analyses

Reference

42 Regression methods for clustered data

Aggregate level analysis

Robust standard errors

Random effects models

Generalized estimating equations (GEE)

References

43 Systematic reviews and meta-analysis

The systematic review

Meta-analysis

References

44 Survival analysis

Censored data

Displaying survival data

Summarizing survival

Comparing survival

Problems encountered in survival analysis

Reference

45 Bayesian methods

The frequentist approach

The Bayesian approach

Diagnostic tests in a Bayesian framework

Disadvantages of Bayesian methods

Reference

46 Developing prognostic scores

Why do we do it?

Assessing the performance of a prognostic score

Developing prognostic indices and risk scores for other types of data

Reporting guidelines

Appendices

Appendix A: Statistical tables

Reference

Appendix B: Altman’s nomogram for sample size calculations (Chapter 36)

Appendix C: Typical computer output

Appendix D: Checklists and trial profile from the EQUATOR network and critical appraisal templates

Equator Network Statements

Critical Appraisal Templates

Reference

Appendix E: Glossary of terms

Appendix F: Chapter numbers with relevant multiple-choice questions and structured questions from Medical Statistics at a Glance Workbook

Index

End User License Agreement

Preface

Medical Statistics at a Glance is directed at undergraduate medical students, medical researchers, postgraduates in the biomedical disciplines and at pharmaceutical industry personnel. All of these individuals will, at some time in their professional lives, be faced with quantitative results (their own or those of others) which will need to be critically evaluated and interpreted, and some, of course, will have to pass that dreaded statistics exam! A proper understanding of statistical concepts and methodology is invaluable for these needs. Much as we should like to fire the reader with an enthusiasm for the subject of statistics, we are pragmatic. Our aim in this new edition, as it was in the earlier editions, is to provide the student and the researcher, as well as the clinician encountering statistical concepts in the medical literature, with a book which is sound, easy to read, comprehensive, relevant, and of useful practical application.

We believe Medical Statistics at a Glance will be particularly helpful as an adjunct to statistics lectures and as a reference guide. The structure of this fourth edition is the same as that of the first three editions. In line with other books in the At a Glance series, we lead the reader through a number of self-contained two-, three- or occasionally four-page chapters, each covering a different aspect of medical statistics. There is extensive cross-referencing throughout the text to help the reader link the various procedures. We have learned from our own teaching experiences and have taken account of the difficulties that our students have encountered when studying medical statistics. For this reason, we have chosen to limit the theoretical content of the book to a level that is sufficient for understanding the procedures involved, yet which does not overshadow the practicalities of their execution.

Medical statistics is a wide-ranging subject covering a large number of topics. We have provided a basic introduction to the underlying concepts of medical statistics and a guide to the most commonly used statistical procedures. Epidemiology, concerned with the distribution and determinants of disease in specified populations, is closely allied to medical statistics. Hence some of the main issues in epidemiology, relating to study design and interpretation, are discussed. Also included are chapters that the reader may find useful only occasionally, but which are, nevertheless, fundamental to many areas of medical research; for example, evidence-based medicine, systematic reviews and meta-analysis, survival analysis, Bayesian methods and the development of prognostic scores. We have explained the principles underlying these topics so that the reader will be able to understand and interpret the results from them when they are presented in the literature.

A basic set of statistical tables is contained in Appendix A. Neave, H.R. (1995) Elementary Statistical Tables, Routledge: London, and Diem, K. Lenter, C. and Seldrup (1981) Geigy Scientific Tables, 8th rev. and enl. edition, Basle: Ciba-Geigy, amongst others, provide fuller versions if the reader requires more precise results for hand calculations. We have included a new appendix, Appendix D, in this fourth edition. This appendix contains guidelines for randomized controlled trials (the CONSORT checklist and flow chart) and observational studies (the STROBE checklist). The CONSORT and STROBE checklists are produced by the EQUATOR Network, initiated with the objectives of providing resources and training for the reporting of health research. Guidelines for the presentation of study results are now available for many other types of study and we provide website addresses in a table in Appendix D for some of these designs. Appendix D also contains templates that we hope you will find useful when you critically appraise or evaluate the evidence in randomized controlled trials and observational studies. The use of these templates to critically appraise two published papers is demonstrated in our Medical Statistics at a Glance Workbook. Due to the inclusion of the new Appendix D, the labeling of the final two appendices differs from that of the third edition: Appendix E now contains the Glossary of terms with readily accessible explanations of commonly used terminology, and Appendix F provides cross-referencing of multiple choice and structured questions from Medical Statistics at a Glance Workbook.

The chapter titles of this fourth edition are identical to those of the third edition. Some of the first 46 chapters remain unaltered in this new edition and some have relatively minor changes which accommodate recent advances, cross-referencing or re-organization of the new material. In particular, where appropriate, we have provided references to the relevant EQUATOR guidelines.

As in the third edition, we provide a set of learning objectives for each chapter. Each set provides a framework for evaluating understanding and progress. If you are able to complete all the bulleted tasks in a chapter satisfactorily, you will have mastered the concepts in that chapter.

Most of the statistical techniques described in the book are accompanied by examples illustrating their use. We have replaced many of the older examples that were in previous editions by those that are commensurate with current clinical research. We have generally obtained the data for our examples from collaborative studies in which we or colleagues have been involved; in some instances, we have used real data from published papers. Where possible, we have used the same data set in more than one chapter to reflect the reality of data analysis, which is rarely restricted to a single technique or approach. Although we believe that formulae should be provided and the logic of the approach explained as an aid to understanding, we have avoided showing the details of complex calculations – most readers will have access to computers and are unlikely to perform any but the simplest calculations by hand.

We consider that it is particularly important for the reader to be able to interpret output from a computer package. We have therefore chosen, where applicable, to show results using extracts from computer output. In some instances, where we believe individuals may have difficulty with its interpretation, we have included (Appendix C) and annotated the complete computer output from an analysis of a data set. There are many statistical packages in common use; to give the reader an indication of how output can vary, we have not restricted the output to a particular package and have, instead, used four well-known ones – SAS, SPSS, Stata and R.

We know that one of the greatest difficulties facing non-statisticians is choosing the appropriate technique. We have therefore produced two flow charts which can be used both to aid the decision as to what method to use in a given situation and to locate a particular technique in the book easily. These flow charts are displayed prominently on the inside back cover for easy access.

The reader may find it helpful to assess his/her progress in self-directed learning by attempting the interactive exercises on our website (www.medstatsaag.com) or the multiple choice and structured questions, all with model answers, in our Medical Statistics at a Glance Workbook. The website also contains a full set of references (some of which are linked directly to Medline) to supplement the references quoted in the text and provide useful background information for the examples. For those readers who wish to gain a greater insight into particular areas of medical statistics, we can recommend the following books:

Altman, D.G. (1991) Practical Statistics for Medical Research. London: Chapman and Hall/CRC.

Armitage, P., Berry, G. and Matthews, J.F.N. (2001) Statistical Methods in Medical Research. 4th edition. Oxford: Blackwell Science.

Kirkwood, B.R. and Sterne, J.A.C. (2003) Essential Medical Statistics. 2nd edition. Oxford: Blackwell Publishing.

Pocock, S.J. (1983) Clinical Trials: A Practical Approach. Chichester: Wiley.

We are extremely grateful to Mark Gilthorpe and Jonathan Sterne, who made invaluable comments and suggestions on aspects of the second edition, and to Richard Morris, Fiona Lampe, Shak Hajat and Abul Basar for their counsel on the first edition. We wish to thank everyone who has helped us by providing data for the examples. Naturally, we take full responsibility for any errors that remain in the text or examples. We should also like to thank Mike, Gerald, Nina, Andrew and Karen, who tolerated, with equanimity, our preoccupation with the first three editions, for their unconditional support, patience and encouragement as we laboured to produce this fourth edition.

Aviva Petrie

Caroline Sabin

London

Part 1 Handling data

Chapters

1 Types of data

2 Data entry

3 Error checking and outliers

4 Displaying data diagrammatically

5 Describing data: the ‘average’

6 Describing data: the ‘spread’

7 Theoretical distributions: the Normal distribution

8 Theoretical distributions: other distributions

9 Transformations

1 Types of data

Learning objectives

By the end of this chapter, you should be able to:

  • Distinguish between a sample and a population
  • Distinguish between categorical and numerical data
  • Describe different types of categorical and numerical data
  • Explain the meaning of the terms: variable, percentage, ratio, quotient, rate, score
  • Explain what is meant by censored data

Relevant Workbook questions: MCQs 1, 2 and 16; and SQ 1 available online

Data and statistics

The purpose of most studies is to collect data to obtain information about a particular area of research. Our data comprise observations on one or more variables; any quantity that varies is termed a variable. For example, we may collect basic clinical and demographic information on patients with a particular illness. The variables of interest may include the sex, age and height of the patients.

Our data are usually obtained from a sample of individuals that represents the population of interest. Our aim is to condense these data in a meaningful way and extract useful information from them. Statistics encompasses the methods of collecting, summarizing, analysing and drawing conclusions from the data: we use statistical techniques to achieve our aim.

Data may take many different forms. We need to know what form every variable takes before we can make a decision regarding the most appropriate statistical methods to use. Each variable and the resulting data will be one of two types: categorical or numerical (Fig. 1.1).

Categorical (qualitative) data

These occur when each individual can only belong to one of a number of distinct categories of the variable.

Nominal data – the categories are not ordered but simply have names. Examples include blood group (A, B, AB and O) and marital status (married/widowed/single, etc.). In this case, there is no reason to suspect that being married is any better (or worse) than being single!

Ordinal data – the categories are ordered in some way. Examples include disease staging systems (advanced, moderate, mild, none) and degree of pain (severe, moderate, mild, none).

A categorical variable is binary or dichotomous when there are only two possible categories. Examples include ‘Yes/No’, ‘Dead/Alive’ or ‘Patient has disease/Patient does not have disease’.
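The distinction between nominal, ordinal and binary variables can be made concrete in code. The following is a minimal sketch in Python, using only the standard library; the class names `BloodGroup` and `PainLevel` and the helper `is_binary` are illustrative choices of ours rather than anything prescribed by the text. A plain `Enum` has names but no order, whereas an `IntEnum` carries codes whose order is meaningful (although the gaps between the codes are not).

```python
from enum import Enum, IntEnum


class BloodGroup(Enum):
    """Nominal variable: the categories have names but no order."""
    A = "A"
    B = "B"
    AB = "AB"
    O = "O"


class PainLevel(IntEnum):
    """Ordinal variable: the codes reflect order, not magnitude."""
    NONE = 0
    MILD = 1
    MODERATE = 2
    SEVERE = 3


def is_binary(observations):
    """Return True when exactly two distinct categories are observed."""
    return len(set(observations)) == 2


# Ordered categories can be compared, so the 'worst' pain is well defined.
pain_reports = [PainLevel.MILD, PainLevel.SEVERE, PainLevel.NONE]
worst_pain = max(pain_reports)

# A Yes/No variable is binary; blood group, with four categories, is not.
disease_status = ["Yes", "No", "Yes", "No"]
print(worst_pain.name, is_binary(disease_status))
```

Attempting `BloodGroup.A < BloodGroup.B` raises a `TypeError`, which mirrors the point that nominal categories cannot be ranked.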

Numerical (quantitative) data

These occur when the variable takes some numerical value. We can subdivide numerical data into two types.

Discrete data – occur when the variable can only take certain whole numerical values. These are often counts of numbers of events, such as the number of visits to a GP in a particular year or the number of episodes of illness in an individual over the last 5 years.

Continuous data – occur when there is no limitation on the values that the variable can take, e.g. weight or height, other than that which restricts us when we make the measurement.

Distinguishing between data types

We often use very different statistical methods depending on whether the data are categorical or numerical. Although the distinction between categorical and numerical data is usually clear, in some situations it may become blurred. For example, when we have a variable with a large number of ordered categories (e.g. a pain scale with seven categories), it may be difficult to distinguish it from a discrete numerical variable. The distinction between discrete and continuous numerical data may be even less clear, although in general this will have little impact on the results of most analyses. Age is an example of a variable that is often treated as discrete even though it is truly continuous. We usually refer to ‘age at last birthday’ rather than ‘age’, and therefore, a woman who reports being 30 may have just had her 30th birthday, or may be just about to have her 31st birthday.
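The ‘age at last birthday’ convention can be sketched as follows. This is an illustrative Python function of our own (the name `age_at_last_birthday` and the example dates are assumptions, not taken from the text); it shows how a continuous quantity is turned into a whole number of completed years.

```python
from datetime import date


def age_at_last_birthday(born: date, on: date) -> int:
    """Whole years completed between the date of birth and the date `on`."""
    years = on.year - born.year
    # Subtract one year if this year's birthday has not yet been reached.
    if (on.month, on.day) < (born.month, born.day):
        years -= 1
    return years


# Hypothetical woman born on 15 June 1990.
born = date(1990, 6, 15)
on_30th_birthday = age_at_last_birthday(born, date(2020, 6, 15))
day_before_31st = age_at_last_birthday(born, date(2021, 6, 14))
on_31st_birthday = age_at_last_birthday(born, date(2021, 6, 15))
print(on_30th_birthday, day_before_31st, on_31st_birthday)
```

The first two dates are almost a year apart, yet both are recorded as 30. Under this simple rule, an individual born on 29 February is treated as reaching a birthday on 1 March in non-leap years; other conventions exist.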

Do not be tempted to record numerical data as categorical at the outset (e.g. by recording only the range within which each patient’s age falls rather than his/her actual age) as important information is often lost. It is simple to convert numerical data to categorical data once they have been collected.
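As a minimal sketch of how simple the conversion is once actual ages have been recorded, the Python function below assigns each age to a band. The cut-points and labels are hypothetical and chosen only for illustration; in practice they would be dictated by the research question.

```python
from bisect import bisect_right

# Hypothetical cut-points: each edge is the lowest age in the next band.
AGE_BAND_EDGES = [18, 40, 65]
AGE_BAND_LABELS = ["<18", "18-39", "40-64", "65+"]


def age_band(age: float) -> str:
    """Return the label of the band into which the age falls."""
    return AGE_BAND_LABELS[bisect_right(AGE_BAND_EDGES, age)]


ages = [17, 18, 30, 39, 40, 72]
bands = [age_band(a) for a in ages]
print(bands)
```

The conversion works in one direction only: ages of 18, 30 and 39 all become ‘18-39’, and the original values cannot be recovered from the bands, which is why the actual age should be recorded at the outset.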

Derived data

We may encounter a number of other types of data in the medical field. These include:

Percentages

– these may arise when considering improvements in patients following treatment, e.g. a patient’s lung function (forced expiratory volume in 1 second, FEV1) may increase by 24% following treatment with a new drug. In this case, it is the level of improvement, rather than the absolute value, which is of interest.

Ratios or quotients

– occasionally you may encounter the ratio or quotient of two variables. For example, body mass index (BMI), calculated as an individual’s weight (kg) divided by her/his height squared (m²), is often used to assess whether s/he is over- or underweight.

Rates

– disease rates, in which the number of disease events occurring among individuals in a study is divided by the total number of years of follow-up of all individuals in that study (Chapter 31), are common in epidemiological studies (Chapter 12).

Scores

– we sometimes use an arbitrary value, such as a score, when we cannot measure a quantity. For example, a series of responses to questions on quality of life may be summed to give some overall quality of life score on each individual.

All these variables can be treated as numerical variables for most analyses. Where the variable is derived using more than one value (e.g. the numerator and denominator of a percentage), it is important to record all of the values used. For example, a 10% improvement in a marker following treatment may have different clinical relevance depending on the level of the marker before treatment.
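The derived quantities described above can be sketched in Python as follows. The function names, the example values and the scaling of the rate to 1000 person-years are assumptions made for illustration only.

```python
def bmi(weight_kg, height_m):
    """Body mass index: weight (kg) divided by height squared (m^2)."""
    return weight_kg / height_m ** 2

def percentage_change(before, after):
    """Change from 'before' to 'after', as a percentage of 'before'."""
    return 100 * (after - before) / before

def rate_per_1000_person_years(events, person_years):
    """Events divided by total follow-up, scaled to 1000 person-years."""
    return 1000 * events / person_years

# Keep the values used in the derivation alongside the derived value.
record = {"fev1_before": 2.5, "fev1_after": 3.1}   # hypothetical FEV1 values (litres)
record["fev1_pct_change"] = percentage_change(
    record["fev1_before"], record["fev1_after"]
)
print(round(record["fev1_pct_change"], 1))
```

Storing the 'before' and 'after' values, and not only the percentage, keeps the information needed to judge the clinical relevance of the change.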

Censored data

We may come across censored data in situations illustrated by the following examples.

If we measure laboratory values using a tool that can only detect levels above a certain cut-off value, then any values below this cut-off will not be detected, i.e. they are censored. For example, when measuring virus levels, those below the limit of detectability will often be reported as ‘undetectable’ or ‘unquantifiable’ even though there may be some virus in the sample. In this situation, if the lower cut-off of a tool is x, say, the results may be reported as ‘<x’. Similarly, some tools may only be able to reliably quantify levels below a certain cut-off value, say y; any measurements above that value will also be censored and the test result may be reported as ‘>y’.
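One possible way of holding such results on a computer is to store the cut-off together with a flag that shows the direction of censoring, so that a censored result is not mistaken for an exact measurement. The sketch below assumes that results arrive as text such as '<50' or '>100000'; this reporting format and the function name are hypothetical.

```python
def parse_lab_result(text):
    """Split a reported result into (value, censoring).

    censoring is 'left' for '<x', 'right' for '>y' and None for an
    exact measurement.
    """
    text = text.strip()
    if text.startswith("<"):
        return float(text[1:]), "left"
    if text.startswith(">"):
        return float(text[1:]), "right"
    return float(text), None

reported = ["<50", "1200", ">100000", " 340 "]   # hypothetical virus levels
parsed = [parse_lab_result(r) for r in reported]
print(parsed)
```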

We may encounter censored data when following patients in a trial in which, for example, some patients withdraw from the trial before the trial has ended. This type of data is discussed in more detail in Chapter 44.

2 Data entry

Learning objectives

By the end of this chapter, you should be able to:

Describe different formats for entering data on to a computer

Outline the principles of questionnaire design

Distinguish between single-coded and multi-coded variables

Describe how to code missing values

Relevant Workbook questions: MCQs 1, 3 and 4; and SQ 1 available online

When you carry out any study you will almost always need to enter the data into a computer package. Computers are invaluable for improving the accuracy and speed of data collection and analysis, making it easy to check for errors, produce graphical summaries of the data and generate new variables. It is worth spending some time planning data entry – this may save considerable effort at later stages.

Formats for data entry

There are a number of ways in which data can be entered and stored on a computer. Most statistical packages allow you to enter data directly. However, the limitation of this approach is that often you cannot move the data to another package. A simple alternative is to store the data in either a spreadsheet or database package. Unfortunately, their statistical procedures are often limited, and it will usually be necessary to output the data into a specialist statistical package to carry out analyses.

A more flexible approach is to have your data available as an ASCII or text file. Once in an ASCII format, the data can be read by most packages. ASCII format simply consists of rows of text that you can view on a computer screen. Usually, each variable in the file is separated from the next by some delimiter, often a space or a comma. This is known as free format.

The simplest way of entering data in ASCII format is to type the data directly in this format using either a word processing or editing package. Alternatively, data stored in spreadsheet packages can be saved in ASCII format. Using either approach, it is customary for each row of data to correspond to a different individual in the study, and each column to correspond to a different variable, although it may be necessary to go on to subsequent rows if data from a large number of variables are collected on each individual.
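As an illustration of free format, the Python sketch below reads comma-delimited text in which each row is an individual and each column a variable. The file contents and variable names are invented for the example; in practice the text would be read from a file, not from a string.

```python
import csv
import io

# Hypothetical comma-delimited data: one row per individual,
# one column per variable, with variable names in the first row.
ascii_text = """id,age,weight_kg
1,34,61.5
2,29,70.2
3,41,58.0
"""

def read_free_format(handle, delimiter=","):
    """Return a list of dictionaries, one per row of the delimited text."""
    return list(csv.DictReader(handle, delimiter=delimiter))

rows = read_free_format(io.StringIO(ascii_text))
# Values are read as text and must be converted before analysis.
weights = [float(row["weight_kg"]) for row in rows]
print(weights)
```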

Planning data entry

When collecting data in a study you will often need to use a form or questionnaire for recording the data. If these forms are designed carefully, they can reduce the amount of work that has to be done when entering the data. Generally, these forms/questionnaires include a series of boxes in which the data are recorded – it is usual to have a separate box for each possible digit of the response.

Categorical data

Some statistical packages have problems dealing with non-numerical data. Therefore, you may need to assign numerical codes to categorical data before entering the data into the computer. For example, you may choose to assign the codes of 1, 2, 3 and 4 to categories of ‘no pain’, ‘mild pain’, ‘moderate pain’ and ‘severe pain’, respectively. These codes can be added to the forms when collecting the data. For binary data, e.g. yes/no answers, it is often convenient to assign the codes 1 (e.g. for ‘yes’) and 0 (for ‘no’).
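The coding scheme in the example above can be written down as a small codebook. The sketch below is one way of doing this in Python; the function names are hypothetical.

```python
# Codebooks taken from the example in the text.
PAIN_CODES = {"no pain": 1, "mild pain": 2, "moderate pain": 3, "severe pain": 4}
YES_NO_CODES = {"yes": 1, "no": 0}

def encode(response, codebook):
    """Convert a text response to its numerical code."""
    key = response.strip().lower()
    if key not in codebook:
        raise ValueError(f"unrecognised response: {response!r}")
    return codebook[key]

def decode(code, codebook):
    """Convert a numerical code back to its category label."""
    reverse = {value: label for label, value in codebook.items()}
    return reverse[code]

print(encode("Moderate pain", PAIN_CODES), encode("no", YES_NO_CODES))
```

Keeping the codebook with the data makes it possible to recover the category labels when the results are presented.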

Single-coded variables – there is only one possible answer to a question, e.g. ‘Is the patient dead?’ It is not possible to answer both ‘yes’ and ‘no’ to this question.

Multi-coded variables – more than one answer is possible for each respondent. For example, ‘What symptoms has this patient experienced?’ In this case, an individual may have experienced any of a number of symptoms. There are two ways to deal with this type of data depending upon which of the two following situations applies.

There are only a few possible symptoms, and individuals may have experienced many of them.

A number of different binary variables can be created that correspond to whether the patient has answered yes or no to the presence of each possible symptom. For example, ‘Did the patient have a cough?’, ‘Did the patient have a sore throat?’

There are a very large number of possible symptoms but each patient is expected to suffer from only a few of them.

A number of different nominal variables can be created; each successive variable allows you to name a symptom suffered by the patient. For example, ‘What was the first symptom the patient suffered?’, ‘What was the second symptom?’ You will need to decide in advance the maximum number of symptoms you think a patient is likely to have suffered.
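The two ways of entering a multi-coded variable can be sketched as follows. The symptom names, the variable names and the maximum of three symptoms are assumptions made for illustration.

```python
def to_binary_variables(reported, possible_symptoms):
    """Few possible symptoms: one yes/no (1/0) variable per symptom."""
    reported_set = set(reported)
    return {symptom: int(symptom in reported_set) for symptom in possible_symptoms}

def to_nominal_variables(reported, max_symptoms):
    """Many possible symptoms: successive variables naming each symptom."""
    if len(reported) > max_symptoms:
        raise ValueError("more symptoms reported than variables allowed for")
    padded = list(reported) + [None] * (max_symptoms - len(reported))
    return {f"symptom_{i}": s for i, s in enumerate(padded, start=1)}

binary = to_binary_variables(["cough"], ["cough", "sore throat"])
nominal = to_nominal_variables(["rash", "fever"], max_symptoms=3)
print(binary)
print(nominal)
```

In the second layout, unused variables are left empty (None here), which is why the maximum number of symptoms has to be decided in advance.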

Numerical data

Numerical data should be entered with the same precision as they are measured, and the unit of measurement should be consistent for all observations on a variable. For example, weight should be recorded in kilograms or in pounds, but not both interchangeably.

Multiple forms per patient

Sometimes, information is collected on the same patient on more than one occasion. It is important that there is some unique identifier (e.g. a serial number) relating to the individual that will enable you to link all of the data from an individual in the study.
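A minimal sketch of linking forms through a unique identifier is shown below. The variable name patient_id and the example records are hypothetical.

```python
baseline = [
    {"patient_id": 101, "sex": "F"},
    {"patient_id": 102, "sex": "M"},
]
visits = [
    {"patient_id": 101, "visit": 1, "weight_kg": 61.0},
    {"patient_id": 102, "visit": 1, "weight_kg": 80.5},
    {"patient_id": 101, "visit": 2, "weight_kg": 60.2},
]

def link_forms(baseline_forms, visit_forms):
    """Group every visit form under the patient it belongs to."""
    linked = {b["patient_id"]: {"baseline": b, "visits": []} for b in baseline_forms}
    for form in visit_forms:
        if form["patient_id"] not in linked:
            # A visit without a matching baseline form needs investigating.
            raise KeyError(f"no baseline form for patient {form['patient_id']}")
        linked[form["patient_id"]]["visits"].append(form)
    return linked

linked = link_forms(baseline, visits)
print(len(linked[101]["visits"]))
```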

Problems with dates and times

Dates and times should be entered in a consistent manner, e.g. either as day/month/year or month/day/year, but not interchangeably. It is important to find out what format the statistical package can read.

Coding missing values

You should consider what you will do with missing values before you enter the data. In most cases you will need to use some symbol to represent a missing value. Statistical packages deal with missing values in different ways. Some use special characters (e.g. a full stop or asterisk) to indicate missing values, whereas others require you to define your own code for a missing value (commonly used values are 9, 999 or −99). The value that is chosen should be one that is not possible for that variable. For example, when entering a categorical variable with four categories (coded 1, 2, 3 and 4), you may choose the value 9 to represent missing values. However, if the variable is ‘age of child’ then a different code should be chosen. Missing data are discussed in more detail in Chapter 3.
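The need for a different code for each variable can be illustrated with a short sketch. The variable names and the choice of −99 for ‘age of child’ are assumptions; the code of 9 for the four-category variable follows the example above.

```python
# Missing-value code chosen separately for each variable (hypothetical names).
MISSING_CODES = {"pain": 9, "age_of_child": -99}

def recode_missing(value, variable):
    """Replace the variable's missing-value code by None."""
    return None if value == MISSING_CODES[variable] else value

pain = [1, 4, 9, 2]            # 9 is not a valid pain category
child_ages = [9, 3, -99, 12]   # here 9 is a genuine age, so it must be kept

pain_clean = [recode_missing(v, "pain") for v in pain]
ages_clean = [recode_missing(v, "age_of_child") for v in child_ages]
print(pain_clean, ages_clean)
```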

Example

As part of a study on the effect of inherited bleeding disorders on pregnancy and childbirth, data were collected on a sample of 64 women registered at a single haemophilia centre in London. The women were asked questions relating to their bleeding disorder and their first pregnancy (or their current pregnancy if they were pregnant for the first time on the date of interview). Figure 2.1 shows the data from a small selection of the women after the data have been entered onto a spreadsheet, but before they have been checked for errors. The coding schemes for the categorical variables are shown at the bottom of Fig. 2.1. Each row of the spreadsheet represents a separate individual in the study; each column represents a different variable. Where the woman is still pregnant, the age of the woman at the time of birth has been calculated from the estimated date of the baby’s delivery. Data relating to the live births are summarized in Table 37.1 in Chapter 37.

Data kindly provided by Dr R.A. Kadir, University Department of Obstetrics and Gynaecology, and Professor C.A. Lee, Haemophilia Centre and Haemostasis Unit, Royal Free Hospital, London.

3 Error checking and outliers

Learning objectives

By the end of this chapter, you should be able to:

Describe how to check for errors in data

Distinguish between data that are missing completely at random, missing at random and missing not at random

Outline the methods of dealing with missing data, distinguishing between single and multiple imputation

Define an outlier

Explain how to check for and handle outliers

Relevant Workbook questions: MCQs 5 and 6; and SQs 1 and 28 available online

In any study there is always the potential for errors to occur in a data set, either at the outset when taking measurements, or when collecting, transcribing and entering the data into a computer. It is hard to eliminate all of these errors. However, you can reduce the number of typing and transcribing errors by checking the data carefully once they have been entered. Simply scanning the data by eye will often identify values that are obviously wrong. In this chapter we suggest a number of other approaches that you can use when checking data.

Typing errors

Typing mistakes are the most frequent source of errors when entering data. If the amount of data is small, then you can check the typed data set against the original forms/questionnaires to see whether there are any typing mistakes. However, this is time-consuming if the amount of data is large. It is possible to type the data in twice and compare the two data sets using a computer program. Any differences between the two data sets will reveal typing mistakes. Although this approach does not rule out the possibility that the same error has been incorrectly entered on both occasions, or that the value on the form/questionnaire is incorrect, it does at least minimize the number of errors. The disadvantage of this method is that it takes twice as long to enter the data, which may have major cost or time implications.
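The comparison of two typed versions of the same data can be sketched as below. The sketch assumes that both versions hold the same individuals in the same order; the data and function name are invented for illustration.

```python
def compare_entries(first, second):
    """List (row, variable, first value, second value) for each discrepancy."""
    if len(first) != len(second):
        raise ValueError("the two data sets have different numbers of rows")
    differences = []
    for i, (row_a, row_b) in enumerate(zip(first, second)):
        for variable in row_a:
            if row_a[variable] != row_b.get(variable):
                differences.append((i, variable, row_a[variable], row_b.get(variable)))
    return differences

first_entry = [
    {"id": 1, "age": 34, "weight_kg": 61.5},
    {"id": 2, "age": 29, "weight_kg": 70.2},
]
second_entry = [
    {"id": 1, "age": 43, "weight_kg": 61.5},   # digits transposed
    {"id": 2, "age": 29, "weight_kg": 7.02},   # decimal point misplaced
]
differences = compare_entries(first_entry, second_entry)
print(differences)
```

Each discrepancy shows only that the two versions disagree; the original form is still needed to decide which value, if either, is correct.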

Error checking

Categorical data

– it is relatively easy to check categorical data, as the responses for each variable can only take one of a number of limited values. Therefore, values that are not allowable must be errors.

Numerical data

– numerical data are often difficult to check but are prone to errors. For example, it is simple to transpose digits or to misplace a decimal point when entering numerical data. Numerical data can be range checked – that is, upper and lower limits can be specified for each variable. If a value lies outside this range then it is flagged up for further investigation.

Dates

– it is often difficult to check the accuracy of dates, although sometimes you may know that dates must fall within certain time periods. Dates can be checked to make sure that they are valid. For example, 30th February must be incorrect, as must any day of the month greater than 31, and any month greater than 12. Certain logical checks can also be applied. For example, a patient’s date of birth should correspond to his/her age, and patients should usually have been born before entering the study (at least in most studies). In addition, patients who have died should not appear for subsequent follow-up visits!

With all error checks, a value should only be corrected if there is evidence that a mistake has been made. You should not change values simply because they look unusual.
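The three kinds of check described above might be sketched as follows. The allowable codes, the weight limits and the variable names are hypothetical. The sketch only flags values for investigation and does not change them.

```python
from datetime import date

ALLOWED_SEX_CODES = {1, 2}        # hypothetical coding scheme
WEIGHT_LIMITS_KG = (0.5, 6.0)     # hypothetical lower and upper limits

def check_category(value, allowed):
    """Categorical check: the value must be one of the allowable codes."""
    return value in allowed

def check_range(value, lower, upper):
    """Range check: the value must lie between the specified limits."""
    return lower <= value <= upper

def check_date(year, month, day):
    """Validity check: date() raises ValueError for e.g. 30th February."""
    try:
        date(year, month, day)
    except ValueError:
        return False
    return True

def flag_record(record):
    """Return the names of the variables that fail a check."""
    problems = []
    if not check_category(record["sex_of_baby"], ALLOWED_SEX_CODES):
        problems.append("sex_of_baby")
    if not check_range(record["weight_kg"], *WEIGHT_LIMITS_KG):
        problems.append("weight_kg")
    if not check_date(*record["birth_date"]):
        problems.append("birth_date")
    return problems

suspect = {"sex_of_baby": 41, "weight_kg": 11.19, "birth_date": (2015, 2, 30)}
print(flag_record(suspect))
```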

Handling missing data

There is always a chance that some data will be missing. If a large proportion of the data is missing, then the results are unlikely to be reliable. The reasons why data are missing should always be investigated – if missing data tend to cluster on a particular variable and/or in a particular subgroup of individuals, then it may indicate that the variable is not applicable or has never been measured for that group of individuals. If this is the case, it may be necessary to exclude that variable or group of individuals from the analysis. There are different types of missing data¹:

Missing completely at random (MCAR)

– the missing values are truly randomly distributed in the data set and the fact that they are missing is unrelated to any study variable. The resulting parameter estimates are unlikely to be biased (Chapter 34). An example is when a patient fails to attend a hospital appointment because he is in a car accident.

Missing at random (MAR)

– the missing values of a variable do not depend on that variable but can be completely explained by non-missing values of one or more of the other variables. For example, suppose that individuals are asked to keep a diet diary if their BMI is above 30 kg/m²: the missing diet diary data are MAR because missingness is completely determined by BMI (those with a BMI below the cut-off do not complete the diet diary).

Missing not at random (MNAR)

– the chance that data on a particular variable are missing is strongly related to that variable. In this situation, our results may be severely biased. For example, suppose we are interested in a measurement that reflects the health status of patients and this information is missing for some patients because they were not well enough to attend their clinic appointments: we are likely to get an overly optimistic overall view of the patients’ health if we take no account of the missing data in the analysis.

Provided the missing data are not MNAR, we may be able to estimate (impute¹) the missing data². A simple approach is to replace a missing observation by the mean of the existing observations for that variable or, if the data are longitudinal, by the last observed value. These are examples of single imputation. In multiple imputation, we create a number (generally up to five) of imputed data sets from the original data set, with the missing values replaced by imputed values which are derived from an appropriate model that incorporates random variation. We then use standard statistical procedures on each complete imputed data set and finally combine the results from these analyses. Alternative statistical approaches to dealing with missing data are available², but the best option is to minimize the amount of missing data at the outset.
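The two single imputation approaches mentioned above can be sketched in a few lines of Python. Missing values are represented by None, which is an assumption of this sketch; multiple imputation needs a model that incorporates random variation and is not attempted here.

```python
from statistics import mean

def impute_mean(values):
    """Replace each missing value by the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def impute_last_observed(series):
    """Longitudinal data: carry the last observed value forward."""
    filled, last = [], None
    for v in series:
        if v is not None:
            last = v
        # A missing value before any observation stays missing (None).
        filled.append(last)
    return filled

print(impute_mean([4, None, 6, 8]))
print(impute_last_observed([5, None, None, 7, None]))
```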

Outliers

What are outliers?

Outliers are observations that are distinct from the main body of the data, and are incompatible with the rest of the data. These values may be genuine observations from individuals with very extreme levels of the variable. However, they may also result from typing errors or the incorrect choice of units, and so any suspicious values should be checked. It is important to detect whether there are outliers in the data set, as they may have a considerable impact on the results from some types of analyses (Chapter 29).

For example, a woman who is 7 feet tall would probably appear as an outlier in most data sets. However, although this value is clearly very high, compared with the usual heights of women, it may be genuine and the woman may simply be very tall. In this case, you should investigate this value further, possibly checking other variables such as her age and weight, before making any decisions about the validity of the result. The value should only be changed if there really is evidence that it is incorrect.

Checking for outliers

A simple approach is to print the data and check them by eye. This is suitable if the number of observations is not too large and if the potential outlier is much lower or higher than the rest of the data. Range checking should also identify possible outliers. Alternatively, the data can be plotted in some way (Chapter 4) – outliers can be clearly identified on histograms and scatter plots (see also Chapter 29 for a discussion of outliers in regression analysis).

Handling outliers

It is important not to remove an individual from an analysis simply because his/her values are higher or lower than might be expected. However, the inclusion of outliers may affect the results when some statistical techniques are used. A simple approach is to repeat the analysis both including and excluding the value – this is a type of sensitivity analysis (Chapter 35). If the results are similar, then the outlier does not have a great influence on the result. However, if the results change drastically, it is important to use appropriate methods that are not affected by outliers to analyse the data. These include the use of transformations (Chapter 9) and non-parametric tests (Chapter 17).
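A simple version of this sensitivity analysis is sketched below, using the mean and median as the summaries of interest. The weights are hypothetical, with one suspiciously high value included.

```python
from statistics import mean, median

def sensitivity_to_value(values, suspect_index):
    """Summaries calculated with and without the suspect observation."""
    without = values[:suspect_index] + values[suspect_index + 1:]
    return {
        "mean_with": mean(values),
        "mean_without": mean(without),
        "median_with": median(values),
        "median_without": median(without),
    }

weights_kg = [3.1, 3.4, 2.9, 3.6, 11.19, 3.3]   # hypothetical birth weights
result = sensitivity_to_value(weights_kg, suspect_index=4)
for name, value in result.items():
    print(name, round(value, 2))
```

With these values the mean changes substantially when the suspect value is excluded, whereas the median hardly changes.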

Example

After entering the data described in Chapter 2, the data set is checked for errors (Fig. 3.1). Some of the inconsistencies highlighted are simple data entry errors. For example, the code of ‘41’ in the ‘Sex of baby’ column is incorrect as a result of the sex information being missing for patient 20; the rest of the data for patient 20 had been entered in the incorrect columns. Others (e.g. unusual values in the gestational age and weight columns) are likely to be errors, but the notes should be checked before any decision is made, as these may reflect genuine outliers. In this case, the gestational age of patient number 27 was 41 weeks, and it was decided that a weight of 11.19 kg was incorrect. As it was not possible to find the correct weight for this baby, the value was entered as missing.

References

1 Bland, M. (2015) An Introduction to Medical Statistics. 4th edition. Oxford University Press.

2 Horton, N.J. and Kleinman, K.P. (2007) Much ado about nothing: a comparison of missing data methods and software to fit incomplete data regression models. American Statistician, 61(1), 71–90.

4 Displaying data diagrammatically

Learning objectives

By the end of this chapter, you should be able to:

Explain what is meant by a frequency distribution

Describe the shape of a frequency distribution

Describe the following diagrams: (segmented) bar or column chart, pie chart, histogram, dot plot, stem-and-leaf plot, box-and-whisker plot, scatter diagram

Explain how to identify outliers from a diagram in various circumstances

Describe the situations when it is appropriate to use connecting lines in a diagram

Relevant Workbook questions: MCQs 7, 8, 9, 37 and 50; and SQs 1 and 9 available online

One of the first things that you may wish to do when you have entered your data into a computer is to summarize them in some way so that you can get a ‘feel’ for the data. This can be done by producing diagrams, tables or summary statistics (Chapters 5 and 6). Diagrams are often powerful tools for conveying information about the data, for providing simple summary pictures, and for spotting outliers and trends before any formal analyses are performed.

One variable

Frequency distributions

An empirical frequency distribution of a variable relates each possible observation, class of observations (i.e. range of values) or category, as appropriate, to its observed frequency of occurrence. If we replace each frequency by a relative frequency (the percentage of the total frequency), we can compare frequency distributions in two or more groups of individuals.
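As a small illustration, an empirical frequency distribution with relative frequencies can be obtained as follows. The pain categories used as data are hypothetical.

```python
from collections import Counter

def frequency_distribution(observations):
    """Map each category to (frequency, relative frequency in %)."""
    counts = Counter(observations)
    total = sum(counts.values())
    return {category: (n, 100 * n / total) for category, n in counts.items()}

pain = ["mild", "none", "mild", "severe"]
distribution = frequency_distribution(pain)
print(distribution)
```

Because relative frequencies are expressed as percentages of the total, they can be compared between groups of different sizes.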

Displaying frequency distributions

Once the frequencies (or relative frequencies) have been obtained for categorical or some discrete numerical data, these can be displayed visually.

Bar or column chart

– a separate horizontal or vertical bar is drawn for each category, its length being proportional to the frequency in that category. The bars are separated by small gaps to indicate that the data are categorical or discrete (Fig. 4.1a).

Pie chart

– a circular ‘pie’ is split into sectors, one for each category, so that the area of each sector is proportional to the frequency in that category (Fig. 4.1b).

It is often more difficult to display continuous numerical data, as the data may need to be summarized before being drawn. Commonly used diagrams include the following:

Histogram

– this is similar to a bar chart, but there should be no gaps between the bars as the data are continuous (Fig. 4.1d). The width of each bar of the histogram relates to a range of values for the variable. For example, the baby’s weight (Fig. 4.1d) may be categorized into 1.75–1.99 kg, 2.00–2.24 kg, …, 4.25–4.49 kg. The area of the bar is proportional to the frequency in that range. Therefore, if one of the groups covers a wider range than the others, its base will be wider and height shorter to compensate. Usually, between five and 20 groups are chosen; the ranges should be narrow enough to illustrate patterns in the data, but should not be so narrow that they are the raw data. The histogram should be labelled carefully to make it clear where the boundaries lie.

Dot plot

– each observation is represented by one dot on a horizontal (or vertical) line (Fig. 4.1e). This type of plot is very simple to draw, but can be cumbersome with large data sets. Often a summary measure of the data, such as the mean or median (Chapter 5), is shown on the diagram. This plot may also be used for discrete data.

Stem-and-leaf plot

– this is a mixture of a diagram and a table; it looks similar to a histogram turned on its side, and is effectively the data values written in increasing order of size. It is usually drawn with a vertical stem, consisting of the first few digits of the values, arranged in order. Protruding from this stem are the leaves – i.e. the final digit of each of the ordered values, which are written horizontally (Fig. 4.2) in increasing numerical order.

Box plot (often called a box-and-whisker plot) – this is a vertical or horizontal rectangle, with the ends of the rectangle corresponding to the upper and lower quartiles of the data values (Chapter 6). A line drawn through the rectangle corresponds to the median value (Chapter 5). Whiskers, starting at the ends of the rectangle, usually indicate minimum and maximum values but sometimes relate to particular percentiles, e.g. the 5th and 95th percentiles (Fig. 6.1). Outliers may be marked.

The ‘shape’ of the frequency distribution

The choice of the most appropriate statistical method will often depend on the shape of the distribution. The distribution of the data is usually unimodal in that it has a single ‘peak’. Sometimes the distribution is bimodal (two peaks) or uniform (each value is equally likely and there are no peaks). When the distribution is unimodal, the main aim is to see where the majority of the data values lie, relative to the maximum and minimum values. In particular, it is important to assess whether the distribution is:

symmetrical

– centred around some mid-point, with one side being a mirror-image of the other (Fig. 5.1);

skewed to the right (positively skewed)

– a long tail to the right with one or a few high values. Such data are common in medical research (Fig. 5.2);

skewed to the left (negatively skewed)

– a long tail to the left with one or a few low values (Fig. 4.1d).

Two variables

If one variable is categorical, then separate diagrams showing the distribution of the second variable can be drawn for each of the categories. Other plots suitable for such data include clustered or segmented bar or column charts (Fig. 4.1c).

If both of the variables are numerical or ordinal, then the relationship between the two can be illustrated using a scatter diagram (Fig. 4.1f). This plots one variable against the other in a two-way diagram. One variable is usually termed the x variable and is represented on the horizontal axis. The second variable, known as the y variable, is plotted on the vertical axis.

Identifying outliers using graphical methods

We can often use single-variable data displays to identify outliers. For example, a very long tail on one side of a histogram may indicate an outlying value. However, outliers may sometimes only become apparent when considering the relationship between two variables. For example, a weight of 55 kg would not be unusual for a woman who was 1.6 m tall, but would be unusually low if the woman’s height was 1.9 m.
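The weight and height example can be made concrete by considering the two variables together, for instance through BMI (Chapter 1). In the sketch below, the plausibility limits for BMI are hypothetical and serve only to flag values for further investigation.

```python
def bmi(weight_kg, height_m):
    """Body mass index: weight (kg) divided by height squared (m^2)."""
    return weight_kg / height_m ** 2

# Hypothetical limits outside which a weight/height pair is queried.
BMI_LOWER, BMI_UPPER = 16.0, 45.0

def flag_weight_for_height(weight_kg, height_m):
    """True if the pair of values looks unusual when taken together."""
    return not (BMI_LOWER <= bmi(weight_kg, height_m) <= BMI_UPPER)

for height_m in (1.6, 1.9):
    print(height_m, round(bmi(55, height_m), 1), flag_weight_for_height(55, height_m))
```

Neither 55 kg nor 1.9 m would be flagged by a range check on its own; it is only the combination that appears unusual.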

The use of connecting lines in diagrams

The use of connecting lines in scatter diagrams may be misleading. Connecting lines suggest that the values on the x-axis are ordered in some way – this might be the case if, for example, the x-axis reflects some measure of time or dose. Where this is not the case, the points should not be joined with a line. Conversely, if there is a dependency between different points (e.g. because they relate to results from the same individual at two different time points, such as before and after treatment), it is helpful to connect the relevant points by a straight line (Fig. 20.1) and important information may be lost if these lines are omitted.

5 Describing data: the ‘average’

Learning objectives

By the end of this chapter, you should be able to:

Explain what is meant by an average

Describe the appropriate use of each of the following types of average: arithmetic mean, mode, median, geometric mean, weighted mean

Explain how to calculate each type of average

List the advantages and disadvantages of each type of average

Relevant Workbook questions: MCQs 1, 10, 11, 12, 13, 19 and 39; and SQs 2, 3, 4 and 9 available online

Summarizing data

It is very difficult to have any ‘feeling’ for a set of numerical measurements unless we can summarize the data in a meaningful way. A diagram (Chapter 4) is often a useful starting point. We can also condense the information by providing measures that describe the important characteristics of the data. In particular, if we have some perception of what constitutes a representative value, and if we know how widely scattered the observations are around it, then we can formulate an image of the data. The average is a general term for a measure of location; it describes a typical measurement. We devote this chapter to averages, the most common being the mean and median (Table 5.1). We introduce measures that describe the scatter or spread of the observations in Chapter 6.

Table 5.1 Advantages and disadvantages of averages.

| Type of average | Advantages | Disadvantages |
| --- | --- | --- |
| Mean | Uses all the data values; algebraically defined and so mathematically manageable; known sampling distribution (Chapter 9) | Distorted by outliers; distorted by skewed data |
| Median | Not distorted by outliers; not distorted by skewed data | Ignores most of the information; not algebraically defined; complicated sampling distribution |
| Mode | Easily determined for categorical data | Ignores most of the information; not algebraically defined; unknown sampling distribution |
| Geometric mean | Before back-transformation, it has the same advantages as the mean; appropriate for right-skewed data | Only appropriate if the log transformation produces a symmetrical distribution |
| Weighted mean | Same advantages as the mean; ascribes relative importance to each observation; algebraically defined | Weights must be known or estimated |

The arithmetic mean

The arithmetic mean, often simply called the mean, of a set of values is calculated by adding up all the values and dividing this sum by the number of values in the set.

It is useful to be able to summarize this verbal description by an algebraic formula. Using mathematical notation, we write our set of n observations of a variable, x, as x1, x2, x3, …, xn. For example, x might represent an individual’s height (cm), so that x1 represents the height of the first individual, and xi the height of the i