Statistics for Terrified Biologists

Helmut F. van Emden

Description

Makes mathematical and statistical analysis understandable to even the least math-minded biology student

This unique textbook aims to demystify statistical formulae for the average biology student. Written in a lively and engaging style, Statistics for Terrified Biologists, 2nd Edition draws on the author’s 30 years of lecturing experience to teach statistical methods to even the most guarded of biology students. It presents basic methods using straightforward, jargon-free language. Students are taught to use simple formulae and how to interpret what is being measured with each test and statistic, while at the same time learning to recognize overall patterns and guiding principles. Complemented by simple examples and useful case studies, this is an ideal statistics resource tool for undergraduate biology and environmental science students who lack confidence in their mathematical abilities. 

Statistics for Terrified Biologists presents readers with the basic foundations of parametric statistics, the t-test, analysis of variance, linear regression and chi-square, and guides them to important extensions of these techniques. It introduces them to non-parametric tests, and includes a checklist of non-parametric methods linked to their parametric counterparts. The book also provides many end-of-chapter summaries and additional exercises to help readers understand and practice what they’ve learned.

  • Presented in a clear and easy-to-understand style
  • Makes statistics tangible and enjoyable for even the most hesitant student
  • Features multiple formulas to facilitate comprehension
  • Written by one of the foremost entomologists of his generation

This second edition of Statistics for Terrified Biologists is an invaluable guide that will be of great benefit to pre-health and biology undergraduate students.




Table of Contents

Cover

Preface to the second edition

Preface to the first edition

1 How to use this book

Introduction

The text of the chapters

What should you do if you run into trouble?

Elephants

The numerical examples in the text

Boxes

Spare‐time activities

Executive summaries

Why go to all that bother?

The bibliography

2 Introduction

What are statistics?

Notation

Notation for calculating the mean

3 Summarising variation

Introduction

Different summaries of variation

Why n − 1?

Why are the deviations squared?

The standard deviation

The next chapter

4 When are sums of squares NOT sums of squares?

Introduction

Calculating machines offer a quicker method of calculating the sum of squares

Avoid being confused by the term sum of squares

Summary of the calculator method for calculations as far as the standard deviation

5 The normal distribution

Introduction

Frequency distributions

The normal distribution

What percentage is a standard deviation worth?

Are the percentages always the same as these?

Other similar scales in everyday life

The standard deviation as an estimate of the frequency of a number occurring in a sample

From percentage to probability

EXECUTIVE SUMMARY 1 The standard deviation

6 The relevance of the normal distribution to biological data

To recap

Is our observed distribution normal?

What can we do about a distribution that clearly is not normal?

How many samples are needed?

7 Further calculations from the normal distribution

Introduction

Is A bigger than B?

The yardstick for deciding

Derivation of the standard error of a difference between two means

The importance of the standard error of differences between means

Summary of this chapter

EXECUTIVE SUMMARY 2 Standard error of a difference between two means

8 The t‐test

Introduction

The principle of the t‐test

The t‐test in statistical terms

Why t?

Tables of the t‐distribution

The standard t‐test

t‐test for means associated with unequal variances

The paired t‐test

EXECUTIVE SUMMARY 3 The t‐test

9 One tail or two?

Introduction

Why is the analysis of variance F‐test one‐tailed?

The two‐tailed F‐test

How many tails has the t‐test?

The final conclusion on number of tails

10 Analysis of variance (ANOVA): what is it? How does it work?

Introduction

Sums of squares in ANOVA

Some ‘made‐up’ variation to analyse by ANOVA

The sum of squares table

Using ANOVA to sort out the variation in Table C

The relationship between t and F

Constraints on ANOVA

Comparison between treatment means in ANOVA

The least significant difference

A caveat about using the LSD

EXECUTIVE SUMMARY 4 The principle of ANOVA

11 Experimental designs for analysis of variance (ANOVA)

Introduction

Fully randomised

Randomised blocks

Incomplete blocks

Latin square

Split plot

Types of analysis of variance

EXECUTIVE SUMMARY 5 Analysis of a one‐way randomised block experiment

12 Introduction to factorial experiments

What is a factorial experiment?

Interaction: what does it mean biologically?

How does a factorial experiment change the form of the analysis of variance?

Sums of squares for interactions

13 2‐Factor factorial experiments

Introduction

An example of a 2‐factor experiment

Analysis of the 2‐factor experiment

Two important things to remember about factorials before tackling the next chapter

Analysis of factorial experiments with unequal replication

EXECUTIVE SUMMARY 6 Analysis of a 2‐factor randomised block experiment

14 Factorial experiments with more than two factors – leave this out if you wish!

Introduction

Different ‘orders’ of interaction

Example of a 4‐factor experiment

15 Factorial experiments with split plots

Introduction

Deriving the split plot design from the randomised block design

Degrees of freedom in a split plot analysis

Numerical example of a split plot experiment and its analysis

Comparison of split plot and randomised block experiments

Uses of split plot designs

16 The t‐test in the analysis of variance

Introduction

Brief recap of relevant earlier sections of this book

Least significant difference test

Multiple range tests

Testing differences between means

Presentation of the results of tests of differences between means

The results of the experiments analysed by analysis of variance in Chapters 11–15

Some final advice

17 Linear regression and correlation

Introduction

Cause and effect

Other traps waiting for you to fall into

Regression

Independent and dependent variables

The regression coefficient (b)

Calculating the regression coefficient (b)

The regression equation

A worked example on some real data

Correlation

Extensions of regression analysis

EXECUTIVE SUMMARY 7 Linear regression

18 Analysis of covariance (ANCOVA)

Introduction

A worked example of ANCOVA

EXECUTIVE SUMMARY 8 Analysis of covariance (ANCOVA)

19 Chi‐square tests

Introduction

When not and where not to use χ²

The problem of low frequencies

Yates' correction for continuity

The χ² test for goodness of fit

Association (or contingency) χ²

20 Nonparametric methods (what are they?)

Disclaimer

Introduction

Advantages and disadvantages of parametric and nonparametric methods

Some ways data are organised for nonparametric tests

The main nonparametric methods that are available

Appendix A: How many replicates?

Introduction

Underlying concepts

‘Cheap and cheerful’ calculation of number of replicates needed

More accurate calculation of number of replicates needed

How to prove a negative

Appendix B: Statistical tables

Appendix C: Solutions to spare‐time activities

Chapter 3

Chapter 4

Chapter 7

Chapter 8

Chapter 11

Chapter 13

Chapter 14

Chapter 15

Chapter 16

Chapter 17

Chapter 18

Chapter 19

The Clues (See ‘Spare‐time Activity’ to Chapter , p. 261)

Appendix D: Bibliography

Introduction

The Internet

Index

End User License Agreement

List of Tables

Chapter 10

Table A

Table B

Table C

Table D

Table E

Table F

Table G

Table H

Table I

Table J

Chapter 15

A Varieties A, B, C; S, sprayed; U, unsprayed.

B

C

Chapter 18

Table 18.1

Table 18.2

Table 18.3

List of Illustrations

Chapter 3

Fig. 3.1 An unusual ruler! The familiar scale increasing left to right has...

Chapter 5

Fig. 5.1 An example of the normal distribution: the frequencies in which 1...

Fig. 5.2 The normal distribution curve of Figure 5.1 with the egg weight s...

Fig. 5.3 Demonstration that the areas under the normal curve containing th...

Fig. 5.4 The unifying concept of the scale of standard deviations for two ...

Fig. 5.5 A scale of percentage observations included by different standard...

Chapter 6

Fig. 6.1 The frequency of occurrence of larvae of the frit fly (Oscinella...

Fig. 6.2 Results of calculating the standard deviation of repeat samples o...

Fig. 6.3 The frequency distribution of frit fly larvae (see Figure 6.1) wi...

Chapter 7

Fig. 7.1 The unifying concept of the scale of standard deviations/errors f...

Fig. 7.2 Figure 7.1 redrawn so that the same scale is used for weight, whe...

Fig. 7.3 Random sampling of the same population of numbers gives a much sm...

Fig. 7.4 Differences between two means illustrated, with the notation for ...

Fig. 7.5 Schematic representation of how the standard error of difference ...

Fig. 7.6 Figure 7.1 of the normal curve for egg weights (showing the propo...

Chapter 8

Fig. 8.1 Improved estimate of true variance (dotted horizontal lines) with...

Fig. 8.2 Confidence limits of 95% for observations of the most variable di...

Fig. 8.3 Summary of the procedure involved in the basic t‐test.

Chapter 9

Fig. 9.1 Areas under the normal curve appropriate to probabilities of 20%...

Chapter 10

Fig. 10.1 F values for four treatments (n − 1 = 3), graphed from tables of...

Fig. 10.2 The comparison of only seven numbers already involves 21 tests!...

Chapter 11

Fig. 11.1 Fully randomised design: an example of how 4 repeats of 3 treat...

Fig. 11.2 Fully randomised design: the layout of Figure 11.1 with data add...

Fig. 11.3 Randomised block design: a possible randomisation of three treat...

Fig. 11.4 Randomised block design: how not to do it! Four blocks of 10 tre...

Fig. 11.5 Randomised block design: the layout of Figure 11.3 with data add...

Fig. 11.6 Incomplete randomised block design: a possible layout of seven t...

Fig. 11.7 Lattice design of incomplete blocks: here pairs in a block (e.g....

Fig. 11.8 Latin square design: (a) a possible layout for four treatments; ...

Fig. 11.9 Latin square design: the layout of Figure 11.8a with data added....

Fig. 11.10 Latin square designs: a possible randomisation of four treatmen...

Chapter 12

Fig. 12.1 The same six factorial combinations of two factors laid out as ...

Fig. 12.2 Time in seconds taken by a walker, a bike, and a car to cover 50...

Chapter 13

Fig. 13.1 A 2‐factor randomised block experiment with three blocks applyi...

Chapter 15

Fig. 15.1 Relationship between a randomised and a split plot design: (a) ...

Fig. 15.2 Yield of sprouts (kg/plot) from the experimental design shown in...

Chapter 17

Fig. 17.1 Relationship between population density in an area and the annu...

Fig. 17.2 Decline in deaths per 1000 people from malaria following the int...

Fig. 17.3 (a) The plot of increasing per cent kill of insects against incr...

Fig. 17.4 The triumph of statistics over common sense. Each treatment mean...

Fig. 17.5 (a) The data points would seem to warrant the conclusion that la...

Fig. 17.6 Data which fail to meet the criterion for regression or correlat...

Fig. 17.7 The importance of allocating the related variables to the correc...

Fig. 17.8 The characteristics of the regression (b) and correlation (r) co...

Fig. 17.9 Diagrammatic representation of a regression line as a plank laid...

Fig. 17.10 (a) Six data points for which one might wish to calculate a reg...

Fig. 17.11 The distances of each point in Figure 17.10b from the mean of t...

Fig. 17.12 Eight points each with a b of + or − 2. Including all points in...

Fig. 17.13 The calculated regression line added to Figure 17.10a.

Fig. 17.14 (a) A regression line has an intercept (a) value on the vertica...

Fig. 17.15 The calculated regression line for the data points relating the...

Fig. 17.16 The distances for which sums of squares are calculated in the r...

Fig. 17.17 (a) The correlation between length (y) and width (x) of Azalea...

Fig. 17.18 Examples of nonlinear regression lines together with the form o...

Chapter 18

Fig. 18.1 The relationship between regression analysis and analysis of co...

Fig. 18.2 Regression lines for cholesterol level on age relevant to each d...

Fig. 18.3 (a) The regression of cholesterol level on age. (b) Another data...

Fig. 18.4 An enlarged portion of Figure 18.3 showing the effect of adjusti...

Fig. 18.5 The relationship between regression analysis and analysis of cov...


Statistics for Terrified Biologists

Second Edition

Helmut F. van Emden

Emeritus Professor of Horticulture, School of Biological Sciences, The University of Reading, UK

Copyright

This second edition first published 2019

© 2019 John Wiley & Sons Ltd

Edition History

Blackwell. (1e 2008)

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.

The right of Helmut F. van Emden to be identified as the author of this work has been asserted in accordance with law.

Registered Office

John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA

John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK

Editorial Office

9600 Garsington Road, Oxford, OX4 2DQ, UK

For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.

Wiley also publishes its books in a variety of electronic formats and by print‐on‐demand. Some content that appears in standard print versions of this book may not be available in other formats.

Limit of Liability/Disclaimer of Warranty

While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials or promotional statements for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

Library of Congress Cataloging‐in‐Publication Data

Names: van Emden, H. F. (Helmut Fritz), author.

Title: Statistics for terrified biologists / Helmut F. van Emden (emeritus professor of horticulture, School of Biological Sciences, The University of Reading, UK).

Description: 2nd edition. | Hoboken, NJ : Wiley, 2019. | Includes bibliographical references and index.

Identifiers: LCCN 2019001708 (print) | LCCN 2019002653 (ebook) | ISBN 9781119563693 (Adobe PDF) | ISBN 9781119563686 (ePub) | ISBN 9781119563679 (pbk.)

Subjects: LCSH: Biometry–Textbooks. | Statistics–Textbooks.

Classification: LCC QH323.5 (ebook) | LCC QH323.5 .V33 2019 (print) | DDC 570.1/5195–dc23

LC record available at https://lccn.loc.gov/2019001708

Cover Design: Wiley

Cover Image: © gulfu photography/Getty Images

Preface to the second edition

I have been astounded by the positive reception the first edition of my little book has received. It has clearly filled a need; students from many different countries have emailed me just to say ‘Thank you for writing the book.’ Student reviews on Amazon have given my book high praise, and colleagues teaching statistics to biologists have also found it of value. One review complained about the ‘lack of maths’. I take this as a compliment, and the review arguing that the availability of computer packages makes it unnecessary to be able to carry out the calculations ‘long‐hand’ seems to me to completely miss the point!

But I have to hang my head in shame at the numerous errors in the first printing of 2008. The numerical examples in particular were littered with computational errors (not infrequently in the form of dyslexic transpositions). The book went through several iterations before going to press, and I really should have rechecked all the calculations in the final proof. I can only apologise. However, these many errors led to a very encouraging outcome: they were largely identified by users, who then contacted me convinced they had found an error. The statistics teaching I received never gave me similar confidence. That students were sure they were right and that I had made an error is almost the best evidence of the success of my book that I could have asked for.

So what's new in this second edition? The statistics that my book tries to teach are the concepts elaborated by R.A. Fisher in the 1930s, and today these remain the basis of the statistical procedures based on the normal distribution. I can tweak the English, change sections to make them more easily understood, and add some further extensions of Fisher's concepts – and all this I have done. I have been surprised how I, as a reader of the text of the first edition, have found sentences completely incomprehensible that were written by me as author 10 years ago! I have used the opportunity of the second edition to try to clarify these passages for both you and me! I have greatly revised and, I hope, simplified the chapter on linear regression and correlation and added some material to the chapter on chi‐square tests. However, the most significant addition has been that analysis of covariance, just briefly mentioned in the first edition, has been allocated a chapter of its own. It deserves emphasis as a valuable statistical technique, and I find most textbooks shroud it in mystery. Although I would advise you to use computer programs if you want to use analysis of covariance, I thought it a good idea to demystify it as much as I can. The calculation of numbers in analysis of covariance is usually presented as having no connection with more familiar calculations, yet they come from the standard techniques of analysis of variance and regression. The point is that only some of the resulting values are then used while the rest are discarded, but doing the analyses that include these redundant numbers does make the whole thing much more comprehensible.

I can only hope that this second edition proves as popular as the first.

Helmut F. van Emden

July 2018

Preface to the first edition

I have written/edited several books on my own speciality, agricultural and horticultural entomology, but always at the request of a publisher or colleagues. This book is different in two ways. Firstly, it is a book I have positively wanted to write for many years and secondly, I am stepping out of my ‘comfort zone’ in doing so.

The origins of this book stem from my appointment to the Horticulture Department of Reading University under Professor O. V. S. Heath, FRS. Professor Heath appreciated the importance of statistics and, at a time when there was no University‐wide provision of statistics teaching, he taught the final year students himself. I became the ‘assistant’ whose role it was to run the practical exercises which followed the Professor's lecture. You cannot teach what you do not understand yourself, but I tried nonetheless.

Eventually I took over the entire course. By then it was taught in the second year and in the third year the students went on to take a Faculty‐wide course. I did not continue the lectures; the whole course was in the laboratory where I led the students (using pocket calculators) through the calculations in stages. The laboratory class environment involved continuous interaction with students in a way totally different from what happens in lectures, and it rapidly became clear to me that many biologists have their neurons wired up in a way that makes the traditional way of teaching statistics rather difficult for them.

What my students needed was confidence – confidence that statistical ideas and methods were not just theory, but actually worked with real biological data and, above all, had some basis in logic! As the years of teaching went on, I began to realise that the students regularly found the same steps a barrier to progress and a blow to their confidence. Year after year I tried new ways to help them over these ‘crisis points’; eventually I succeeded with all of them, I am told.

The efficacy of my unusual teaching aids can actually be quantified. After the Faculty course taught by professional statisticians, my students were formally examined together with cohorts of students from other departments in the Faculty (then of ‘Agriculture and Food’) who had attended the same third year course in Applied Statistics. My students mostly (perhaps all but three per year out of some 20) appeared in the mark list as a distinct block right at the upper end, with percentage marks in the 70s, 80s and even 90s. Although there may have also been one or two from other courses with high marks, there then tended to be a gap till marks in the lower 60s appeared and began a continuum down to single figures.

I therefore feel confident that this book will be helpful to biologists with its mnemonics such as SqADS and ‘you go along the corridor before you go upstairs’. Other things previously unheard of are the ‘lead line’ and ‘supertotals’ with their ‘subscripts’ – yet all have been appreciated as most helpful by my students over the years. A riffle through the pages will amaze – where are the equations and algebraic symbols? They have to a large extent been replaced by numbers and words. The biologists I taught – and I don't think they were atypical – could work out what to do with a ‘45’, but rarely what to do with an ‘x’. Also, I have found that there are a number of statistical principles students easily forget, and then inevitably run into trouble with their calculations. These concepts are marked with the symbol of a small elephant.

The book limits itself to the basic foundations of parametric statistics, the t‐test, analysis of variance, linear regression and chi‐square. However, the reader is guided as to where there are important extensions of these techniques, and there is an introduction to non‐parametric tests which includes a check list of non‐parametric methods linked to their parametric counterparts. Many chapters end with an ‘executive summary’ as a quick source for revision, and there are additional exercises to give the practice which is so essential to learning.

In order to minimise algebra, the calculations are explained with numerical examples. These, as well as the ‘spare‐time activity’ exercises, have come from many sources, and I regret the origin of many has become lost in the mists of time. Quite a number come from experiments carried out by Horticulture students at Reading as part of their second year outdoor practicals, and others have been totally fabricated in order to ‘work out’ well. Others have had numbers or treatments changed to better fit what was needed. I can only apologise to anyone whose data I have used without due acknowledgement; failure to do so is not intentional. But please remember that data have often been fabricated or massaged – therefore do not rely on the results as scientific evidence for what they appear to show!

Today, computer programmes take most of the sweat out of statistical procedures, and most biologists have access to professional statisticians. ‘Why bother to learn basic statistics?’ is therefore a perfectly fair question, akin to ‘Why keep a dog and bark?’ The question deserves an answer; to save repetition, my answer can be found towards the end of Chapter 1.

I am immensely grateful to the generations of Reading students who have challenged me to overcome their ‘hang‐ups’ and who have therefore contributed substantially to any success this book achieves. Also many postgraduate students as well as experienced visiting overseas scientists have encouraged me to turn my course into book form. My love and special thanks go to my wife Gillian who, with her own experience of biological statistics, has supported and encouraged me in writing this book; it is to her that I owe the imaginative title for the book.

Finally, I should like to thank Ward Cooper of Blackwells for having faith in this biologist, who is less terrified of statistics than he once was.

Helmut F. van Emden

December 2006

1 How to use this book

Chapter features

Introduction

The text of the chapters

What should you do if you run into trouble?

Elephants

The numerical examples in the text

Boxes

Spare‐time activities

Executive summaries

Why go to all that bother?

The bibliography

Introduction

Don't be misled! This book cannot replace effort on your part. All it can aspire to do is to make that effort effective. A detective thriller only succeeds in keeping you mystified because you read it too fast and don't really concentrate – with that approach, you'll find this book just as mysterious.

In fact, you may not get very far if you just read this book at any speed! You will only succeed if you interact with the text, and how you might do this is the topic of most of this chapter.

The text of the chapters

The chapters, particularly 2–8, develop a train of thought essential to the subject of analysing biological data. You just have to take these chapters in order and quite slowly. There is only one way I know for you to maintain the concentration necessary to comprehension, and that is for you to make your own summary notes as you go along.

My Head of Department when I first joined the staff at Reading used to define a university lecture as ‘a technique for transferring information from a piece of paper in front of the lecturer to a piece of paper in front of the student, without passing through the heads of either’. That's why I stress making your own summary notes. You will retain very little by just reading the text; you'll find that after a while you've been thinking about something totally different while ‘reading’ several pages – we've all been there! The message you should take from my Head of Department's quote above is that just repeating in your writing what you are reading is little better than taking no notes at all: the secret is to digest what you have read and reproduce it in your own words and in summary form. Use plenty of headings and subheadings, boxes linked by arrows, cartoon drawings, etc. Another suggestion is to use different coloured pens for different recurring statistics, such as variance and correction factor. In fact, use anything that forces you to convert my text into as different a form as possible from the original; that will force you to concentrate, to involve your brain and to make it clear to you whether or not you have really understood that bit in the book so that it is safe to move on.

The actual process of making the notes is the critical step – you can throw the notes away at a later stage if you wish, though there's no harm in keeping them for a time for revision and reference.

So DON'T MOVE ON until you are ready. You'll only undo the value of previous effort if you persuade yourself that you are ready to move on when in your heart of hearts you know you are fooling yourself!

A key point in the book is Figure 7.5 on p. 64. Take real care to lay an especially good foundation up to there. If you really feel at home with this diagram, it is a sure sign that you have conquered any hang‐ups and are no longer a ‘terrified biologist’.

What should you do if you run into trouble?

The obvious first step is to go back to the point in the book where you last felt confident, and start again from there.

However, it often helps to see how someone else has explained the same topic, so it's a good idea to have a look at the relevant pages of a different statistics text (see Appendix D for some suggestions). You could also look up the topic on the Internet, where many statisticians have put articles and their lectures to students.

A third possibility is to see if someone can explain things to you face to face. Do you know or have access to someone who might be able to help? If you are at university, it could be a fellow student or even one of the staff. The person who tried to teach statistics to my class at university failed completely as far as I was concerned, but later on I found he could explain things to me quite brilliantly in a one‐to‐one situation.

Elephants

At certain points in the text you will find the sign of the elephant.

They say ‘elephants never forget’ and the symbol means just that: NEVER FORGET! I have used it to mark some key statistical concepts which, in my experience, people easily forget and as a result run into trouble later on and find it hard to see where they have gone wrong. So, take it from me that it is really well worth making sure these matters are firmly embedded in your memory.

The numerical examples in the text

As stated in the Preface to the First Edition, I soon learnt that biologists don't like x. For some reason they prefer a real number but are more prepared to accept, say, 45 as representing any number than they are an x! Therefore, in order to avoid ‘algebra’ as far as possible, I have used actual numbers to illustrate the working of statistical analyses and tests. You probably won't gain a lot by keeping up with me on a hand calculator as I describe the different steps of a calculation, but you should make sure at each step that you understand where each number in a calculation has come from and why it has been included in that way.

When you reach the end of each worked analysis or test, however, you should go back to the original source of the data in the book and try to rework on a hand calculator the calculations which follow from just those original data. Try not to look up later stages in the calculations unless you are irrevocably stuck, and then use the executive summary (if there is one at the end of the chapter) rather than the main text.

Boxes

There will be a lot of individual variation among readers of this book in the knowledge and experience of statistics they have gained in the past, and in their ability to grasp and retain statistical concepts. At certain points, therefore, some will be happy to move on without any further explanation from me or any further repetition of calculation procedures.

For those less happy to take things for granted at such points, I have placed the material and calculations they are likely to find helpful in boxes in order not to hold back or irritate the others. Calculations in the boxes may prove particularly helpful if, as suggested above, you are reworking a numerical example from the text and need to refer to a box to find out why you are stuck or perhaps where you went wrong.

Spare‐time activities

These are numerical exercises you should be equipped to complete by the time you reach them at the end of several of the chapters.

That is the time to stop and do them. Unlike the within‐chapter numerical examples, you should feel quite free to use any material in previous chapters or executive summaries to remind you of the procedures involved and guide you through them. Use a hand calculator and remember to write down the results of intermediate calculations. This will make it much easier for you to detect where you went wrong if your answers do not match the solution to that exercise given in Appendix C. Do read the beginning of that appendix early on: it explains that you should not worry or waste time recalculating if your numbers are similar, even if they are not identical. I can assure you, you will recognise – when you compare your figures with the ‘solution’ – if you have followed the statistical steps of the exercise correctly; you will also immediately recognise if you have not.

Doing these exercises conscientiously with a hand calculator or spreadsheet, and when you reach them in the book rather than much later, is really important. They are the best things in the book for impressing the subject into your long‐term memory and for giving you confidence that you understand what you are doing.

The authors of most other statistics books recognise this and also include exercises. If you're willing, I would encourage you to gain more confidence and experience by going on to try the methods as described in this book on their exercises.

By the way, a blank spreadsheet such as Excel makes a grand substitute for a hand calculator, with the added advantage that repeat calculations (e.g. squaring numbers) can be copied and pasted from the first number to all the others.

Executive summaries

Certain chapters end with such a summary, which aims to condense the meat of the chapter into little over a page or so. The summaries provide a condensed reference source for the calculations scattered throughout the previous chapter, with hopefully enough explanatory wording to jog your memory about how the calculations were derived. They will therefore prove useful when you tackle the spare‐time activities.

Why go to all that bother?

You might ask (and some of the reviews of the first edition did): why teach how to do statistical analyses on a hand calculator when we can type the data into a computer program and get all the calculations done automatically? It might have been useful once, but now…?

Well, I can assure you that you wouldn't ask that question if you had examined as many project reports and theses as I have, and seen the consequences of just ‘typing data into a computer program’. No, it does help to avoid trouble if you understand what the computer should be doing.

Why do I say that?

Planning experiments is made much more effective if you understand the advantages and disadvantages of different experimental designs and how they affect the experimental error against which we test our differences between treatments. It probably won't mean much to you now, but you really do need to understand how experimental design as well as treatment and replicate numbers impact the residual degrees of freedom and whether you should be looking at one‐tailed or two‐tailed statistical tables. My advice to my students has always been that, before embarking on an experiment, they should draw up a blank form on which to enter the results, then invent some results and complete the appropriate analysis on them. It can often cause you to think again.

Although the computer can carry out your calculations for you, it has the terminal drawback that it will accept the numbers you type in without challenging you as to whether what you are asking it to do with them is sensible. Thus – and again at this stage you'll have to accept my word that these are critical issues – no window will appear on the screen that says: ‘Whoa! You should be analysing these numbers non‐parametrically’ or ‘No problem. I can do an ordinary factorial analysis of variance, but you seem to have forgotten you actually used a split‐plot design’ or ‘These numbers are clearly pairs; why don't you exploit the advantages of pairing in the t‐test that you've told me to do?’ or ‘I'm surprised you are asking for the statistics for drawing a straight line through the points on this obvious hollow curve.’ I could go on.

You will no doubt use computer programs rather than a hand calculator for your statistical calculations in the future. But the printouts from these programs are often not particularly user‐friendly. They usually assume some knowledge of the internal structure of the analysis the computer has carried out, and abbreviations identify the numbers printed out. So obviously an understanding of what your computer program is doing and familiarity with statistical terminology can only be a help.

A really important value you will gain from this book is confidence that statistical methods are not a ‘black box’ somewhere inside a computer, but that you could in extremis (and with this book at your side) carry out the analyses and tests on the back of an envelope with a hand calculator. Also, once you have become content that the methods covered in this book are concepts you understand, you will probably be happier using the relevant computer programs.

More than that, you will probably be happy to expand the methods you use to ones I have not covered, on the basis that they are likely also to be logical, sensible, and understandable routes to passing satisfactory judgements on biological data. Expansions of the methods included in this book (e.g. those mentioned at the end of Chapter 17) will require you to use numbers produced by the calculations I have covered. You should be able confidently to identify which these are.

You will probably find yourself discussing your proposed experiment and later the appropriate analysis with a professional statistician. It does so help to speak the same language! Additionally, the statistician will be of much more help to you if you are competent to see where the statistical advice you are given has missed a constraint arising from biological realities.

Finally, there is the intellectual satisfaction of mastering a subject which can come hard to biologists. Unfortunately, you won't appreciate it was worth doing until you view the effort from the hindsight of having succeeded. I assure you the reward is real. I can still remember vividly the occasion many years ago when, in the middle of teaching an undergraduate statistics class, I realised how simple the basic idea behind the analysis of variance was, and how this extraordinary simplicity was only obfuscated for a biologist by the short‐cut calculation methods used. In other words, I was in a position to write Chapter 10. Later, the gulf between most biologists and trained statisticians was really brought home to me by one of the latter's comments on an early version of this book: ‘I suggest Chapter 10 should be deleted; it's not the way we do it.’ I rest my case!

The bibliography

Right at the back of this book is a short list of other statistics books. Very many such books have been written, and I only have personal experience of a small selection. Some of these I have found particularly helpful, either to increase my comprehension of statistics (much needed at times!) or to find details of and recipes for more advanced statistical methods. I must emphasise that I have not even seen a majority of the books that have been published and that the ones that have helped me most may not be the ones that would be of most help to you. Omission of a title from my list implies absolutely no criticism of that book, and – if you see it in the library – do look at it carefully: it could well be the best book for you.

2 Introduction

Chapter features

What are statistics?

Notation

Notation for calculating the mean

What are statistics?

Statistics are summaries or collections of numbers. If you say ‘the tallest person among my friends is 173 cm tall’, that is a statistic. It is a number based on a scrutiny of lots of different numbers – the different heights of all your friends, but reporting just the largest number.

If you say ‘the average height of my friends is 158 cm’ – then that is another statistic. It again requires you to have collected the different heights of all your friends, but this time your statistic, the average height, is a summary statistic. It has summarised all the data you collected into a single number.

If you have lots and lots and lots of friends, it may not be practical to measure them all, but you can probably get a good estimate of the average height by measuring not all of them but a large sample, and calculating the average of the sample. Now the average of your sample, particularly if it's not very big, may not be identical to the true average of all your friends. This brings us to a key principle of statistics. We are usually trying to evaluate a parameter (from the Greek for ‘beyond measurement’) by making an estimate from a sample of a practical size. So we must always distinguish parameters and estimates. One example: in statistics we use the word mean for the estimate (from a sample of numbers) of something we can rarely measure; the parameter we call the average (of the entire existing population of numbers).

Notation

‘Add together all the numbers in the sample, and divide by the number of numbers’. That's how we actually calculate a mean, isn't it? So even that very simple statistic takes 15 words to describe as a procedure. Things can get much more complicated (see Box 2.1).

Box 2.1

Multiply every number in the first column by its partner in the second column, and add these products together. Now subtract from this sum the total of the numbers in the first column multiplied by the total of the numbers in the second column, but first divide this product of the totals by the number of pairs of numbers. Now square the answer. Divide this by a divisor obtained as follows: for each column of numbers separately, square and add the numbers and subtract the square of the total of the column after dividing this square by the number of numbers. Then add the results for the two columns together.

We really have to find a shorthand way of expressing statistical computations, and this shorthand is called notation. The off‐putting thing about notation for biologists is that it tends to be algebraic in character. Also there is no universally accepted notation, and the variations between different textbooks are naturally pretty confusing to the beginner!

What is perhaps worse is a purely psychological problem for most biologists: your worry level has perhaps already risen at the very mention of algebra! Confront a biologist with an x instead of a number like 57 and there is a tendency to switch off the receptive centres of the brain altogether. Yet most statistical calculations involve nothing more terrifying than addition, subtraction, multiplication, and division – though I must admit you will also have to square numbers and find square roots. All this can now be done with the cheapest hand calculators.

Most of you now own or have access to a computer, where you only have to type the sampled numbers into a spreadsheet or other program and the machine has all the calculations that have to be done already programmed. So do computers remove the need to understand what their programs are doing? I don't think so! I discussed all this more fully in Chapter 1, but repeat it here in case you have skipped that chapter. Briefly, you need to know what programs are right for what sort of data, and what the limitations are. So an understanding of data analysis will enable you to plan more effective experiments. Remember that the computer will be quite content to process your figures perfectly inappropriately, if that is what you request! It may also be helpful to know how to interpret the final printout… correctly.

Back to the subject of notation. As I have just pointed out, we are going to be involved in quite unsophisticated number‐crunching and the whole point of notation is to remind us of the order in which we do this. Note I say ‘remind us’ and not ‘tell us’. It would be quite dangerous, when you start out, to think that notation enables you to do the calculations without previous experience. You just can't turn to p. 257 of a statistics book and expect to tackle something like:

(∑xy − ∑x∑y/n)² / [(∑x² − (∑x)²/n) + (∑y² − (∑y)²/n)]

without sufficient homework on the notation for it to remind you of what you have previously learnt! Incidentally, Box 2.1 translates this algebraic notation into English for two columns of paired numbers (values of x and y).

As pointed out earlier, the formula above gives you the order of using the three components in the calculation, i.e. you add together the two components under the line and then divide the top component by this sum. But the formula doesn't tell you how to calculate the three components. For this you have to identify the three components in English as: the sum of cross products of x and y (the top of the formula, which is squared) and the sum of squares of x and the sum of squares of y (the two components added together under the line).

These terms will probably mean nothing to you at this stage, but being able to calculate the sum of squares of a set of numbers is just about as common a procedure as calculating the mean.
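If it helps to see Box 2.1 as a recipe rather than as prose, the few lines of Python below are a minimal sketch that follows its instructions step by step. The two columns x and y are invented purely for illustration; only the order of operations comes from Box 2.1.

# A minimal sketch of the calculation described in Box 2.1, with made-up columns.
x = [2.0, 4.0, 6.0, 8.0]
y = [1.0, 3.0, 2.0, 5.0]
n = len(x)  # the number of pairs of numbers

# Top of the formula: sum of cross products, corrected by the product of the
# column totals divided by n, and then squared.
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
top = (sum_xy - sum(x) * sum(y) / n) ** 2

# Bottom of the formula: the corrected sum of squares of each column, added together.
ss_x = sum(xi ** 2 for xi in x) - sum(x) ** 2 / n
ss_y = sum(yi ** 2 for yi in y) - sum(y) ** 2 / n

print(top / (ss_x + ss_y))  # about 4.21 for these made-up numbers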

We frequently have to cope with notation elsewhere in life. Recognise 03.11.92? It's a date, perhaps a ‘date of birth’. The Americans use a different notation; they would write the same birthday as 11.03.92. And do you recognise Cm? You probably do if you are a musician: it's notation for playing together the three notes C, Eb, and G – the chord of C minor (hence Cm).

In this book, the early chapters include notation to help you remember what statistics (i.e. summaries of data) such as sums of squares are and how they are calculated. However, as soon as possible, we will be using keywords such as sums of squares to replace blocks of algebraic notation. This should make the pages less frightening and make the text flow better. After all, you can always go back to an earlier chapter if you need reminding of the notation and the calculation it represents. I guess it's a bit like cooking! The first half‐dozen times you want to make pancakes you need the cookbook to provide the information that 300 ml milk goes with 125 g plain flour, one egg, and some salt, but the day soon comes when you merely think the word batter!

Notation for calculating the mean

No one is hopefully going to baulk at the challenge of calculating the mean height of a sample of five people, say 149, 176, 152, 180, and 146 cm, by totalling the numbers and dividing by 5.

In statistical notation, the instruction to total is ∑, and the whole series of numbers to total is called by a letter, often x or y. So ∑x means ‘add together all the numbers in the series called x’, the five heights of people in our example. We use n for the ‘number of numbers’, 5 in the example, making the full notation for a mean:

∑x/n

However, we use the mean so often that it has another even shorter notation: the identifying letter for the series (e.g. x) with a line over the top, i.e. x̄.
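Purely as an illustration of that notation, here is the same calculation as a minimal Python sketch, using the five heights above; sum(x) plays the part of ∑x and len(x) the part of n.

x = [149, 176, 152, 180, 146]   # the five heights (cm) from the example
mean = sum(x) / len(x)          # ∑x divided by n
print(mean)                     # 160.6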

3 Summarising variation

Chapter features

Introduction

Different summaries of variation

Why n − 1?

Why are the deviations squared?

The standard deviation

The next chapter

Introduction

Life would be great if the mean were an adequate summary statistic of a series of numbers. Unfortunately, it is not that useful! Imagine you are frequently buying matchboxes – you may have noticed they are labelled something like ‘Average contents 48 matches’ (I always wonder, why not 50?). You may have bought six boxes of Matchless Matches to verify their claim, and found it unassailable at contents of 48, 49, 49, 47, 48, and 47. The competing product, Mighty Matches also claims ‘Average contents 48 matches’. Again the claim cannot be contested: the six boxes you buy contain respectively 12, 62, 3, 50, 93, and 68. Would you risk buying a box of Mighty Matches, when it might contain only three matches? The mean gives no idea at all of how frequently numbers close to it are going to be encountered. We need to know about how variable the numbers are to make any sense of the mean value.

The example of human heights I use in Chapter 2 straightaway introduces the inevitability of variation as soon as we become involved in biological measurements. Just as people vary in height, so lettuces in the same field will vary in weight, there will be different numbers of blackfly on adjacent broad bean plants, our dahlias will not all flower on the same day, a ‘handful’ of lawn fertiliser will be only a very rough standard measure of volume, and eggs, even from the same farm, will not all be the same size or weight. So how to deal with variation is a vital ‘tool of the trade’ for any biologist.

Now there are several ways we might summarise variation of a set of numbers, and we'll go through them one by one, using and explaining the relevant notation. Alongside this in textboxes (which you can skip if you don't find them helpful) we'll do the calculations on the two samples above from the different matchbox sources – where the means (x̄) for Matchless and Mighty are both 48, but with very different variation.

Different summaries of variation

Range

Matchless had the range 47–49, contrasting with the much wider range of 3–93 for Mighty. Although the range clearly does distinguish these two series of matchboxes with the same mean number of matches, we have only used two of the six available numbers in each case. Was the 3 a fluke? Or would it really turn up about one in six times? We really could do with a measure of variation which uses information from all the numbers we have collected.

Total deviation

To make the best use of the numbers in a sample, we really need a measure of variation which includes all the numbers we have (as does the mean). However, just adding the six numbers in each series will give the identical answer of 288 (6 × mean of 48).

The clue to getting somewhere is to realise that if all numbers in one of our matchbox series were identical, there would be no variation whatever, and they would all be the mean, 48. So if the numbers are not the same, but vary, each number's contribution to total variation is its deviation (difference) from the mean. So we could add all the differences of the numbers from the mean (see Box 3.1 for the calculations that apply for our matchbox example). In notation this is ∑(x − x̄), ignoring whether the difference is + or −. The total deviation is only 4 for Matchless compared with 162 for Mighty.

Box 3.1

For Matchless, ∑(x − x̄) = 0 + 1 + 1 − 1 + 0 − 1 = 0, or 4 if we ignore signs.

For Mighty, ∑(x − x̄) = −36 + 14 − 45 + 2 + 45 + 20 = 0, but 162 if we ignore signs.

If we ignore signs, then Matchless has a ∑(x − x̄) (called the sum of deviations = differences from the mean) of 4, hugely lower than the ∑(x − x̄) of 162 for the obviously more variable Mighty.
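For readers who find a few lines of code a useful cross-check, the arithmetic of Box 3.1 can be sketched in Python as follows (the six counts per brand are the ones given earlier in the chapter):

matchless = [48, 49, 49, 47, 48, 47]
mighty = [12, 62, 3, 50, 93, 68]

def total_deviation(sample):
    mean = sum(sample) / len(sample)                # both means are 48
    signed = sum(x - mean for x in sample)          # always 0: + and − deviations cancel
    absolute = sum(abs(x - mean) for x in sample)   # deviations ignoring sign
    return signed, absolute

print(total_deviation(matchless))  # (0.0, 4.0)
print(total_deviation(mighty))     # (0.0, 162.0)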

This looks good. However, there is a major snag! Even if total deviation were a good way of summarising the variation of a set of numbers, the contrast between Matchless and Mighty in Box 3.1 depends on the fact that both samples are of equal size (n = 6). Total deviation will just grow and grow as we include more numbers; it is therefore as much a reflection of sample size as of the variation in the sample. It would be better if our measure of variation were independent of sample size, in the same way as the mean is. We can compare means (e.g. the mean heights of men and women) even if we have measured different numbers of each.

Mean deviation

The amount of variation per number in a set of numbers:

∑(x − x̄)/n (ignoring the signs of the deviations)

is the obvious way round the problem. The mean (average) deviation will stay much the same regardless of the size of the sample. The calculations in Box 3.2 give us the small mean deviation per number of 0.67 for Matchless and the much larger 27 for Mighty.

There's not a lot to criticise about mean deviation as a measure of variability. However, for reasons which come later in this chapter, the standard measure of variation used in statistics is not mean deviation. Nonetheless, the concept of mean deviation brings us very close to what is that standard measure of variation: the variance.

Box 3.2

Mean deviation for the matchbox example:

For Matchless, mean deviation is total variation/6 = 4/6 = 0.67

For the more variable Mighty, mean deviation is 162/6 = 27
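Continuing the sketch above, Box 3.2 is simply those absolute totals divided by the number of numbers:

print(4 / 6)    # 0.666..., the 0.67 quoted for Matchless
print(162 / 6)  # 27.0 for Mighty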

Variance

Variance is very clearly related to mean deviation, which I'll remind you was:

total deviation divided by the number of numbers – in words – or ∑(x − x̄)/n in notation.

Variance is nearly the same, but with two important changes in bold capitals: the deviations are SQUARED before being added together, and the divisor is ONE LESS than the number of numbers, giving ∑(x − x̄)²/(n − 1).

Variance is therefore the mean (using n − 1) of the total SQUARED deviation, and the calculations for our matchbox example are given in Box 3.3. The variance of only 0.8 for Matchless contrasts with the much larger 1189.2 for the more variable Mighty boxes.
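Again purely as an illustrative sketch (the reasons for squaring and for the divisor n − 1 are discussed in the sections that follow), the variance of each matchbox sample can be computed in Python as:

def variance(sample):
    mean = sum(sample) / len(sample)
    sum_of_squared_deviations = sum((x - mean) ** 2 for x in sample)
    return sum_of_squared_deviations / (len(sample) - 1)   # divisor is n − 1, not n

print(variance([48, 49, 49, 47, 48, 47]))  # 0.8 for Matchless
print(variance([12, 62, 3, 50, 93, 68]))   # 1189.2 for Mighty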

Two pieces of jargon which will from now on crop up over and over again are the terms given to the top and bottom of the variance formula. The bottom part (n − 1) is known as the degrees of freedom (d.f.). The top part, ∑(x − x̄)², which involves adding together the squared deviations, should be called the sum of squares of deviations, but unfortunately this is nearly always contracted to sum of squares. Why is this unfortunate? Well, I'll come to that at the end of this chapter but, believe me now, it is essential for you to remember that