Peter Goos, Department of Statistics, University of Leuven, Faculty of Bio-Science Engineering, and University of Antwerp, Faculty of Applied Economics, Belgium
David Meintrup, Department of Mathematics and Statistics, University of Applied Sciences Ingolstadt, Faculty of Mechanical Engineering, Germany

A thorough presentation of introductory statistics and probability theory, with numerous examples and applications using JMP.

Statistics with JMP: Graphs, Descriptive Statistics and Probability provides an accessible and thorough overview of the most important descriptive statistics for nominal, ordinal, and quantitative data, with particular attention to graphical representations. The authors distinguish their approach from that of many modern textbooks on descriptive statistics and probability theory by offering a combination of theoretical and mathematical depth with clear and detailed explanations of concepts. Throughout the book, the user-friendly, interactive statistical software package JMP is used for calculations, the computation of probabilities, and the creation of figures. The examples are explained in detail and accompanied by step-by-step instructions and screenshots, so that the reader develops an understanding of both the statistical theory and its applications.

Traditional graphs such as needle charts, histograms, and pie charts are included, as well as the more modern mosaic plots, bubble plots, and heat maps. The authors discuss probability theory, particularly discrete probability distributions and continuous probability densities, including the binomial and Poisson distributions and the exponential, normal, and lognormal densities, and use numerous examples throughout to illustrate these distributions and densities.

Key features:

* Introduces each concept with practical examples and demonstrations in JMP.
* Provides the statistical theory, including detailed mathematical derivations.
* Presents illustrative examples in each chapter, accompanied by step-by-step instructions and screenshots, to help the reader develop an understanding of both the statistical theory and its applications.
* A supporting website with data sets and other teaching materials.

This book is aimed equally at students in engineering, economics, and the natural sciences who take classes in statistics, and at master's or advanced students in applied statistics and probability theory. For teachers of applied statistics, it provides a rich resource of course material, examples, and applications.
Title Page
Copyright
Dedication
Preface
Software
Data files
Acknowledgments
Chapter 1: What is statistics?
1.1 Why statistics?
1.2 Definition of statistics
1.3 Examples
1.4 The subject of statistics
1.5 Probability
1.6 Software
Chapter 2: Data and its representation
2.1 Types of data and measurement scales
2.2 The data matrix
2.3 Representing univariate qualitative variables
2.4 Representing univariate quantitative variables
2.5 Representing bivariate data
2.6 Representing time series
2.7 The use of maps
2.8 More graphical capabilities
Chapter 3: Descriptive statistics of sample data
3.1 Measures of central tendency or location
3.2 Measures of relative location
3.3 Measures of variation or spread
3.4 Measures of skewness
3.5 Kurtosis
3.6 Transformation and standardization of data
3.7 Box plots
3.8 Variability charts
3.9 Bivariate data
3.10 Complementarity of statistics and graphics
3.11 Descriptive statistics using JMP
Chapter 4: Probability
4.1 Random experiments
4.2 Definition of probability
4.3 Calculation rules
4.4 Conditional probability
4.5 Independent and dependent events
4.6 Total probability and Bayes' rule
4.7 Simulating random experiments
Chapter 5: Additional aspects of probability theory
5.1 Combinatorics
5.2 Number of possible orders
5.3 Applications of probability theory
Chapter 6: Univariate random variables
6.1 Random variables and distribution functions
6.2 Discrete random variables and probability distributions
6.3 Continuous random variables and probability densities
6.4 Functions of random variables
6.5 Families of probability distributions and probability densities
6.6 Simulation of random variables
Chapter 7: Statistics of populations and processes
7.1 Expected value of a random variable
7.2 Expected value of a function of a random variable
7.3 Special cases
7.4 Variance and standard deviation of a random variable
7.5 Other statistics
7.6 Moment generating functions
Chapter 8: Important discrete probability distributions
8.1 The uniform distribution
8.2 The Bernoulli distribution
8.3 The binomial distribution
8.4 The hypergeometric distribution
8.5 The Poisson distribution
8.6 The geometric distribution
8.7 The negative binomial distribution
8.8 Probability distributions in JMP
8.9 The simulation of discrete random variables with JMP
Chapter 9: Important continuous probability densities
9.1 The continuous uniform density
9.2 The exponential density
9.3 The gamma density
9.4 The Weibull density
9.5 The beta density
9.6 Other densities
9.7 Graphical representations and probability calculations in JMP
9.8 Simulating continuous random variables in JMP
Chapter 10: The normal distribution
10.1 The normal density
10.2 Calculation of probabilities for normally distributed variables
10.3 Lognormal probability density
Chapter 11: Multivariate random variables
11.1 Introductory notions
11.2 Joint (discrete) probability distributions
11.3 Marginal or unconditional (discrete) probability distribution
11.4 Conditional (discrete) probability distribution
11.5 Examples of discrete bivariate random variables
11.6 The multinomial probability distribution
11.7 Joint (continuous) probability density
11.8 Marginal or unconditional (continuous) probability density
11.9 Conditional (continuous) probability density
Chapter 12: Functions of several random variables
12.1 Functions of several random variables
12.2 Expected value of functions of several random variables
12.3 Conditional expected values
12.4 Probability distributions of functions of random variables
12.5 Functions of independent Poisson, normally, and lognormally distributed random variables
Chapter 13: Covariance, correlation, and variance of linear functions
13.1 Covariance and correlation
13.2 Variance of linear functions of two random variables
13.3 Variance of linear functions of several random variables
13.4 Variance of linear functions of independent random variables
13.5 Linear functions of normally distributed random variables
13.6 Bivariate and multivariate normal density
Chapter 14: The central limit theorem
14.1 Probability density of the sample mean from a normally distributed population
14.2 Probability distribution and density of the sample mean from a non-normally distributed population
14.3 Applications
14.4 Normal approximation of the binomial distribution
Appendix A: The Greek alphabet
Appendix B: Binomial distribution
Appendix C: Poisson distribution
Appendix D: Exponential distribution
Appendix E: Standard normal distribution
Index
End User License Agreement
Peter Goos
University of Leuven and University of Antwerp, Belgium
David Meintrup
University of Applied Sciences Ingolstadt, Germany
This edition first published 2015
© 2015 John Wiley & Sons, Ltd
Registered office
John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, United Kingdom
For details of our global editorial offices, for customer services and for information about how to apply for permission to reuse the copyright material in this book please see our website at www.wiley.com.
The right of the author to be identified as the author of this work has been asserted in accordance with the Copyright, Designs and Patents Act 1988.
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by the UK Copyright, Designs and Patents Act 1988, without the prior permission of the publisher.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.
Designations used by companies to distinguish their products are often claimed as trademarks. All brand names and product names used in this book are trade names, service marks, trademarks or registered trademarks of their respective owners. The publisher is not associated with any product or vendor mentioned in this book.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. It is sold on the understanding that the publisher is not engaged in rendering professional services and neither the publisher nor the author shall be liable for damages arising herefrom. If professional advice or other expert assistance is required, the services of a competent professional should be sought.
Library of Congress Cataloging-in-Publication Data applied for.
A catalogue record for this book is available from the British Library.
ISBN: 9781119035701
To Marijke, Bas, Loes and Mien
To Béatrice and Werner
This book is the result of a thorough revision of the lecture notes “Descriptive Statistics and Probability” that were developed by Peter Goos for the course “Statistics for business and economics 1” at the Faculty of Applied Economics of the University of Antwerp in Belgium. Encouraged by the success of the Dutch version of this book (entitled Beschrijvende Statistiek en Kansrekenen, published in 2013 by Acco Leuven/Den Haag), we joined forces to create an English version. The book provides a detailed treatment of basic probability theory, descriptive statistics, and graphical representations of data. We pay equal attention to mathematical aspects, the interpretation of all the statistical concepts that are introduced, and their practical application. In order to facilitate the understanding of the methods and to appreciate their usefulness, the book contains many examples involving real-life data. To demonstrate the broad applicability of statistics and probability, these examples have been taken from various fields of application, including business, economics, sport, engineering, and natural sciences.
We had two motivations in writing this book. First, we wanted to provide students and teachers with a resource that goes beyond other textbooks of similar scope in its technical and mathematical content. It has become increasingly fashionable for authors and statistics teachers to sweep technicalities and mathematical derivations under the carpet. We decided against this, because we feel that students should be encouraged to apply their mathematical knowledge and that doing so deepens their understanding of statistical methods. Reading this book requires some knowledge of mathematics, including the use of derivatives, integrals, and some matrix algebra. In most countries, these topics are taught in secondary or high school. Moreover, these topics are often revisited in introductory mathematics courses at university. Therefore, we are convinced that many university students have a sufficiently strong mathematical background to appreciate and benefit from the more thorough nature of this book. A particular strength is that all mathematical derivations are shown in detail. We included all intermediate steps, even those that might be trivial for mathematicians. We hope that this keeps the book readable for less mathematically gifted readers and also shows that the mathematical derivations are actually not as difficult as these readers might first imagine.
Our second motivation was to ensure that the concepts introduced in the book can be successfully put into practice. To this end, we show how to generate graphs, calculate descriptive statistics and compute probabilities using the statistical software package JMP (pronounced “jump”). We chose JMP as supporting software because it is powerful yet easy to use, and suitable for a wide range of statistically oriented courses (including descriptive statistics, hypothesis testing, regression, analysis of variance, design of experiments, reliability, multivariate methods, and statistical and predictive modeling). We believe that introductory courses in statistics and probability should use such software so that the enthusiasm of students is not nipped in the bud. Indeed, we find that, because of the way students can easily interact with JMP, it can actually spark enthusiasm for statistics and probability in class.
In summary, our approach to teaching descriptive statistics and probability theory combines theoretical and mathematical depth, detailed and clear explanations, numerous practical examples, and the use of a user-friendly yet very powerful statistical package. Our companion book Statistics with JMP: Hypothesis Tests, ANOVA and Regression (based on Verklarende Statistiek: Schatten en Toetsen, Acco Leuven/Den Haag, 2014) follows the same philosophy.
As mentioned, we use JMP as enabling software. With the purchase of a hard copy of this book, you receive a one-year license for JMP's Student Edition. The license period starts when you activate your copy of the software using the code included with this book (to receive an access code, please visit www.wiley.com/go/statsjmpgdsp). To download JMP's Student Edition, visit http://www.jmp.com/wiley. For students accessing a digital version of the book, your lecturer may contact Wiley in order to procure unique codes with which to download the free software. For more information about JMP, go to http://www.jmp.com. JMP is available for Windows and Mac operating systems. This book is based on JMP version 12 for Windows.
In our examples, we do not assume any familiarity with JMP: the step-by-step instructions are detailed and accompanied by screenshots. For more explanations and descriptions, www.jmp.com offers a substantial amount of free material, including many video demonstrations. In addition, there is a JMP Academic User Community where you can access content, discuss questions and collaborate with other JMP users worldwide: instructors can share teaching resources and best practices, students can ask questions, and everyone can access the latest resources provided by the JMP Academic Team. To join the community, go to http://community.jmp.com/academic.
Throughout the book, various data sets are used. We strongly encourage everybody who wants to learn statistics to actively try things out using data. JMP files containing the data sets, as well as JMP scripts to reproduce figures, tables, and analyses, can be downloaded from the publisher's companion website for this book:
http://www.wiley.com/go/goosandmeintrup/JMP
There, we also provide some additional supporting files to generate maps, or visualize probability distributions and densities. For instructors who would like to use the book in their courses, there are slides available that cover the material presented. The information on how to access these teaching resources can also be found on the companion website.
Peter Goos ([email protected])
David Meintrup ([email protected])
We would like to thank numerous people who have made the publication of this book possible. The first author, Peter Goos, is very grateful to Professor Willy Gochet from the University of Leuven, who introduced him to the topics of statistics and probability. Professor Gochet allowed Peter to use his lecture notes as a backbone for his own course material, which later developed into this book. The second author, David Meintrup, would like to thank Antonio Sáez for providing a perfect working environment during his sabbatical at the University of Jaén, Spain.
The authors are very grateful for the support and advice offered by several people from the JMP Division of SAS: Brady Brady, Ian Cox, Bradley Jones, Volker Kraft, John Sall, and Mia Stephens. It is Volker who brought the two authors together and encouraged them to work on a series of English books on statistics with JMP (the second book is entitled Statistics with JMP: Hypothesis Tests, ANOVA and Regression). A very special thank you goes to Ian, whose suggestions substantially improved this book. The authors would also like to thank Leonids Aleksandrovs, Kris Annaert, Stefan Becuwe, Filip De Baerdemaeker, Roselinde Kessels, Ida Ruts, Bagus Sartono, Evelien Stoffels, Anja Struyf, Utami Syafitri, Peter Thijssen, Anil Haydar Topal, Katrien Van Driessen, Ellen Vandervieren, Kristel Van Rompaey, Diane Verbiest, Sara Weyns, and Simone Willis for their detailed comments and constructive suggestions, and technical assistance in creating figures.
Finally, we thank Debbie Jupe, Heather Kay, Sangeetha Parthasarathy and Prachi Sinha Sahay at Wiley.
The world is ready for the truth; the modern age is here; every year another report appears that examines poverty by means of statistical research rather than romantic claptrap.
(from The Crimson Petal and the White, Michael Faber, p. 334)
In this introductory chapter, we give a general description of the topics of statistics and probability theory. Some examples illustrate the purpose and applications of both disciplines, as well as the differences between them. As statistics has more applications in science, industry, and economics than probability theory, statistics is typically given far more attention in degree subjects like business, industrial and bio-science engineering, applied economics, and the natural or social sciences. Nevertheless, one should pay some attention to probability theory as well. In fact, the two disciplines are strongly connected: it is impossible to understand the workings of statistical inference without a sound knowledge of probability theory. Therefore, in this book, we discuss both probability theory and statistics.
For many years, statistics has been a subject, often a dreaded one, in several fields of study at universities and colleges. The reason is that quite a few people will, sooner or later, be confronted with problems of data analysis during their professional activities. A sound statistical background not only allows us to analyze the data and to make concrete decisions based on the analysis, but it also provides an advantage in the data collection process.
Nevertheless, statistics is not immediately perceived as useful by most students. This is mainly due to the fact that, during a statistics course, they are still unfamiliar with the sorts of practical decision problems that managers, economists, engineers, and researchers face on a daily basis. Many students only start to realize the usefulness of statistics when they work on their bachelor's or master's thesis. The many examples in this basic course are intended to advance this awareness by several years.
In an introductory statistics course, one often finds a whole series of quotes intended to motivate students. A classic example, from the British writer Herbert George Wells (1866–1946), is: "Statistical thinking will one day be as necessary for efficient citizenship as the ability to read and write." More recent is the judgment of the US quality guru W. Edwards Deming, to whom a large part of the downright spectacular economic recovery of Japan after World War II is attributed. He claimed that "Statistics is too important to be left to statisticians. The goal is to have many statistically-skilled workers: engineers, scientists, managers..." Hal Varian, chief economist at Google, says: "I keep saying that the most sexy job in the next 10 years will be statistician. And I'm not kidding." In Europe, Willy Buysse, former CEO of SN Brussels Airlines, states that too few decisions are made based on data. His many years of diligence in establishing a research department, where statistical and other quantitative methods are used to address all sorts of problems, have only recently been rewarded.
Another justification for a thorough training in statistical methods can be found in the so-called Six Sigma improvement program. The purpose of this program is to solve concrete problems with a large financial impact both in service and industrial companies, and to reduce the number of faults and defects to 3.4 per million opportunities. The approach is based on statistical methods, as presented in Figure 1.1. The figure shows that the traditional method to solve a practical problem is to immediately search for practical solutions. This approach is typically based on guessing and trial-and-error, so that it will often take a long time to find a final solution to the problem. The Six Sigma improvement program promotes a more thoughtful, scientific approach to problems. First, data is collected in the so-called measurement phase. Then, using statistical methods, the data is carefully examined. This often leads to interesting insights and recommendations to improve existing products, services, or processes. The Six Sigma approach also relies on the use of statistical process control and statistically designed experiments. Hence, statistics helps to find the best possible solution for all kinds of practical problems.
Figure 1.1 Using statistical methods to solve problems.
To achieve a successful cooperation between practitioners, on one hand, and statisticians, on the other, some openness is required on both sides. Engineers, economists, or scientists need a solid knowledge of the basic principles and techniques of statistics. Statistics is thus an indispensable skill in the repertoire of an effective employee. This explains why statistics is taught not only in the first and second years of many bachelor's degrees in engineering, sciences, and economics, but also later, for example in master's programs.
Finally, a thorough training in statistics is also a prerequisite for students of political and social sciences. They will also be confronted with numerous data sets in their professional careers that are impossible to interpret without a statistical background. For them, statistics is a stepping stone to econometric research methods.
The word statistics may sound familiar to anyone. A statistic usually refers to numerical information, for example, information about
the population of a country: birth and death rates, immigration and emigration, … (such statistics are called population statistics),
the economy: employment and unemployment rates, investments, prices, gross national product (GNP), … (these statistics are called economic statistics), or
a company or sector: sales figures, income statements, growth, acquisitions, layoffs, … (these figures are called business statistics).
More formally, statistics can be defined as the set of methodologies for collecting, representing, analyzing, and interpreting data. This shows that the statistical science is a very general auxiliary science, which plays an important role in almost any environment. Applications of statistics are countless in engineering, medicine, economics, natural sciences, and business management, but statistics is also used in literature, history, political science, criminology, and even musicology.
In our modern society, data is massively present:
computer files in companies contain sales data, cost data, and customer data (such as addresses, ordered quantities, and order frequencies),
the financial pages of newspapers contain stock prices, commodity prices, and exchange rates,
federal and regional authorities regularly publish data on population, trade, and industry, and
the Internet is a source of numerous data sets.
Companies collect data naturally and actively. Among other things, this takes place by carrying out experiments (e.g., to design new products), in the context of statistical process control, or by measuring all kinds of properties of products, services, and processes. By continuously analyzing data, quality departments of companies attempt to deliver products or services with as few defects as possible and with the highest reliability. In addition, business processes are organized in such a way that waste is minimized, inspections of finished products are reduced to the minimum, and customer requirements are satisfied with minimal costs.
Research agencies collect data via surveys by phone, by post, via the Internet or by street interviews. Such surveys are designed to gather information about the shopping behavior of consumers, about the voting behavior of the population, or public opinion on social issues.
Statistics allows us to turn data into usable information. The role that statistics plays herein may be best illustrated based on some examples.
An airline conducted a study on the behavior of its passengers on intercontinental flights and recorded
the number of passengers with reservations that do not show up (the so-called no-shows),
the weight of the luggage of passengers (often there is a limit of 20 kilograms), and
the time the passengers arrive before the official departure time of the flight (for intercontinental flights, the passengers are asked to be at the airport at least two hours prior to departure).
The company recorded this data over several months and then made a distinction between passengers in economy class and passengers in business class. The data is analyzed with the aim of instituting appropriate policies. An example may be to allow overbooking, that is, to take more reservations than there are seats on the plane, or to apply more stringent action against passengers carrying too much luggage.
In the production of coffee, the humidity during production is of crucial importance for the quality of the final product. The humidity is kept under control by a system that does not work flawlessly. Therefore, several measurements of the humidity are taken daily to determine whether it remains within appropriate limits. This approach is referred to as statistical process control.
A filling machine for bottles usually has several filling heads, so that many bottles can be filled in parallel. In such a filling process, operators typically weigh a certain number of bottles every hour, to verify that each filling head delivers the desired amount of liquid into the bottles. Another interesting question in this context is whether differences occur between measurements that have been carried out by different operators.
Thanks to loyalty cards, supermarkets collect massive data sets. Data that is typically recorded includes
the amount spent per visit at the store, maybe broken down into categories (food, clothing,…),
the number of items sold,
the payment method (cash, debit card, credit card, or voucher).
Researchers use statistical methods to summarize this huge amount of information and to present it in a way suitable for decision making. Supermarkets exploit this information to send out personalized promotional materials.
Financial analysts are interested in the degree of risk of investing in a particular stock. To this end, they keep track of the monthly return rates of stocks over many years, taking into account not only price changes but also dividends. Moreover, monthly return rates of the overall market, for example the Euro Stoxx 50 index, are tracked. If the return rate of a stock rises or falls to a larger extent than the market, the share is called risky. In the opposite case, one speaks of a share with little risk. Using statistical methods, one can investigate the relation between the return rate of the stock and that of the overall market.
In each of the examples in the previous section, the interest is in one or more questions concerning a population of objects or elements, or concerning a process that generates objects or elements.
The data of the population or process is obtained by recording one or more properties or characteristics of their elements. These properties or characteristics are called variables. The name indicates that the value of the property varies from element to element. Therefore, statistics is sometimes referred to as the study of variability.
Usually, it is impossible to include all elements of a population or process in a study. Therefore, one works with a subset of the elements: the sample. It is not always easy to collect sample data in a correct way, so, in any statistical survey, one should pay a lot of attention to the data collection process. In this context, the abbreviation GIGO¹ is often used. It stands for garbage in, garbage out and refers to the fact that even the best statistical methods can extract only little reliable information from data of poor quality.
For a study of the electoral behavior in European elections, the population can be described easily: all citizens of Europe who are entitled to vote. Variables that could be registered in this context are gender, occupation, political beliefs, age, and so on.
Tossing a die is a process that generates data. A possible sample involves throwing the die 50 times. Variables that could be registered are the number of dots or whether or not the number of dots is even.
In Examples 1.3.2 and 1.3.3, we can consider all times at which the production process is in operation to be the population. At a limited or finite number of points in time, measurements or observations are made, for example, measurements of the humidity (Example 1.3.2) or weight (Example 1.3.3). All measurements together form the sample. For the financial analyst in Example 1.3.5, the sample is formed by a finite set of return rates and market indices. In Example 1.3.4, the population of interest for the researcher is the set of all customers of the supermarket. One possible sample consists of all customers that have visited the store during one month and that made use of their loyalty card.
The data collected in a sample can be represented in many ways using tables and graphs. In addition, one can calculate characteristic values or statistics, such as the mean, to generate a clear idea of the collected data. The different ways of presenting sample data are summarized under the term descriptive statistics. This topic is covered in Chapters 2 and 3.
In many cases, describing the sample data is only a first step in an investigation. A second phase involves analyzing and interpreting the sample. Analysis and interpretation is necessary in order to find answers to questions about the population or process that were set in advance, to test hypotheses, or to assess the quality of a proposed statistical model. The answers and conclusions obtained from the statistical analysis are generalized to the population or the process. This generalization is called inference, which explains the term inferential statistics.
The generalization of conclusions from sample data to an entire population or to a process immediately discloses the weakness of statistics: based on sample data, one can never make statements with certainty about the population or process in question. These statements may be considered reliable if statistically valid methods were used for the collection of the sample data. The degree of confidence in a particular statement is expressed by means of a probability, so that a basic knowledge of probability theory is required to be able to understand and apply statistical methods.
The words chance and probability sound even more familiar than the term statistics. Intuitively, everyone has a good idea of the meaning of a probability of 1/4 when participating in a gambling game. Such a probability can be used by virtually everyone to decide whether or not to participate in the game. However, the calculation of such a probability can already raise difficulties.
Probability theory studies processes or experiments in which the outcome is uncertain. Here, the terms process and experiment should be interpreted in their broadest sense. Examples are throwing a die, the price of a share when the stock exchange closes, a mortgage interest rate, the demand for laptop computers of a particular brand, the percentage of defective products in a production line during a certain period, the number of visitors to a website, or drawing a winner from all the participants in a lottery.
The difference between probability and statistics is that, in probability theory, populations and processes are studied directly, while statistics does this through sample data. Probability theory always starts with a set of assumptions about the population or the process. Some examples will illustrate this.
If the process is throwing a die, then, with the help of probability theory, we can try to figure out the probability of obtaining a six 20 or more times when we toss the die 100 times. This calculation is only possible if we make an important assumption about the die used: the die is fair, or, in other words, the die is completely homogeneous and symmetrical, so that it is equally likely to obtain a one as it is to obtain a two, a three, a four, a five, or a six.
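As a side note, this tail probability can be computed exactly once the binomial distribution has been introduced (Chapter 8). A minimal sketch in Python, assuming a fair die, so that the number of sixes in 100 throws follows a binomial distribution with n = 100 and p = 1/6:

    from math import comb

    n, p = 100, 1 / 6  # 100 throws; probability of a six on a fair die

    # P(X >= 20) for X ~ Binomial(n, p): sum the binomial probabilities exactly
    prob = sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(20, n + 1))
    print(f"P(at least 20 sixes in 100 throws) = {prob:.4f}")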
A statistical question about the die could be to investigate the fairness of the die. The die may be thrown a (large) number of times to collect the required sample data. Based on these data, one can draw a statistical conclusion about the hypothesis that the die is fair.
In an industrial filling process, one can calculate, based on some assumptions concerning the settings and the accuracy of the filling machine, the probability that a bottle will not be full enough. Another possibility is to calculate the probability that, in a lot with 1,000 bottles, at most 5% of the bottles will not have been filled enough.
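To make this concrete, suppose each bottle is underfilled independently with some fixed probability; the value 0.04 below is purely hypothetical, since the example gives no such number. The lot probability can then be computed from the binomial distribution, as in this Python sketch:

    from math import comb

    n = 1000      # bottles in the lot (from the example)
    p = 0.04      # hypothetical probability that a single bottle is underfilled
    limit = 50    # "at most 5% of 1,000 bottles" means at most 50 bottles

    # P(X <= 50) for X ~ Binomial(1000, p): exact sum of the binomial probabilities
    prob = sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(limit + 1))
    print(f"P(at most 5% of the bottles underfilled) = {prob:.4f}")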
A statistical analysis of the same filling process typically may involve regular weighings of a number of bottles (the sample), in order to verify whether the average content of the bottles is too large or too little, and whether or not the content of the bottles varies too much.
Using probability theory, one could study the electoral behavior of the European population assuming that 30% will vote for party A, 25% for party B, 20% for party C, and 25% for smaller parties. Probability theory can then calculate that, for every 500 voters, on average 150 will opt for party A, 125 for Party B, 100 for Party C, and 125 will choose other parties.
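The expected counts follow directly from multiplying the assumed proportions by the 500 voters. A short simulation sketch in Python (assuming exactly the proportions stated above; the seed is arbitrary) shows how realized counts fluctuate around these expected values:

    import random
    from collections import Counter

    random.seed(1)  # fix the seed so the sketch is reproducible
    parties = ["A", "B", "C", "other"]
    shares = [0.30, 0.25, 0.20, 0.25]  # assumed population proportions

    # Expected counts among 500 voters: 150, 125, 100, and 125
    print([500 * s for s in shares])

    # One simulated sample of 500 voters drawn according to these proportions
    votes = Counter(random.choices(parties, weights=shares, k=500))
    print(votes)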
Statistics, however, will make a statistical prediction based on a sample of, for example, 2,000 voters. This prediction can also be given with a margin of error.
It is important to realize that statistics works with a limited amount of information obtained from a sample. Therefore, statements about populations and processes can be false. This is the weakness of statistics. Ideally, the probabilities of error are small. The probability of error can be reduced by collecting a lot of high-quality data in a sensible manner.
Probability theory also has a weakness: the assumptions about the studied process or population may be wrong, so that its conclusions are invalid.
In probability and statistics, a lot of calculations are needed. It is important to create summary tables of all the data in a sample, or to represent the data graphically. This makes the use of a computer and of specialized statistical software necessary. As mentioned in the Preface, in this book, we use the statistical software package JMP®.
1. This abbreviation is a parody of the abbreviations FIFO (first in, first out) and LIFO (last in, first out), which are used in accounting for booking items in stock.
A microphone in the sidewalk would provide an eavesdropper with a cacophony of clocks, seemingly random like the noise from a Geiger counter. But the right kind of person could abstract signal from noise and count the pedestrians, provide a male/female breakdown and a leg-length histogram …
(from Cryptonomicon, Neal Stephenson, p. 147)
Data is a set of measurements of one or more characteristics or variables of some elements of a population, or of a number of objects generated by a process. Different types of variables can be measured.
Variables are classified according to the measurement scale on which they are measured. Categorical or qualitative variables are measured on a nominal scale or on an ordinal scale. Quantitative variables are either measured on an interval scale or on a ratio scale.
Elements of a sample or a population can be classified using a nominal variable: the value of the variable places an element in a certain class or category. Examples of such variables are
gender (male/female),
nationality (Belgian, German, and so on),
religion (Catholic, Protestant, and so on), and
whether or not one owns a car (yes/no).
Sometimes it can be useful to assign labels, code numbers, or code letters, to the different classes or categories. For example, a Belgian person may be assigned the code “1”, a Dutch person the code “2”, a French person the code “3”, and a German person the code “4”. It is important to note that these figures do not imply any order and/or quantity. Therefore, except for calculations of frequencies and percentages, most arithmetic operations on nominal variables are meaningless.
If a nominal variable implies a logical order between the elements of a sample, then the variable is ordinal. Typical examples of ordinal variables can be found in all kinds of surveys. There, respondents are typically asked whether they consider the quality of a product or service as “1: very good”, “2: good”, “3: moderate”, “4: bad”, or “5: very bad”. In other surveys, the respondents are asked if they “1: strongly disagree”, “2: rather disagree”, “3: neither agree nor disagree”, “4: rather agree”, or “5: strongly agree” with a particular statement. Other examples of ordinal variables include the number of Michelin stars of restaurants and the number of stars of hotels.
An ordinal scale has no fixed measurement unit. This means that the difference between two levels cannot be expressed as a number of units on the measuring scale. For example, the difference between a hotel with three stars and one with two stars is not necessarily the same as the difference between a hotel with two stars and one with only one star. It is obvious that it is also not very useful to perform arithmetic operations with ordinal variables.
A variable that is measured on a quantitative scale can be expressed as a fixed number of measurement units. Examples are length, area, volume, weight, duration, number of bits per unit of time, price, income, waiting time, number of ordered goods, and so on. For quantitative variables, almost all arithmetic operations make sense. This is because the difference between two levels of a quantitative variable, in contrast to the difference between two levels of an ordinal variable, can be expressed as a number of measurement units. Within the class of quantitative variables, a distinction is made between variables that are measured on an interval scale and variables measured on a ratio scale.
An interval scale has no natural zero point, that is, no natural lower limit. For variables measured on an interval scale, calculating ratios is not meaningful. Well-known examples of interval variables are the time read on a clock and the temperature expressed in degrees Celsius or Fahrenheit. The difference between 2 o'clock and 4 o'clock is the same as the difference between 21:00 and 23:00, but 4 o'clock is not twice as late as 2 o'clock. This is because time read on a clock has no absolute zero. The same applies to temperature measured in degrees Celsius: 20 °C is not four times as hot as 5 °C.
A ratio scale does have an absolute zero. Therefore, for variables measured on a ratio scale, ratios can be calculated. A length of 6 cm is twice as much as a length of 3 cm, as the length scale has an absolute zero point. Analogously, an order of six products is twice as large as an order of three products. The temperature measured in kelvin does have an absolute minimum, so that temperature is sometimes measured on a ratio scale. Zero kelvin (−273.15 °C) is the coldest possible temperature, and therefore an absolute lower limit for the temperature.
A discrete variable can only take a finite or countably infinite number of different values, while a continuous variable can take a continuum of values. Examples of discrete variables are the number of passengers on a flight, the number of children in a family, or the number of insurance policies that a family has taken out. Examples of continuous variables are length, duration, weight, and body mass index.
In practice, all observations of a continuous variable are discrete: a continuous length is measured up to a certain accuracy (e.g., one millimeter), thus turned into a discrete number. Nevertheless, we will consider length as a continuous variable.
It is clear that there is a hierarchy in the measurement scales. The highest or most informative measurement scale is the ratio scale, followed by the interval scale, the ordinal, and the nominal scale. Data that has been measured on a certain scale can be transformed into data of a lower measurement scale. Data measured on a ratio scale (e.g., length) are naturally interval scaled (the difference between 6 and 3 cm is the same as the difference between 15 and 12 cm), ordinal (ordering lengths is meaningful), and nominal (lengths can be divided into classes). Conversely, nominal data can never be transformed into ordinal or quantitative data. Therefore, all techniques that are applicable to nominal data are automatically also applicable to ordinal and quantitative data. All techniques that are applicable to ordinal data can be useful for quantitative data. One rarely makes a distinction between data measured on an interval scale and data measured on a ratio scale.
JMP distinguishes between nominal, ordinal, and quantitative variables. The software refers to the measurement scale as the "Modeling Type", and uses "Nominal", "Ordinal", and "Continuous" for nominal, ordinal, and quantitative variables, respectively.
Data is often presented in a matrix, with a row for each element or observation of a sample, and a column for every measured variable. A complete row in a data matrix is sometimes referred to as an observation vector.
Figure 2.1 contains data from a survey on a number of characteristics of Spanish red wines. The sample contains 70 wines. Figure 2.2 shows the symbols that JMP uses to indicate the different measurement scales, "Nominal", "Ordinal", and "Continuous". The variable "Name" is a nominal variable. The variables "Rating" and "Price category" are ordinal variables. The other variables are quantitative. The measurement scale of a variable can be changed in JMP by right-clicking on the name of a column and then selecting "Column Info".
Figure 2.1 Part of the data matrix on Spanish red wines.
Figure 2.2 Symbols used by JMP for the different measurement scales.
In this chapter, we will mainly treat so-called univariate and bivariate representations of variables. A univariate representation refers to one variable, while a bivariate representation refers to two variables simultaneously. Likewise, multivariate data is nothing but data consisting of several variables. In the remainder of the chapter, we assume that we have a data sample. However, the various representations that we will address may also be used for data of entire populations.
Categorical or qualitative variables allow us to put data into categories or classes. The absolute frequency, or simply the frequency, of a class is the number of elements of the sample that belong to that class. The relative frequency of a class is the ratio of the frequency and the total number of observations in the sample.
The data set described here on Spanish wines contains the final rating of the wines. The following coding is used:
E: excellent,
G/E: good to excellent,
G: good,
F/G: fair to good,
F: fair, and
P/F: poor to fair.
The final rating is clearly a qualitative, ordinal variable. The absolute and relative frequencies for each class are shown in Table 2.1, which is called a frequency table. The same information can also be presented using a bar chart. Figure 2.3 shows two versions of a bar chart, which have exactly the same shape. The bar chart in Figure 2.3a shows the absolute frequencies, while that in Figure 2.3b displays the relative frequencies.
Table 2.1 Frequency table for the final rating of the 70 Spanish red wines.

    Rating           E      G/E    G      F/G    F      P/F    Sum
    Abs. frequency   3      5      16     35     9      2      70
    Rel. frequency   0.043  0.071  0.229  0.500  0.129  0.029  1
Figure 2.3 Bar charts for the final rating of Spanish red wines.
It is useful to let JMP know that a rating “Excellent” is better than a rating “Good to excellent”, and that a rating “Good to excellent” is in turn better than a rating “Good”. This can be done by right-clicking on the column heading “Rating”, choosing “Column Properties” in the resulting pop-up menu, and selecting the option “Value Ordering”. To create a bar chart in JMP, one can use the “Chart” option in the “Graph” menu. After choosing that option, the variable “Rating” has to be selected as well as the desired type of chart, “Bar Chart”. For a bar chart showing absolute frequencies, the option “N” has to be chosen under “Statistics”. In order to show relative frequencies instead, the option “% of Total” has to be picked. A frequency table can be obtained in JMP using the option “Tabulate” within the “Analyze” menu. If you want to display the result in a separate data table, you need to select the option “Make Into Data Table” in the pop-up menu that appears when clicking on the red triangle icon next to the word “Tabulate”. This is illustrated in Figure 2.4. Such a red triangle is called a hotspot in JMP. Hotspots appear in practically all reports and data tables. Clicking a hotspot always opens a menu containing additional options that are specific to the graphical or statistical analysis you are doing.
Figure 2.4 Creating a frequency table in JMP.
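For readers who want to reproduce the numbers in Table 2.1 outside JMP, here is a minimal sketch in Python, using the class counts taken from the table:

    from collections import Counter

    # Final ratings of the 70 Spanish red wines, summarized per class (Table 2.1)
    counts = Counter({"E": 3, "G/E": 5, "G": 16, "F/G": 35, "F": 9, "P/F": 2})
    n = sum(counts.values())  # 70 wines in total

    # Absolute and relative frequency per class, in the book's rating order
    for rating in ["E", "G/E", "G", "F/G", "F", "P/F"]:
        print(f"{rating:>4}: {counts[rating]:3d}  {counts[rating] / n:.3f}")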
If the classes are arranged in decreasing order of their frequency and the cumulative frequencies are plotted, the result is called a Pareto chart, a Pareto diagram, or a Pareto plot. The purpose of a Pareto chart is to draw attention to the classes with the highest frequencies.¹ A cumulative representation of the frequencies means that the frequencies of the different classes are summed class by class. This is clarified in the following example.
The quality department of a manufacturer of mobile phones inspected 2530 devices. During the inspection the employees found 115 faulty phones. Devices with scratched surfaces or cracks, deformed devices, and devices with missing parts (incomplete) were labeled as defective. The data, a bar chart, and the corresponding Pareto chart are shown in Figure 2.5.
Figure 2.5 Causes of defective mobile phones in Example 2.3.2.
In the Pareto chart in Figure 2.5c, the left vertical axis is for the bars, while the right vertical axis is for the cumulative relative frequencies shown by means of the black line. The Pareto chart makes it easy to see that the most common problem is missing parts, followed by the occurrence of scratches. The black line accumulates the relative frequencies class by class: the value above the second bar is the relative frequency of the two most common problems together, and adding the relative frequency of devices with cracks yields the next cumulative frequency.
To create a Pareto chart in JMP, one can use the “Analyze” menu. In this menu, the option “Quality and Process” has to be chosen first. The next step is to select the option “Pareto Plot”. Figure 2.6 shows the resulting dialog window, in which the variable “Type of Defect” has to be entered in the field “Y, Cause”, and the variable “Absolute Frequency” has to be entered in the field “Freq”.
Figure 2.6 Dialog window for creating a Pareto chart in JMP.
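The cumulative frequencies behind a Pareto chart are straightforward to compute by hand. The Python sketch below illustrates the calculation; note that only the total of 115 defective phones is given in the text, so the per-category counts are hypothetical, chosen merely to match the ordering visible in Figure 2.5c:

    # Hypothetical per-category counts (they sum to the 115 defective phones)
    defects = {"missing parts": 45, "scratched": 40, "cracks": 20, "deformed": 10}
    total = sum(defects.values())

    # Sort the classes by decreasing frequency and accumulate relative frequencies
    cumulative = 0.0
    for cause, count in sorted(defects.items(), key=lambda item: -item[1]):
        rel = count / total
        cumulative += rel
        print(f"{cause:>13}: rel. freq. {rel:.3f}  cumulative {cumulative:.3f}")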
Another graphical representation of absolute and relative frequencies for a qualitative variable is the pie chart.
Figure 2.7 shows the market share (in percent) of various operating systems on smartphones in the first quarter of 2012. One possible way to make a pie chart in JMP is via the menu “Graph”, by using the option “Chart”, and selecting “Pie Chart”.
Figure 2.7 Market share (in percent) of operating systems for smartphones in the first quarter of 2012.
The stem and leaf diagram is an interesting representation of quantitative data, because it not only gives a picture of the frequencies of the various values of the variable under study, but also preserves every individual observation.
Figure 2.8