An accessible text that explains fundamental concepts in business statistics that are often obscured by formulae and mathematical notation.
A Guide to Business Statistics offers a practical approach to statistics that covers the fundamental concepts in business and economics. The book maintains the level of rigor of a more conventional textbook in business statistics but uses a more streamlined and intuitive approach. In short, A Guide to Business Statistics provides clarity to the typical statistics textbook cluttered with notation and formulae. The author, an expert in the field, offers concise and straightforward explanations of the core principles and techniques in business statistics. The concepts are introduced through examples, and the text is designed to be accessible to readers with a variety of backgrounds. To enhance learning, most of the mathematical formulae and notation appear in technical appendices at the end of each chapter.
This important resource:
* Offers a comprehensive guide to understanding business statistics targeting business and economics students and professionals
* Introduces the concepts and techniques through concise and intuitive examples
* Focuses on understanding by moving distracting formulae and mathematical notation to appendices
* Offers intuition, insights, humor, and practical advice for students of business statistics
* Features coverage of sampling techniques, descriptive statistics, probability, sampling distributions, confidence intervals, hypothesis tests, and regression
Written for undergraduate business students, business and economics majors, teachers, and practitioners, A Guide to Business Statistics offers an accessible guide to the key concepts and fundamental principles in statistics.
Page count: 351
Year of publication: 2018
Cover
Title Page
Copyright
Dedication
Preface
Addressing Two Challenges
How to Use This Book
Target Audience
Chapter 1: Types of Data
1.1 Categorical Data
1.2 Numerical Data
1.3 Level of Measurement
1.4 Cross-Sectional, Time-Series, and Panel Data
1.5 Summary
Chapter 2: Populations and Samples
2.1 What is the Population of Interest?
2.2 How to Sample From a Population?
2.3 Getting the Data
2.4 Summary
Chapter 3: Descriptive Statistics
3.1 Measures of Central Tendency
3.2 Measures of Variability
3.3 The Shape
3.4 Summary
Technical Appendix
Chapter 4: Probability
4.1 Simple Probabilities
4.3 Conditional Probabilities
4.4 Summary
Technical Appendix
Chapter 5: The Normal Distribution
5.1 The Bell Shape
5.2 The Empirical Rule
5.3 Standard Normal Distribution
5.4 Normal Approximations
5.5 Summary
Technical Appendix
Chapter 6: Sampling Distributions
6.1 Defining a Sampling Distribution
6.2 The Importance of Sampling Distributions
6.3 An Example of a Sampling Distribution
6.4 Characteristics of a Sampling Distribution of a Mean
6.5 Sampling Distribution of a Proportion
6.6 Summary
Technical Appendix
Chapter 7: Confidence Intervals
7.1 Confidence Intervals for Means
7.2 Confidence Intervals for Proportions
7.3 Sample Size and the Width of Confidence Intervals
7.4 Comparing Two Proportions From the Same Poll
7.5 Summary
Technical Appendix
Chapter 8: Hypothesis Tests of a Population Mean
8.1 Two-Tail Hypothesis Test of a Mean
8.2 One-Tail Hypothesis Test of a Mean
8.3 p-Value Approach to Hypothesis Tests
8.4 Summary
Technical Appendix
Chapter 9: Hypothesis Tests of Categorical Data
9.1 Two-Tail Hypothesis Test of a Proportion
9.2 One-Tail Hypothesis Test of a Proportion
9.3 Using p-Values
9.4 Chi-Square Tests
9.5 Summary
Technical Appendix
Chapter 10: Hypothesis Tests Comparing Two Parameters
10.1 The Approach in this Chapter
10.2 Hypothesis Tests of Two Means
10.3 Hypothesis Tests of Two Variances
10.4 Hypothesis Tests of Two Proportions
10.5 Summary
Technical Appendix
Chapter 11: Simple Linear Regression
11.1 The Population Regression Model
11.2 A Look at the Data
11.3 Ordinary Least Squares (OLS)
11.4 The Distribution of and
11.5 Tests of Significance
11.6 Goodness of Fit
11.7 Checking for Violations of the Assumptions
11.8 Summary
Technical Appendix
Chapter 12: Multiple Regression
12.1 Population Regression Model
12.2 The Data
12.3 Sample Regression Function
12.4 Interpreting the Estimates
12.5 Prediction
12.6 Tests of Significance
12.7 Goodness of Fit
12.8 Multicollinearity
12.9 Summary
Technical Appendix
Chapter 13: More Topics in Regression
13.1 Hypothesis Tests Comparing Two Means With Regression
13.2 Hypothesis Tests Comparing More Than Two Means (ANOVA)
13.3 Interacting Variables
13.4 Nonlinearities
13.5 Time-Series Analysis
13.6 Summary
Index
End User License Agreement
Chapter 1: Types of Data
Figure 1.1 Number of votes for each party in U.S. presidential elections after World War II.
Chapter 2: Populations and Samples
Figure 2.1 A comparison of populations of interest for political polls between the primary elections and the general election.
Figure 2.2 The figure illustrates a simple random sample of size 100 taken from the population of Titanic passengers. A marker indicates a survivor. In this example, 35 of the 100 passengers sampled survived the Titanic crash.
Figure 2.3 The figure illustrates a stratified sample of size 100 taken from the population of Titanic passengers. A marker indicates a survivor. In this sample, 4 of 5 children survived, 15 of 19 women survived, and 14 of 76 men survived for a total of 33/100.
Chapter 3: Descriptive Statistics
Figure 3.1 Crime rates in the United States in 2014.
Figure 3.2 Exam grades from a business statistics course with 92 students.
Chapter 4: Probability
Figure 4.1 All possible outcomes (events) from one roll of a six-sided die.
Figure 4.2 Venn diagram showing the intersection for drawing either an Ace or a Red card.
Chapter 5: The Normal Distribution
Figure 5.1 The Figure illustrates a smooth normal distribution (dark line) drawn over a histogram of data for a variable .
Figure 5.2 The Empirical Rule for normally distributed data.
Figure 5.3 Beyond the Empirical Rule: Finding an area under the normal distribution.
Figure 5.4 A section of the cumulative z-table.
Figure 5.5 Probability of getting 30 heads or less from 50 flips of a coin.
Chapter 6: Sampling Distributions
Figure 6.1 Population distribution of commute times.
Figure 6.2 Population and sampling distributions of the mean.
Figure 6.3 Population income data of adult Americans.
Figure 6.4 Population distribution and sampling distribution () of course grades in statistics.
Figure 6.5 Sampling distributions for proportions when at different sample sizes.
Chapter 7: Confidence Intervals
Figure 7.2 A comparison of the t-distribution with the z-distribution when sample sizes are small. When sample sizes get bigger, the t-distribution converges to the z-distribution.
Figure 7.1 95% confidence intervals for the sampling distribution of the mean.
Chapter 8: Hypothesis Tests of a Population Mean
Figure 8.1 Two-tail hypothesis test critical values and rejection regions at the 0.05 significance level.
Figure 8.2 Two-tail hypothesis test critical values and rejection regions at the 0.01 significance level.
Figure 8.3 One-tail hypothesis test critical values and rejection regions at the 0.05 significance level.
Figure 8.4 Finding the probability of a test statistic farther into the tail than .
Chapter 9: Hypothesis Tests of Categorical Data
Figure 9.1 Sampling distribution of the proportion of blue M&M's for samples of 50.
Figure 9.2 Two-tail hypothesis test critical values and rejection regions at the 0.05 significance level.
Figure 9.3 One-tail hypothesis test critical values and rejection region at the 0.05 significance level.
Figure 9.4 Chi-square distribution and critical value at the 0.05 significance level with three degrees of freedom.
Chapter 10: Hypothesis Tests Comparing Two Parameters
Figure 10.1 A general two-tail hypothesis test comparing two population means.
Figure 10.2
Figure 10.3 Two-tail hypothesis test comparing two population means at the 0.05 significance level and = 73.
Figure 10.4
Figure 10.5 Two-tail F-test of two population variances at α = 0.05.
Figure 10.6 Folded F-test of two population variances at α = 0.05.
Figure 10.7 Hypothesis tests of two proportions at α = 0.05.
Chapter 11: Simple Linear Regression
Figure 11.1 Course grades (Y) versus classes attended (X).
Figure 11.2
Figure 11.3 Scatterplot of course grades and attendance with the fitted regression line.
Figure 11.4 Test of significance of the slope at .
Figure 11.5 Histogram of the 40 residuals for grades and attendance.
Figure 11.6 A violation of the assumption that the errors are distributed normally (the residuals appear to be left-skewed).
Figure 11.7 Residuals plotted against the number of classes attended.
Figure 11.8 A violation of the homoskedasticity assumption that the errors have constant variance (the residuals appear to be heteroskedastic since deviations increase as X increases).
Chapter 12: Multiple Regression
Figure 12.1 Regression output for course grades as a function of a vector of independent variables.
Figure 12.2 Regressing attendance on the other independent variables.
Figure 12.3 Regression results on home prices.
Figure 12.4 Regression results on home prices with a number of bathrooms and a number of bathroom sinks.
Chapter 13: More Topics in Regression
Figure 13.1 Comparing average starting salaries between economics and accounting majors using regression.
Figure 13.2 Comparing average starting salaries using regression.
Figure 13.3 Wage as a function of experience and gender.
Figure 13.4 Sample regression functions for both males and females.
Figure 13.5 Potential nonlinear relationships between expected Y (vertical axes) and X (horizontal axes).
Figure 13.6 Women's long jump world records.
Chapter 1: Types of Data
Table 1.1 Student characteristics from an undergraduate course in business statistics
Table 1.2 American presidential election voting results (in millions) post World War II
Chapter 2: Populations and Samples
Table 2.1 A list of useful data sources found online
Chapter 3: Descriptive Statistics
Table 3.1 Violent crime rates in the United States in 2014 (per 100,000 people)
Table 3.2 Minutes required to read a short passage
Chapter 4: Probability
Table 4.1 All possible pairs from simultaneously rolling two dice
Table 4.2 Rolling a 2 from a pair of dice
Table 4.3 Rolling a 3 from a pair of dice
Table 4.4 Titanic passengers and crew
Table 4.5 Titanic survivors
Table 4.6 Conditional probabilities of surviving the Titanic's maiden voyage
Table 4.7 Contingency Table of cancer prevalence and test results
Chapter 5: The Normal Distribution
Table 5.1 Cumulative standard normal table – negative z-scores
Chapter 6: Sampling Distributions
Table 6.1 Commute time for a population of six workers
Table 6.2 All 20 unique samples of size three
Table 6.3 A sampling distribution of the mean
Chapter 7: Confidence Intervals
Table 7.2 Critical values
Table 7.1 Comparing 95% confidence intervals using critical and values
Chapter 8: Hypothesis Tests of a Population Mean
Table 8.1 Caffeine content in 16 oz cups (in mgs) for a sample of 50
Chapter 9: Hypothesis Tests of Categorical Data
Table 9.1 Random sample of 50 M&M's (1 = Blue; 0 = Not Blue)
Table 9.2 Observed gender and political party affiliation from a sample of n = 149,192
Table 9.3 Expected gender and political party affiliation from a sample of n = 149,192
Table 9.4 Number of M&M's by color in a sample of 50 and the expected number
Table 9.5 Chi-square (χ²) critical values.
Chapter 10: Hypothesis Tests Comparing Two Parameters
Table 10.1 Starting salaries from samples of recent graduates in economics and accounting
Table 10.2 Hours of sleep for the same participants with and without consuming caffeine in the day
Chapter 11: Simple Linear Regression
Table 11.1 Sample data from 40 students of their course grade (Y) and the number of classes attended (X)
Chapter 12: Multiple Regression
Table 12.1 Housing prices as a function of square footage, lot size, and the number of bathrooms (6 of 30 observations shown)
Table 12.2 Housing prices as a function of square footage, lot size, the number of bathrooms, and number of bathroom sinks (6 of 30 observations shown)
Chapter 13: More Topics in Regression
Table 13.1 Starting salaries of economics and accounting majors using a dummy variable (10 of 75 observations)
Table 13.2 GPAs for a sample of geeks, dweebs, and nerds (10 of 50 observations shown)
David M. McEvoy
This edition first published 2018
© 2018 John Wiley & Sons, Inc.
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.
The right of David M. McEvoy to be identified as the author of this work has been asserted in accordance with law.
Registered Office
John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA
Editorial Office
111 River Street, Hoboken, NJ 07030, USA
For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.
Wiley also publishes its books in a variety of electronic formats and by print-on-demand. Some content that appears in standard print versions of this book may not be available in other formats.
Limit of Liability/Disclaimer of Warranty
The publisher and the authors make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of fitness for a particular purpose. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for every situation. In view of on-going research, equipment modifications, changes in governmental regulations, and the constant flow of information relating to the use of experimental reagents, equipment, and devices, the reader is urged to review and evaluate the information provided in the package insert or instructions for each chemical, piece of equipment, reagent, or device for, among other things, any changes in the instructions or indication of usage and for added warnings and precautions. The fact that an organization or website is referred to in this work as a citation and/or potential source of further information does not mean that the author or the publisher endorses the information the organization or website may provide or recommendations it may make. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. No warranty may be created or extended by any promotional statements for this work. Neither the publisher nor the author shall be liable for any damages arising herefrom.
Library of Congress Cataloguing-in-Publication Data:
Names: McEvoy, David M. (David Michael), author.
Title: A guide to business statistics / by David M. McEvoy.
Description: Hoboken, NJ : John Wiley & Sons, Inc., 2018. | Includes bibliographical references and index. |
Identifiers: LCCN 2017051197 (print) | LCCN 2017054561 (ebook) | ISBN 9781119138365 (pdf) | ISBN 9781119138372 (epub) | ISBN 9781119138358 (pbk.)
Subjects: LCSH: Commercial statistics.
Classification: LCC HF1017 (ebook) | LCC HF1017 .M37 2018 (print) | DDC 519.5-dc23
LC record available at https://lccn.loc.gov/2017051197
Cover Design: Wiley
Cover Image: Derivative of “Rock Climbing in Joshua Tree National Park” by Contributor7001 is licensed under CC BY-SA
Dedicated to my students who managed to stay awake during class, and to my family who are clearly a few standard deviations above the mean: Marta, Leo, Sofia, and Oscar
When the Boston Red Sox traded Babe Ruth to the New York Yankees in 1919, the Red Sox were one of the most successful baseball teams in history. At that time, they held five World Series titles, with the most recent in 1918. That trade would start an 86-year dry spell for the Red Sox, during which they would not win a single World Series title. It would also start what baseball fans know as the Curse of the Bambino. The Curse supposedly made Johnny Pesky hesitate at shortstop in a routine throw home in game seven of the 1946 World Series. The Curse showed up when Bob Stanley threw a wild pitch in game six of the 1986 World Series that let the tying run in, and stayed to see Bill Buckner let a ground ball pass between his legs at first base. The Red Sox finally broke the curse in 2004, beating the St. Louis Cardinals. How did the Boston Red Sox break the Curse of the Bambino? Statistics.
OK, perhaps attributing the Red Sox's 2004 title and the two that followed entirely to statistics is a bit of a reach. Statistics, however, played a role. In 2002, Theo Epstein was hired as the general manager (GM) for the Red Sox. He was the youngest GM in the history of major league baseball. Epstein relied heavily on statistics when building team rosters and making managerial decisions. He was an early adopter of what is called sabermetrics – the statistical analysis of baseball. His approach focused on utilizing undervalued players, including those who were on the verge of leaving the game because no other team would sign them. The movement was away from flashy players with big risks and big rewards to the more inconspicuous workhorses. It worked. Of course, it is possible that Theo Epstein and the Boston Red Sox just got lucky. Consider, however, that Theo Epstein was hired as the President of Baseball Operations for the Chicago Cubs in 2011. In 2016, the Cubs would win their first World Series in 108 years. It would end yet another curse – the Curse of the Billy Goat – that prevented the Cubs from winning for 71 years. Again, statistics.
Over the past dozen years, I have taught courses in business statistics to thousands of undergraduate students. As an instructor, one of the challenges with teaching statistics is trying to convince students that the material is important. I usually take two approaches. The first is to persuade students that they need to understand statistics as consumers of information. We are bombarded with information every single day and it is coming at us from every direction. Our news sources and social media platforms are crawling with statistics. On a Monday, I may learn that coffee is good for me and by Wednesday it is now the kiss of death. In the 1980s, eggs were cholesterol-filled heart attack triggers and today they are considered the perfect food. On any given day, I can read about studies that tell me how to live longer, run faster, have more energy, make more money, be a better parent, and be happier. These types of studies all rely on statistics. Some of the information we get is from scientific studies – those that rely on the scientific method – but other information is very ad hoc. Understanding what the statistics tell us, how they are calculated, and the samples they are derived from is key to processing all of the information we consume. Understanding statistics can help you pick out the nuggets of useful information from the big mess of the modern information age.
The other approach I take in trying to convey the importance of statistics is to appeal to the students as producers of information. It is probably safe to assume that most people do not enjoy cranking through formulas and poring over spreadsheets of data. However, everyone is interested in something. Perhaps you are interested in investing in the stock market and you need to decide which firms to invest in. Maybe you need to convince your boss which social media platform to advertise on. Maybe you need to persuade your parents that spending a semester studying abroad is a useful experience. The point is that everyone has interesting questions, and answering those questions usually requires some form of data analysis. Just having data is not enough; you need to know how to release its secrets.
The second challenge with teaching statistics is that, in my experience, many students dread the thought of the subject, and often walk through the door the first day of class already resigned to the idea that they will hate it. Typically, students believe that they will dislike statistics because they consider the subject too hard, or it requires too much math. Adding to the list of students' fears and concerns is the fact that most of the materials created for undergraduate courses in business statistics try to accomplish too many things and as a result are overwhelming. Textbooks try to balance a mix of theory, intuition, formulas, case studies, datasets, applets, problem sets, and the practical use of particular software programs. All of these are important objectives, but when blended together each tends to get crowded out. In my experience, students use their statistics textbooks as reference guides to look up formulas or functions, but in the process miss the fundamental concepts and intuition.
The objective of this book is to help ease both of these challenges. The goal of each chapter is to first motivate a particular section of business statistics and then walk through the concepts in an intuitive fashion. The book is driven by examples, and many of the examples span multiple chapters. The book was written with a goal of removing many of the distractions students encounter in their statistics textbooks. Mathematical formulas and much of the notation are relegated to technical appendices at the end of each chapter. There are no online applets, data downloads, or breakout case studies. The prose is written so that it is hopefully inviting to students with different backgrounds and experiences. The focus is more on developing intuition and understanding the fundamentals than it is on being a comprehensive catalog of statistical tests.
This book is not designed to be used as a primary source of information for an undergraduate statistics course. It does not cover every figure, statistic, or hypothesis test you will find in a comprehensive textbook. It is meant to be a supplement to a more detailed textbook and/or a set of lecture notes. It should be thought of as a companion guide with the goal of helping students get a better grasp on the fundamentals. In this way, the primary textbook serves as the comprehensive catalog of information and, perhaps, the source of assessment materials, while A Guide to Business Statistics serves as the source for students to strengthen their intuition about the concepts and their applicability. However, for classes in which the instructor provides all the required technical details in the lecture notes and does not rely on a textbook to assign problems, homework, or practice datasets, A Guide to Business Statistics can serve as a primary textbook. In these cases, students will read the book to complement the material covered in lecture with the goal of providing an intuitive and example-driven approach to better understand the material. This book maintains the level of rigor of a standard textbook in business statistics, but with a more streamlined approach and accessible explanation of the material.
It is not surprising that most students do not read their undergraduate statistics textbooks in a linear fashion. If anything, they tend to skim through the pages in search of formulas, tables, or functions. The chapters in most statistics textbooks are very difficult to read from start to finish, and to be fair they are not designed for that approach. This book is designed to be easy to read and, most importantly, concise. Students should open a chapter and read it from start to finish and at the end have a good understanding of the core concepts for that section. The chapters include examples, simple tables and figures, and a technical appendix with the formulas. At the end of each chapter (before the appendix), the key elements are reinforced in a brief summary paragraph. To maintain its readability in a linear fashion, the book purposefully avoids problem sets, animations, video clips, and interactive materials.
Another important distinction between standard textbooks and A Guide to Business Statistics is the treatment of statistical software programs. Textbooks are increasingly focused on how to better integrate statistics software (e.g., Excel, SPSS, and Minitab) with the course material. This is important because students should be able to use technology to analyze data and produce statistical output. However, while many students are capable of running a statistical test in a program like Excel, there is often a lack of general understanding regarding the underlying concepts and interpretations of the results. For example, most students can successfully create a confidence interval if provided a dataset. Fewer students can correctly interpret a confidence interval, and fewer still can explain the theorems those interpretations are grounded in. I would argue that understanding the underlying concepts in statistics is more important than learning how to use a certain software package to generate statistical output. The technology is going to change, but the concepts and theorems that are fundamental to statistics are not tied to specific platforms. This book does contain references to statistical functions in Excel, especially in the chapters on regression analysis. Software programs like Excel are absolutely required for any analysis of large datasets. The point of this book, however, is not to develop a student's skill set in any particular software program. Running a regression in Excel is just as easy as in SPSS or Minitab. The point, rather, is to help interpret the output that is produced by any software program.
The trajectory of the chapters follows most of the standard textbooks in business statistics. The coverage of the material in each chapter is designed to be “narrow and deep” rather than “broad and shallow.” That said, in my experience, all of the key materials required in a first and second course in undergraduate business statistics are covered in this book. The first half of the book concentrates on how we collect and describe data (Chapters 1–6), and the second half focuses on how to use sample data to make inferences about things we do not know about a population of interest (Chapters 7–13). The chapters on inferential statistics focus on parametric tests – those that assume that the data follow a particular type of distribution. These are the most common tests in business and other social sciences. The final three chapters of the book cover linear regression techniques.
This book should serve as a useful guide for all undergraduate statistics students in business and economics, regardless of the specific primary textbook (if any) they are using in their course. Almost all business and economics majors are required to complete a course in statistics, and many 4-year business programs require two courses as part of the major. In addition, most 2-year colleges offer an introductory course in statistics. When two courses are required, it is often the case that the same primary textbook is used in both courses. A Guide to Business Statistics is geared to students taking both their first and second courses in statistics. The first course is typically taken as a freshman or sophomore and the second as a junior or senior. The book, therefore, should prove useful over all four undergraduate years.
Although the book is geared toward students in higher education, it may be a helpful resource to faculty and instructors who have been away from statistics for some time. It can serve as a concise “refresher” resource for teachers and practitioners.
Steven Wright once joked that “42.7% of all statistics are made up on the spot.”¹ One reason that his quip is effective is that there are good reasons to be suspicious of many of the statistics we encounter every day. Statistics are often reported as hard facts that cannot be argued with. This is not so. Statistics, and the data that the statistics are derived from, are generated by humans. Humans are not infallible and neither are the numbers reported from analyzing the data. As consumers of information, we sometimes encounter statistics that are simply wrong or even nonsensical. There are examples of peer-reviewed publications reporting 200% reductions in some metric. Even reductions of 12,000% have been reported.² Without even glancing at the data analyzed in these studies, we know that such statistics are nonsense. You cannot decrease anything by more than 100%. Once you lose 100% of stuff, you are out of stuff. We tend to believe assertions when they are based on data. The problem is that we often do not look carefully at what type of data is being analyzed, how the data were gathered, and whether the results are valid. To be an active and informed citizen, you need to understand a bit about how statistics are generated and what they can tell us. It all starts with understanding the type of data being analyzed, which is the focus of this first chapter.
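The arithmetic behind the 100% limit can be sketched in a few lines of Python. The function name percent_change and the example values are illustrative assumptions and are not taken from the studies mentioned above.

```python
def percent_change(before: float, after: float) -> float:
    """Return the percentage change from `before` to `after`."""
    if before == 0:
        raise ValueError("percentage change is undefined when the starting value is 0")
    return (after - before) / before * 100.0


# Hypothetical example: a count that starts at 50.
print(percent_change(50, 25))  # half of it is gone: -50.0
print(percent_change(50, 0))   # all of it is gone: -100.0

# For a quantity that cannot go below zero, no ending value
# produces a change smaller than -100%.
print(min(percent_change(50, after) for after in range(0, 51)))  # -100.0
```

A reported "200% reduction" would require the ending value to be negative, which is impossible for counts, rates, and most other metrics.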
In the broadest terms, statistics is the science of collecting, analyzing, and interpreting data. One branch of statistics is concerned with how to describe and present data in useful ways (descriptive statistics) and the other branch is concerned with how to use samples of data to draw conclusions about unknown characteristics of a larger population (inferential statistics). In either case, the starting point is understanding a bit about data. Often, when students hear the term data or data analysis, they picture some geek crunching through endless columns of numbers in search of answers. The truth is that data are simply organized information. Data do not have to be numeric, and not all numeric data can be treated the same way. One great thing about the modern state of technology and connectivity is that we have access to incredible amounts of interesting, and often peculiar, datasets. For example, you can read the last words of every executed criminal in the state of Texas since 1982.³ Or, if you think that is too morbid, you may be interested in the location, speed, age, and height of amusement park rollercoasters found all over the world.⁴ Perhaps you want to rank every character on The Simpsons by the number of words they spoke between season 1 and season 26.⁵ The point is that there is so much data available to the public that the possibilities are endless. If you want to get weird, get weird.⁶ You can let your imagination lead you to data, but let this book guide you on how to analyze it.
The important point is to recognize what type of data you are working with because that will dictate the way you analyze it. In this chapter, we consider the taxonomy of different data types. To begin, all data can be broadly classified as either categorical or numerical.
Categorical data (also called qualitative data) have values described by words rather than numbers. Examples include gender, occupation, major, and location. Often, categorical data are represented with codes to make them easier to manage and manipulate. For example, a dataset that includes college majors may convert accounting = 1, economics = 2, and marketing = 3. The important distinction between these codes and numeric data is that the codes typically do not convey a ranking; they are just a way to organize categorical data. When data can be classified by two categories, we call that binary data. An example is gender, coded as female = 1 and male = 0. Even when data have more than two categories, the qualitative data can often be represented in binary form. As an example, consider the three majors: accounting, economics, and marketing. If each observation in a dataset is a single student, then three binary variables (accounting, economics, and marketing) could be generated. When any of the three binary variables takes a value of 1, it indicates that the student is majoring in the respective field. A 0, on the other hand, indicates that the student is not majoring in that field.
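The conversion of a three-category variable into three binary variables can be sketched in Python as follows. This is a minimal illustration: the function name to_binary and the two student names are hypothetical, and only the three majors come from the text.

```python
MAJORS = ["accounting", "economics", "marketing"]


def to_binary(major: str) -> dict[str, int]:
    """Represent one student's major as three binary (0/1) variables."""
    if major not in MAJORS:
        raise ValueError(f"unknown major: {major}")
    # Exactly one variable is 1: the one matching the student's major.
    return {m: int(m == major) for m in MAJORS}


# Hypothetical students, one observation per student.
students = {"Ann": "economics", "Bob": "marketing"}
for name, major in students.items():
    print(name, to_binary(major))
```

Unlike the codes 1, 2, and 3, the binary variables carry no accidental ordering: each one only answers a yes-or-no question about a single major.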
To illustrate the use of categorical data, consider the dataset in Table 1.1. The dataset includes the characteristics of students taking an undergraduate course in business statistics. The first two columns of data, Student and Dorm, are categorical. These include the student's first name and the name of the dorm each student lives in on campus. While it may be possible to apply codes to these categorical variables (e.g., student IDs in place of names), those numbers would just be used as an alternative way to categorize data and would not reflect magnitudes or ranking.
The remaining three variables in Table 1.1 (Floor, GPA, and SAT Rank) are numeric. The variable Floor denotes which floor each student lives on in their respective dorm. The numbers follow European conventions, with 0 being the ground floor and negative numbers indicating floors below ground. The variable GPA is the student's grade point average capped at 4.0, and the variable SAT Rank ranks each student in terms of their SAT score, with 1 being the student with the highest SAT score.
Table 1.1 Student characteristics from an undergraduate course in business statistics
Student   Dorm        Floor   GPA    SAT rank
Barry     Hawthorne   5       3.98   1
Cindy     Whittier    3       2.87   10
Stan      Dickinson   1       1.98   9
Donna     Dickinson   1       4.00   2
Drew      Whittier    2       3.20   5
Wilbur    Fairchild   0       2.56   6
Frank     Hawthorne   4       2.98   8
Jose      Emerson     2       3.12   7
Paul      Hawthorne   1       3.45   4
Steve     Emerson     5       3.88   3
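As a rough illustration of the split between categorical and numeric columns, the sketch below stores Table 1.1 as a list of records and labels each column by whether its values are text. The helper `column_kind` is a hypothetical name, and the rule it applies is a simplification: categorical data stored as numeric codes would be mislabeled as numeric.

```python
# Table 1.1 as a list of records (one dictionary per student).
students = [
    {"Student": "Barry",  "Dorm": "Hawthorne", "Floor": 5, "GPA": 3.98, "SAT rank": 1},
    {"Student": "Cindy",  "Dorm": "Whittier",  "Floor": 3, "GPA": 2.87, "SAT rank": 10},
    {"Student": "Stan",   "Dorm": "Dickinson", "Floor": 1, "GPA": 1.98, "SAT rank": 9},
    {"Student": "Donna",  "Dorm": "Dickinson", "Floor": 1, "GPA": 4.00, "SAT rank": 2},
    {"Student": "Drew",   "Dorm": "Whittier",  "Floor": 2, "GPA": 3.20, "SAT rank": 5},
    {"Student": "Wilbur", "Dorm": "Fairchild", "Floor": 0, "GPA": 2.56, "SAT rank": 6},
    {"Student": "Frank",  "Dorm": "Hawthorne", "Floor": 4, "GPA": 2.98, "SAT rank": 8},
    {"Student": "Jose",   "Dorm": "Emerson",   "Floor": 2, "GPA": 3.12, "SAT rank": 7},
    {"Student": "Paul",   "Dorm": "Hawthorne", "Floor": 1, "GPA": 3.45, "SAT rank": 4},
    {"Student": "Steve",  "Dorm": "Emerson",   "Floor": 5, "GPA": 3.88, "SAT rank": 3},
]

def column_kind(rows, column):
    """Label a column 'categorical' if every value is text, else 'numeric'.

    Simplifying assumption: coded categories (e.g., major = 1, 2, 3)
    would need to be labeled by hand.
    """
    values = [row[column] for row in rows]
    if all(isinstance(value, str) for value in values):
        return "categorical"
    return "numeric"

kinds = {column: column_kind(students, column) for column in students[0]}
for column, kind in kinds.items():
    print(f"{column}: {kind}")
```

A check like this is only a first pass. Deciding how a variable should be analyzed still depends on what its values mean, not on how they happen to be stored.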
Numerical, or quantitative, data result from some form of counting, measurement, or computation. Numeric data are broken down into variables that are discrete or continuous. Discrete data are typically thought of as variables that are countable, in which fractions do not make sense. Often, these are integer values, and examples include the number of courses taken, number of credit hours earned, number of children, number of flights, and the number of absences. You may notice that the terminology “number of” often precedes the description of a discrete variable. In our dataset in Table 1.1, the variables Floor and SAT Rank are both discrete numeric variables. Clearly, the number of floors is countable and fractions of a floor do not make sense.7 The variable SAT Rank is also discrete. The SAT rankings are integer values, can be counted, and are definitely not divisible.
In contrast, continuous variables can take on any value within an interval. Continuous data are not counted; they are usually measured. With continuous data “fractions make sense.” Examples include weight, speed, height, distance, prices, and interest rates. Even if continuous data are rounded so that only integer values are reported, the data are still continuous. Age, for example, is typically reported in integer values. However, age can be measured very precisely by years, days, minutes, seconds, milliseconds, and so on. The same is usually true of prices and other financial data. These are continuous measures that are rounded for convenience. They are not counted. The variable GPA in Table 1.1 is continuous.
In the later chapters, we sometimes blur the lines between discrete and continuous data. For example, the number of votes candidates receive in a presidential election is discrete. Why? Because votes are counted and fractions do not make sense. However, when the range of values is so large (e.g., millions of votes) that a difference of one unit (e.g., one vote) is negligible, we sometimes treat discrete data as continuous.
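The point about age can be made concrete with a short Python sketch. Both dates below are hypothetical, and dividing by 365.25 is only an approximation of a year. The aim is to show that the underlying measure is continuous even though the reported value is an integer.

```python
from datetime import date

# Hypothetical dates chosen only for illustration.
birth_date = date(2000, 1, 15)
reference_date = date(2018, 6, 30)

# Age measured more precisely, in days and in (approximate) years.
age_in_days = (reference_date - birth_date).days
age_in_years = age_in_days / 365.25  # approximation: average year length

# Age as it is usually reported: truncated to a whole number of years.
age_reported = int(age_in_years)

print(f"Measured age: {age_in_years:.4f} years ({age_in_days} days)")
print(f"Reported age: {age_reported} years")
```

The reported integer hides the fractional part, but that fraction still "makes sense," which is what separates a rounded continuous variable from a truly discrete one.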
When data are categorical (or qualitative), the level of measurement is called nominal. Nominal data have no meaningful order, and any numbers attributed to data values are simply for coding purposes. Denoting female observations with the number 1 and male observations with the number 0 is an example. The numbers are not meaningful on their own, and they could be substituted with any other numbers without affecting the results. Dividing your classmates into geeks, dweebs, and nerds, for instance, would require nominal measurement. Simply coding students in one category, even if the code is numeric, has no meaning in terms of relative rank. The level of measurement for the two categorical variables Student and Dorm in Table 1.1 is nominal.
Data that are ordinal in nature suggest that there is a meaningful ranking among the data, but there is no clear measurement regarding the distances between values. Placement in a race, for instance, could be denoted as first, second, third, and so on. Without additional clarifying data, the rankings are meaningful because we know that the second place runner finished before the third place runner, but we do not know how much faster the second place runner was relative to the third place runner. Another example is placement in an Olympic event, where gold is better than silver, which is better than bronze. However, those rankings do not convey how much better the gold medal winner was compared to the silver medal winner. Data on vehicle size could also be ordinal if it were classified as 3 = full size, 2 = compact, or 1 = subcompact. Clearly, the categories are ranked in terms of size, but it is unclear how much bigger a full-size car is compared to a subcompact car. In Table 1.1, the variable SAT Rank is ordinal. The ranking indicates which student scored higher on the SAT exam (1 indicating the highest score), but it does not tell us how far the highest score is from the second highest, and so on.
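The race example can be sketched in Python to show what a ranking keeps and what it discards. The runners and finishing times (in seconds) are hypothetical values invented for this illustration.

```python
# Hypothetical finishing times in seconds (not from any real race).
finish_times = {"Runner A": 600, "Runner B": 605, "Runner C": 700}

# Ordinal information: placement from fastest to slowest.
ordered = sorted(finish_times, key=finish_times.get)
placement = {runner: place for place, runner in enumerate(ordered, start=1)}

# Information the ranking throws away: the gap between consecutive finishers.
gaps = [finish_times[later] - finish_times[earlier]
        for earlier, later in zip(ordered, ordered[1:])]

print(placement)  # order of finish
print(gaps)       # seconds between 1st and 2nd, then 2nd and 3rd
```

The placements are evenly spaced (1, 2, 3), yet the underlying gaps are very different. Someone who sees only the ordinal variable cannot recover those distances.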
Interval data are numeric and have both a meaningful ranking and measurable distances between values. The defining feature of interval data is that there is no true zero. With interval data, a zero does not mean that the variable has no value. Temperature is the classic example. A temperature of zero degrees Celsius does not mean there is an absence of temperature. Without a true zero, the numeric values cannot be divided or multiplied and still retain their meaning. A temperature of 20 degrees, for example, is not twice as warm as 10 degrees. The intervals between measures can be interpreted with precision (e.g., there is a 10-degree difference between 10 and 20 degrees), but we cannot say that 20 degrees is twice as warm. However, it is still possible to calculate an average with interval data (e.g., average temperature) and measures of variability. The variable Floor in Table 1.1 is interval data. A zero value does not mean the absence of a floor; it is simply a reference point. This reference point can change. For example, in the United States, the ground floor of most buildings is typically a positive number. Interval data may be discrete or continuous.
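One way to see why ratios are not meaningful for interval data is to change the unit of measurement. The sketch below converts Celsius to Fahrenheit with the standard formula; the helper name `celsius_to_fahrenheit` and the three temperatures are chosen only for illustration.

```python
def celsius_to_fahrenheit(celsius):
    """Standard conversion: F = C * 9/5 + 32."""
    return celsius * 9 / 5 + 32

low_c, mid_c, high_c = 10, 20, 40
low_f, mid_f, high_f = (celsius_to_fahrenheit(t) for t in (low_c, mid_c, high_c))

# Ratios of values depend on where zero sits, so they change with the scale.
ratio_c = mid_c / low_c   # 20 / 10
ratio_f = mid_f / low_f   # 68 / 50

# Comparisons of intervals (differences) survive the change of scale.
interval_ratio_c = (high_c - mid_c) / (mid_c - low_c)
interval_ratio_f = (high_f - mid_f) / (mid_f - low_f)

print(f"Value ratio:    Celsius {ratio_c:.2f}, Fahrenheit {ratio_f:.2f}")
print(f"Interval ratio: Celsius {interval_ratio_c:.2f}, Fahrenheit {interval_ratio_f:.2f}")
```

The "twice as warm" claim holds in Celsius but disappears in Fahrenheit, which shows that it says something about the scale rather than about the temperature. The comparison of differences gives the same answer on both scales.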
The final category of measurement is ratio. Ratio data are like interval data except that there is a true zero. Examples include weight, height, speed, the number of children, number of classes, number of votes, calories, and grades. GPA is ratio data. Even though we do not observe a zero value for GPA, a value of zero is still meaningful. Ratio data may be discrete or continuous.
Another way to characterize data is by time period. When a dataset consists of observations from different individual units (e.g., people, businesses, and countries) in the same time period, we call that cross-sectional data. You can think of cross-sectional data as information taken from one single slice in time. US census data are cross-sectional since they consist of all individual households in a given year. The data in Table 1.1 are cross-sectional because they consist of characteristics of 10 students in the same undergraduate business statistics course.
Time-series data, on the other hand, track observations over time. Often, time-series data follow one single individual unit (e.g., person, business, and country) over a time period. For example, tracking the daily Dow Jones industrial average over a period of 10 years would constitute a time-series dataset. Each observation is a different point in time (e.g., day, month, year, and decade). Another example is a dataset tracking temporal changes in a single company's stock price. Climate scientists rely on time-series data to understand trends in the average temperature of the earth and how those measurements interact with carbon emissions.
It is often useful to plot time-series data using a line chart to get a feel for specific trends, cycles, or seasons. To illustrate, consider the dataset in Table 1.2. The dataset includes voting results for every American presidential election after World War II. The data include the year, the candidates' names by party, total votes for both the Democratic and Republican candidates, and aggregate votes. The dataset in Table 1.2 can be considered to be time-series data. Each observation is from a different year, and the individual units are unique pairs of Democratic and Republican presidential candidates.
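Because time-series observations are ordered in time, a natural first step is to compare each observation with the one before it. The sketch below does this for the 1948 and 1952 elections, using the vote totals (in millions) reported in Table 1.2. The variable names and the tuple layout are assumptions made for this illustration.

```python
# (year, Democratic vote, Republican vote, total vote), in millions,
# for the 1948 and 1952 rows of Table 1.2.
elections = [
    (1948, 24.11, 21.97, 46.07),
    (1952, 27.31, 33.78, 61.09),
]

# Time-series data must be ordered by time before comparing periods.
elections.sort(key=lambda row: row[0])

changes = []
for previous, current in zip(elections, elections[1:]):
    change = current[3] - previous[3]           # change in total votes
    percent_change = 100 * change / previous[3]
    changes.append((current[0], round(change, 2), round(percent_change, 1)))

for year, change, percent_change in changes:
    print(f"{year}: total vote changed by {change} million ({percent_change}%)")
```

The same loop would extend to later elections once their rows are added, and the resulting series of changes is one simple way to look for a trend before drawing a line chart.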
Table 1.2 American presidential election voting results (in millions) post World War II
Year   Democrat    Republican   Dem vote   Rep vote   Total vote
1948   Truman      Dewey        24.11      21.97      46.07
1952   Stevenson   Eisenhower   27.31      33.78      61.09
1956   Stevenson   Eisenhower
