Advanced Statistics with Applications in R fills the gap between several excellent theoretical statistics textbooks and many applied statistics books where teaching reduces to using existing packages. This book looks at what is under the hood. Many statistics issues, including the recent crisis with the p-value, are caused by misunderstanding of statistical concepts due to the poor theoretical background of practitioners and applied statisticians. This book is the product of forty years of experience in teaching probability and statistics and their applications for solving real-life problems. There are more than 442 examples in the book: basically every probability or statistics concept is illustrated with an example accompanied by R code. Many examples, such as "Who said π?", "What team is better?", "The fall of the Roman Empire", "James Bond chase problem", "Black Friday shopping", "Free fall equation: Aristotle or Galilei", and many others, are intriguing. These examples cover biostatistics, finance, physics and engineering, text and image analysis, epidemiology, spatial statistics, sociology, etc. Advanced Statistics with Applications in R teaches students to use theory for solving real-life problems through computations: there are about 500 R codes and 100 datasets. These data can be freely downloaded from the author's website dartmouth.edu/~eugened. This book is suitable as a text for senior undergraduate students majoring in statistics or data science, or for graduate students. Many researchers who apply statistics on a regular basis will find explanations of many fundamental concepts from the theoretical perspective, illustrated by concrete real-world applications.
Established by Walter A. Shewhart and Samuel S. Wilks
Editors: David J. Balding, Noel A. C. Cressie, Garrett M. Fitzmaurice, Geof H. Givens, Harvey Goldstein, Geert Molenberghs, David W. Scott, Adrian F. M. Smith, Ruey S. Tsay
Editors Emeriti: J. Stuart Hunter, Iain M. Johnstone, Joseph B. Kadane, Jozef L. Teugels
The Wiley Series in Probability and Statistics is well established and authoritative. It covers many topics of current research interest in both pure and applied statistics and probability theory. Written by leading statisticians and institutions, the titles span both state‐of‐the‐art developments in the field and classical methods.
Reflecting the wide range of current research in statistics, the series encompasses applied, methodological and theoretical statistics, ranging from applications and new techniques made possible by advances in computerized practice to rigorous treatment of theoretical approaches. This series provides essential and invaluable reading for all statisticians, whether in academia, industry, government, or research.
Eugene Demidenko
Dartmouth College
This edition first published 2020
© 2020 John Wiley & Sons Inc.
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.
The right of Eugene Demidenko to be identified as the author of this work has been asserted in accordance with law.
Registered Office
John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA
Editorial Office
111 River Street, Hoboken, NJ 07030, USA
For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.
Wiley also publishes its books in a variety of electronic formats and by print‐on‐demand. Some content that appears in standard print versions of this book may not be available in other formats.
Limit of Liability/Disclaimer of Warranty
While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials or promotional statements for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
Library of Congress Cataloging‐in‐Publication Data
Names: Demidenko, Eugene, 1948‐ author.
Title: Advanced statistics with applications in R / Eugene Demidenko
(Dartmouth College).
Description: Hoboken, NJ : Wiley, 2020. | Series: Wiley series in probability
and statistics | Includes bibliographical references and index. |
Identifiers: LCCN 2019015124 (print) | LCCN 2019019543 (ebook) | ISBN
9781118594131 (Adobe PDF) | ISBN 9781118594612 (ePub) | ISBN 9781118387986
(hardback)
Subjects: LCSH: Mathematical statistics‐Data processing‐Problems,
exercises, etc. | Statistics‐Data processing‐Problems, exercises, etc. |
R (Computer program language)
Classification: LCC QA276.45.R3 (ebook) | LCC QA276.45.R3 D4575 2019 (print)
| DDC 519.5‐dc23
LC record available at https://lccn.loc.gov/2019015124
Cover design by Wiley
Cover image: Courtesy of Eugene Demidenko
To my family
My favorite part of the recent American Statistical Association (ASA) statement on the p-value [103] is how it starts: "Why do so many people still use $p = 0.05$ as a threshold?" with the answer "Because that's what they were taught in college or grad school." Many problems in understanding and interpreting statistical inference, including the central statistical concept of the p-value, arise from the shortage of textbooks in statistics where theoretical and practical aspects of statistics fundamentals are put together. On the one hand, we have several excellent theoretical textbooks, including Casella and Berger [17], Schervish [87], and Shao [94], without a single real-life data example. On the other hand, there are numerous recipe-style statistics textbooks where theoretical considerations, assumptions, and explanations are minimized. This book fills that gap.
Statistical software has become so convenient and versatile these days that many use it without understanding the underlying principles. Unfortunately, R packages do not explain the algorithms and mathematics behind the computations, which greatly contributes to a superficial understanding that makes statistics look too easy. Many times, to my question "How did you compute this, what is the algorithm?" I hear the answer, "I found a program on the Internet." Hopefully, this book will break the unwanted trend of such statistics consumption.
I have often been confronted with a question that compares statistics with driving a car: "Why do we need to know how the car works?" Well, because statistics is not a car: the chance of a car breaking down is slim, but without a solid understanding of the statistical background and its implied limitations, starting with the wrong statistical analysis is almost guaranteed. In this book, we look at what is under the hood.
Each term I start my first class in statistics at Dartmouth with the following statement:
“Mathematics is the queen and statistics is the king of all sciences”
Indeed, mathematics is the idealistic model of the world: one line goes through a pair of points, the perimeter of a polygon inscribed in a circle converges to $2\pi r$ when the number of edges goes to infinity, etc. Statistics fills mathematics with life. Due to an unavoidable measurement error, one point turns into a cloud of points. How does one draw a line through two clouds of points? How does one measure $\pi$ in real life? This book starts with a motivating example, "Who said $\pi$?", in which I suggest measuring $\pi$ by taking the ratio of the perimeter of the tire to its diameter. To the surprise of many, the average ratio does not converge to $\pi$ even if the measurement error is very small. The reader will learn how this seemingly easy problem of estimating $\pi$ turns into a formidable statistical problem. Statistics is where the rubber meets the road. It is difficult to name a science where statistics is not used.
Examples are a big deal in this book (there are 442 of them). I follow the saying: "Examples are the expressway to knowledge." Only examples show how to use theory and how to solve a real-life problem. Too many theories remain unusable.
Today statistics is impossible without programming: that is why R is the language statisticians speak. The era of statistics textbooks with tables of distributions in an appendix is gone. Simulations are a big part of probability and statistics: they are used to set up a probabilistic model, test the analytical answer, and help us study small-sample properties. Although the speed of computations with the for loop has improved due to 64-bit computing, vectorized simulations are preferable, and many examples use this approach.
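As a small, self-contained illustration of this point (a toy example, not taken from the book; the die-rolling setting and all names are my own), here is a for-loop simulation next to its vectorized equivalent, both estimating the probability of at least one six in four rolls of a die:

set.seed(1)
nSim <- 1e5                          # number of simulated experiments
# for-loop version: one experiment per iteration
hits <- 0
for (i in 1:nSim)
  hits <- hits + any(sample(1:6, 4, replace = TRUE) == 6)
hits/nSim                            # estimated probability
# vectorized version: all nSim experiments as rows of one matrix
rolls <- matrix(sample(1:6, 4*nSim, replace = TRUE), ncol = 4)
mean(rowSums(rolls == 6) > 0)        # compare with the exact value 1 - (5/6)^4

Both versions return roughly the same estimate, but the vectorized one avoids the per-iteration interpreter overhead.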
Regarding the title of the book, "Advanced Statistics" is not about doing more mathematics but about an advanced understanding of statistical concepts from the perspective of applications. Statistics is an applied science, and this book is about statistics in action. Most theoretical considerations and concepts are either introduced through or applied to examples everybody understands, such as mortgage failure, an oil spill in the ocean, gender salary discrimination, the effect of a drug treatment, the cancer distribution in New Hampshire, etc.
I again turn the reader's attention to the p-value. This concept falls through the cracks of statistical science. I have seen many mathematical statisticians who work in the area of asymptotic expansions yet are incapable of explaining the p-value in layman's terms. I have seen many applied statisticians who mostly use existing statistical packages and describe the p-value incorrectly. The goal of this book is to rigorously explain statistical concepts, including the p-value, and illustrate them with concrete examples depending on the purpose of the statistical application (I suggest that an impatient reader jump to Section 7.10 and then Section 8.5). I emphasize the difference between parameter-based and individual-based statistical inference. While classical statistics is concerned with parameters, in real-life applications we are mostly concerned with individual prediction. For example, given a random sample of individual incomes in a town, classical statistics is concerned with estimation of the town's mean income (a phantom parameter) and the respective confidence interval, but often we are interested in a more practical question: in what range does the income of a randomly asked resident fall with a given probability? This distinction is a common theme of the book.
This book is intended for graduate students in statistics, although some sections are accessible to senior undergraduate statistics students with a solid mathematical background in multivariate calculus and linear algebra, along with some courses in elementary statistics and probability. I hope that researchers will also find this book useful for clarifying important statistical concepts.
I am indebted to Steve Quigley, former associate publisher at Wiley, for persuading me to sign the contract to write a textbook in statistics. Several people read parts of the book and made helpful comments: Senthil Girimurugan; my Dartmouth students James Brofos, Michael Downs, and Daniel Kang; and my colleagues Dan Rockmore, Zhigang Li, James O'Malley, and Todd MacKenzie, among others. I am thankful to the anonymous reviewers for their thoughts and corrections that improved the book. Finally, I am grateful to John Morris of Editide (http://www.editide.us/x) for his professional editorial service.
Data sets and R codes can be downloaded at my website:
www.dartmouth.edu/~eugened
I suggest that they be saved on the hard drive in the directory C:\StatBook\. The codes may be freely distributed and modified.
I would like to hear comments, suggestions and opinions from readers. Please e‐mail me at [email protected].
Eugene Demidenko
Dartmouth College
Hanover, New Hampshire
August 2019
Two types of random variables are distinguished: discrete and continuous. Theoretically, there may be a combination of these two types, but it is rare in practice. This chapter covers discrete distributions and the next chapter will cover continuous distributions.
In univariate calculus, a variable $x$ takes values on the real line, and we write $x \in \mathbb{R}$. In probability and statistics, we also deal with variables that take values in $\mathbb{R}$. Unlike calculus, we do not know exactly what value such a variable takes. Some values are more likely and some values are less likely. These variables are called random. The idea that there is uncertainty in what value the variable takes was uncomfortable for mathematicians at the dawn of the theory of probability, and many refused to recognize this theory as a mathematical discipline. To convey information about a random variable, we must specify its distribution and attach a probability or density to each value it takes. This is why the concepts of the distribution and density functions play a central role in probability theory and statistics. Once the density is specified, calculus turns into the principal tool for treatment.
Throughout the book, we use uppercase and lowercase letters with different meanings: $X$ denotes the random variable, and $x$ denotes a value it may take. Thus $X = x$ indicates the event that random variable $X$ takes value $x$. For example, we may ask what is the chance (probability) that $X$ takes value $x$; in mathematical terms, $\Pr(X = x)$. For a continuous random variable, we may be interested in the probability that the random variable takes values less than or equal to $x$, $\Pr(X \le x)$, or takes values from the interval $(a, b]$, $\Pr(a < X \le b)$.
A complete coverage of probability theory is beyond the scope of this book – rather, we aim to discuss only those features of probability theory that are useful in statistics. Readers interested in a more rigorous and comprehensive account of the theory of probability are referred to classic books by Feller [45] or Ross [83], among many others.
In the following example, we emphasize the difference between calculus, which assumes that the world is deterministic, and probability and statistics, which assume that the world is random. This difference may be striking.
Who said $\pi$? The ratio of a circle's circumference to its diameter is $\pi$. To test this fact, you may measure the circumferences of tires and their diameters from different cars and compute the ratios. Does the average ratio approach $\pi$ as the number of measured tires goes to infinity?
Perhaps to the reader's surprise, even if there is a slight measurement error in the diameter of a tire, the average of the empirically calculated $\pi$'s does not converge to the theoretical value of $\pi$; see Examples 3.36 and 6.126. In order to obtain a consistent estimator of $\pi$, we have to divide the sum of all circumferences by the sum of all diameters. This method is difficult to justify by standard mathematical reasoning because tires may come from different cars.
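The following minimal R sketch mimics this tire experiment (the diameter range, the error size, and the variable names are my assumptions, not the book's code, whose treatment is in Examples 3.36 and 6.126). It shows the average of ratios overshooting $\pi$ while the ratio of sums stays on target:

set.seed(123)
n <- 1e6                              # number of measured tires
D <- runif(n, 0.5, 0.8)               # true tire diameters in meters (assumed range)
d <- D + rnorm(n, mean = 0, sd = 0.05)  # measured diameters with a slight error (assumed sd)
C <- pi*D                             # circumferences, here assumed measured exactly
mean(C/d)                             # average of ratios: does NOT converge to pi
sum(C)/sum(d)                         # ratio of sums: a consistent estimator of pi

The bias of the average of ratios comes from averaging a nonlinear (reciprocal) function of the noisy diameter, which is exactly why the seemingly obvious estimator fails.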
This example amplifies the difference between calculus and probability and statistics. The former works in an ideal environment: no measurement error, a unique line goes through two points, etc. However, the world we live in is not perfect: measurements do not produce exactly the theoretically expected result, points do not fall on a straight line, people answer the same question differently, some patients given the same drug recover and some do not, etc. All laws of physics, including Newton's free fall formula (see Example 9.4), do not exactly match empirical data. To what extent can the mismatch be ignored? Do measurements confirm the law? Does Newton's theory hold? These questions cannot be answered without assuming that the measurements made (basically all data) are intrinsically random. That is why statistics is needed every time data are analyzed.
The Bernoulli random variable is the simplest random variable, with two outcomes such as yes and no, sometimes referred to as success and failure. Nevertheless, this variable is a building block of all probability theory (this will be explained later when the central limit theorem is introduced).
Generally, we divide discrete random variables into two groups with respect to how we treat the values they take:

Cardinal (or numerical). These variables take numeric values; therefore they can be compared (the inequality $x_1 < x_2$ is meaningful), and arithmetic is allowed. Examples of cardinal discrete random variables include the number of children in the family and the number of successes in a series of independent Bernoulli experiments (the binomial random variable). If a random variable takes values 0, 1, and 2, their average $(0+1+2)/3 = 1$ makes sense; the arithmetic mean is meaningful for cardinal random variables.

Nominal (or categorical). These variables take values that are not numeric but merely indicate a name/label or a state (category). For example, if we are talking about three categories, we may use quotes "1," "2," or "3" if names are not provided. An example of a nominal discrete random variable is the preference of a car shopper among car models "Volvo," "Jeep," "VW," etc. Although the probabilities for each category can be specified, milestone probability concepts such as the mean and the cumulative distribution function make no sense. Typically, we will be dealing with cardinal random variables. Formally, the Bernoulli random variable is nominal, but with only two outcomes we may safely code yes as 1 and no as 0. Then the average of Bernoulli outcomes is interpreted as the proportion of yes outcomes (see the short R illustration after this list).

Variables may take a finite or an infinite number of values. An example of a discrete random variable that may take an infinite number of values is the Poisson random variable, discussed in Section 1.7. Sometimes it is convenient to assume that a variable takes an infinite number of values even in cases when the number of values is bounded, such as in the case of the number of children per family.
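A short R illustration of the cardinal/nominal distinction (the data are made up for this sketch, not taken from the book):

children <- c(0, 2, 1, 3, 2)          # cardinal: arithmetic makes sense
mean(children)                        # average number of children = 1.6
car <- factor(c("Volvo", "Jeep", "VW", "Jeep"))  # nominal: labels only
table(car)                            # counts per category are meaningful; mean(car) is not
tennis <- c(1, 0, 1, 1, 0)            # Bernoulli outcomes coded yes = 1, no = 0
mean(tennis)                          # proportion of yes answers = 0.6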
An example of a binary (or dichotomous) random variable is the answer to a question such as "Do you play tennis?" (it is assumed that there are only two answers, yes and no). As was noted earlier, without loss of generality, we can encode yes as 1 and no as 0. If $X$ codes the answer, we cannot predict the answer; that is why $X$ is a random variable. The key property of $X$ is the probability $p$ that a randomly asked person plays tennis (clearly, the probability that a randomly asked person does not play tennis is complementary). Mathematically, we write $p = \Pr(X = 1)$. The distribution of a binary random variable is completely specified by $p$. An immediate application of the probability $p$ is that, assuming that a given community consists of $N$ people, we can estimate the number of tennis players as $N \times p$.
We refer to this kind of binary variable as a Bernoulli random variable, named after the Swiss mathematician Jacob Bernoulli (1654-1705). We often denote $p = \Pr(X = 1)$ (with complementary probability $q = \Pr(X = 0)$), so that $p + q = 1$. A compact way to write down the Bernoulli probability of possible outcomes is

$\Pr(X = x) = p^x (1 - p)^{1 - x}, \qquad (1.1)$

where $x$ takes fixed values, 1 or 0. This expression is useful for deriving the likelihood function that will be used later in the statistics part of the book.
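As a quick sanity check (a sketch; the value $p = 0.3$ is an arbitrary choice), formula (1.1) can be evaluated in R and compared with the built-in Bernoulli density, dbinom with size = 1:

p <- 0.3                              # an arbitrary Bernoulli probability
x <- c(0, 1)                          # the two possible outcomes
p^x*(1 - p)^(1 - x)                   # formula (1.1): returns 1 - p and p
dbinom(x, size = 1, prob = p)         # the same probabilities from R's built-in density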
The next example applies the Bernoulli random variable to a real‐world problem.
Safe driving. Fred is a safe driver: he has a 1/10 chance each year of getting a traffic ticket. Is it true that he will get at least one traffic ticket over 20 years of driving?
Solution. Many people say yes. Indeed, since the probability for one year is 1/10, the naive computation $20 \times 1/10 = 2$ suggests that the probability of getting a traffic ticket over 20 years is more than 1, and some people would conclude that he will definitely get a ticket. First, this naive computation is suspicious: how can a probability be greater than 1? Second, if he is lucky, he may never get a ticket over 20 years, because getting a ticket during one year is just a probability, and the event may never occur this year, the next year, etc. To find the probability that Fred gets at least one ticket, we use the method of complementary probability and find the probability that Fred gets no ticket over 20 years. Since the probability of getting no ticket each year is $1 - 0.1 = 0.9$, the probability of getting no tickets over 20 years is $0.9^{20} \simeq 0.12$. Finally, the probability that Fred gets at least one ticket over 20 years is $1 - 0.9^{20} \simeq 0.88$. In other words, the probability of being ticket-free over 20 years is greater than 10%. This is a fun problem, and yet it reflects an important phenomenon in our life: things may happen at random, and scientific experiments may not be reproducible, with positive probability.
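A short R computation of this example, with a simulation check (the number of simulations and the seed are arbitrary choices of this sketch):

p <- 0.1; years <- 20                 # yearly ticket probability and driving horizon
(1 - p)^years                         # probability of staying ticket-free, about 0.12
1 - (1 - p)^years                     # probability of at least one ticket, about 0.88
set.seed(1)
nSim <- 1e5                           # number of simulated 20-year driving careers
tickets <- rbinom(nSim, size = years, prob = p)  # tickets per simulated career
mean(tickets >= 1)                    # simulated proportion with at least one ticket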
Check formula (1.1) by examination. [Hint: Evaluate the formula at $x = 1$ and $x = 0$.]
Demonstrate that the naive answer in Example 1.2 can be supported by the approximation formula $1 - (1 - p)^n \simeq np$ for small $p$ and $np$. (a) Derive this approximation using L'Hôpital's rule, and (b) apply it to the probability of getting at least one ticket.
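For readers who want a quick numeric look before solving the exercise, the following sketch tabulates the exact probability against the approximation for several values of $p$ with $n = 20$:

p <- c(0.001, 0.01, 0.1)              # small to not-so-small yearly probabilities
n <- 20
cbind(p = p, exact = 1 - (1 - p)^n, approx = n*p)
# for p = 0.1 the approximation gives 2, exceeding 1: the naive answer of Example 1.2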
Provide an argument for the infinite monkey theorem: a monkey hitting keys at random on a computer keyboard for an infinite amount of time will almost surely type a given text, such as "Hamlet" by William Shakespeare (make the necessary assumptions). [Hint: The probability of typing the text starting from any hit is the same and positive; then follow Example 1.2.]
Classical probability theory uses cardinal (numeric) variables: these variables take numeric values that can be ordered and manipulated using arithmetic operations such as summation. For a discrete numeric random variable, we must specify the probability for each unique outcome it takes. It is convenient to use a table to specify its distribution as follows.
Value of $X$:   $x_1$   $x_2$   $\cdots$   $x_n$
Probability:    $p_1$   $p_2$   $\cdots$   $p_n$

It is assumed that $x_1, x_2, \ldots, x_n$ are all different and the events $\{X = x_i\}$ are mutually exclusive; sometimes the set $\{x_1, x_2, \ldots, x_n\}$ is called the sample space and a particular $x_i$ the outcome or elementary event. Without loss of generality, we will always assume that the values are in ascending order, $x_1 < x_2 < \cdots < x_n$. Indeed, if some $x_i$ are the same, we sum the respective probabilities. As follows from this table, $X$ may take values $x_1, x_2, \ldots, x_n$ with probabilities $p_1, p_2, \ldots, p_n$, sometimes referred to as the probability mass function (pmf). Since $p_1, p_2, \ldots, p_n$ are probabilities and $\{x_1, x_2, \ldots, x_n\}$ is an exhaustive set of values, we have

$\sum_{i=1}^{n} p_i = 1, \qquad p_i \ge 0.$
For $n = 2$, a categorical random variable can be interpreted as a Bernoulli random variable. An example of a categorical random variable with more than two outcomes is a voter's choice in an election, assuming that there are three or more candidates. This is not a cardinal random variable: the categories cannot be arranged in a meaningful order, and arithmetic operations do not apply.
An example of a discrete random variable that may take any nonnegative integer value, at least hypothetically, is the number of children in a family. Although practically this variable is bounded (for instance, one may say that the number of children is less than 100), it is convenient to assume that the number of children is unlimited. It is customary to prefer convenience over rigor in statistical applications.
Sometimes we want to know the probability that a random variable takes a value less than or equal to $x$. This leads to the concept of the cumulative distribution function (cdf).

The cumulative distribution function is defined as $F(x) = \Pr(X \le x)$.

The cdf is a step-wise increasing function; the steps are at $x_1, x_2, \ldots, x_n$. The cdf is convenient for finding the probability of an interval event. For example,

$\Pr(a < X \le b) = F(b) - F(a),$

where $a$ and $b$ are fixed numbers ($a < b$). We will discuss computation of the cdf in R for some specific discrete random variables later in this chapter.
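A small R sketch with a made-up discrete distribution (the values and probabilities are illustrative only) shows the pmf, the cdf via cumsum, and an interval probability computed through the cdf:

x <- c(1, 2, 3, 5)                    # distinct values in ascending order
pr <- c(0.2, 0.3, 0.4, 0.1)           # pmf; the probabilities must sum to 1
sum(pr)                               # check: equals 1
Fx <- cumsum(pr)                      # cdf at each x: Pr(X <= x)
Fx
Fx[x == 3] - Fx[x == 1]               # Pr(1 < X <= 3) = 0.3 + 0.4 = 0.7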