Solve real-world statistical problems using the most popular R packages and techniques
R is a popular programming language for developing statistical software. This book will be a useful guide to solving common and not-so-common challenges in statistics. With this book, you'll be equipped to confidently perform essential statistical procedures across your organization with the help of cutting-edge statistical tools.
You'll start by implementing data modeling, data analysis, and machine learning to solve real-world problems. You'll then understand how to work with nonparametric methods, mixed effects models, and hidden Markov models. This book contains recipes that will guide you in performing univariate and multivariate hypothesis tests, applying several regression techniques, and using robust techniques to minimize the impact of outliers in data. You'll also learn how to use the caret package for performing machine learning in R. Furthermore, this book will help you understand how to interpret charts and plots to get insights for better decision making.
By the end of this book, you will be able to apply your skills to statistical computations using R 3.5. You will also become well-versed with a wide array of statistical techniques in R that are extensively used in the data science industry.
If you are a quantitative researcher, statistician, data analyst, or data scientist looking to tackle various challenges in statistics, this book is what you need! Proficiency in R programming and basic knowledge of linear algebra are necessary to follow along with the recipes covered in this book.
Copyright © 2019 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Commissioning Editor: Pravin Dhandre
Acquisition Editor: Devika Battike
Content Development Editor: Athikho Sapuni Rishana
Technical Editor: Utkarsha S. Kadam
Copy Editor: Safis Editing
Project Coordinator: Kirti Pisat
Proofreader: Safis Editing
Indexer: Priyanka Dhadke
Graphics: Jisha Chirayil
Production Coordinator: Arvindkumar Gupta
First published: March 2019
Production reference: 1280319
Published by Packt Publishing Ltd. Livery Place 35 Livery Street Birmingham B3 2PB, UK.
ISBN 978-1-78980-256-6
www.packtpub.com
Mapt is an online digital library that gives you full access to over 5,000 books and videos, as well as industry-leading tools to help you plan your personal development and advance your career. For more information, please visit our website.
Spend less time learning and more time coding with practical eBooks and Videos from over 4,000 industry professionals
Improve your learning with Skill Plans built especially for you
Get a free eBook or video every month
Mapt is fully searchable
Copy and paste, print, and bookmark content
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.packt.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.
At www.packt.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.
Francisco Juretig has worked for over a decade in a variety of industries such as retail, gambling and finance deploying data-science solutions. He has written several R packages, and is a frequent contributor to the open source community.
Davor Lozic is a senior software engineer interested in a variety of subjects, in particular computer security, algorithms, and data structures. He manages teams of more than 15 engineers and, as a professor, teaches database systems. You can contact him at [email protected]. He likes cats! If you want to talk about any aspect of technology, or if you have funny pictures of cats, feel free to contact him.
If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.
Title Page
Copyright and Credits
R Statistics Cookbook
About Packt
Why subscribe?
Packt.com
Contributors
About the author
About the reviewer
Packt is searching for authors like you
Preface
Who this book is for
What this book covers
To get the most out of this book
Download the example code files
Download the color images
Conventions used
Sections
Getting ready
How to do it…
How it works…
There's more…
See also
Get in touch
Reviews
Getting Started with R and Statistics
Introduction
Technical requirements
Maximum likelihood estimation
Getting ready
How to do it...
How it works...
There's more...
See also
Calculating densities, quantiles, and CDFs
Getting ready
How to do it...
How it works...
There's more...
Creating barplots using ggplot
Getting ready
How to do it...
How it works...
There's more...
See also
Generating random numbers from multiple distributions
Getting ready
How to do it...
How it works...
There's more...
Complex data processing with dplyr
Getting ready
How to do it...
How it works...
There's more...
See also
3D visualization with the plot3d package
Getting ready
How to do it...
How it works...
Formatting tabular data with the formattable package
Getting ready
How to do it...
How it works...
There's more...
Simple random sampling
Getting ready
How to do it...
How it works...
Creating diagrams via the DiagrammeR package
Getting ready
How to do it...
How it works...
See also
C++ in R via the Rcpp package
Getting ready
How to do it...
How it works...
See also
Interactive plots with the ggplot GUI package
Getting ready
How to do it...
How it works...
There's more...
Animations with the gganimate package
Getting ready
How to do it...
How it works...
See also
Using R6 classes
Getting ready
How to do it...
How it works...
There's more...
Modeling sequences with the TraMineR package
Getting ready
How to do it...
How it works...
There's more...
Clustering sequences with the TraMineR package
Getting ready
How to do it...
How it works...
There's more...
Displaying geographical data with the leaflet package
Getting ready
How to do it...
How it works...
Univariate and Multivariate Tests for Equality of Means
Introduction
The univariate t-test
Getting ready
How to do it...
How it works...
There's more...
The Fisher-Behrens problem
How to do it...
How it works...
There's more...
Paired t-test
How to do it...
How it works...
There's more...
Calculating ANOVA sum of squares and F tests
How to do it...
Two-way ANOVA
How to do it...
How it works...
There's more...
Type I, Type II, and Type III sum of squares
Type I
Type II
Type III
Getting ready
How to do it...
How it works...
Random effects
Getting ready
How to do it...
How it works...
There's more...
Repeated measures
Getting ready
How to do it...
How it works...
There's more...
Multivariate t-test
Getting ready
How to do it...
How it works...
There's more...
MANOVA
Getting ready
How to do it...
How it works...
There's more...
Linear Regression
Introduction
Computing ordinary least squares estimates 
How to do it...
How it works...
Reporting results with the sjPlot package 
Getting ready 
How to do it...
How it works...
There's more...
Finding correlation between the features 
Getting ready
How to do it... 
Testing hypothesis 
Getting ready 
How to do it... 
How it works... 
Testing homoscedasticity 
Getting ready 
How to do it... 
How it works...
Implementing sandwich estimators 
Getting ready 
How to do it... 
How it works...
Variable selection 
Getting ready 
How to do it... 
How it works... 
Ridge regression 
Getting ready 
How to do it... 
How it works... 
Working with LASSO 
Getting ready 
How to do it...
How it works...
There's more...
Leverage, residuals, and influence 
Getting ready 
How to do it...
How it works... 
Bayesian Regression
Introduction
Getting the posterior density in STAN 
Getting ready
How to do it...
How it works...
Formulating a linear regression model
Getting ready
How to do it...
How it works...
There's more...
Assigning the priors
Defining the support
How to decide the parameters for a prior
Getting ready
How to do it...
How it works...
Doing MCMC the manual way
Getting ready
How to do it...
How it works...
Evaluating convergence with CODA
One or multiple chains?
Getting ready
How to do it...
How it works...
There's more...
Bayesian variable selection
Getting ready
How to do it...
How it works...
There's more...
See also
Using a model for prediction
Getting ready
How to do it...
How it works...
GLMs in JAGS
Getting ready
How to do it...
How it works...
Nonparametric Methods
Introduction
The Mann-Whitney test
How to do it...
How it works...
There's more...
Estimating nonparametric ANOVA
Getting ready
How to do it...
How it works...
The Spearman's rank correlation test
How to do it...
How it works...
There's more...
LOESS regression
Getting ready
How to do it...
How it works...
There's more...
Finding the best transformations via the acepack package
Getting ready
How to do it...
How it works...
There's more...
Nonparametric multivariate tests using the npmv package
Getting ready
How to do it...
How it works...
Semiparametric regression with the SemiPar package
Getting ready
How to do it...
How it works...
There's more...
Robust Methods
Introduction
Robust linear regression
Getting ready
How to do it...
How it works...
Estimating robust covariance matrices
Getting ready
How to do it...
How it works...
Robust logistic regression
Getting ready
How to do it...
How it works...
Robust ANOVA using the robust package
Getting ready
How to do it...
How it works...
Robust principal components
Getting ready
How to do it...
How it works...
Robust Gaussian mixture models with the qclust package
Getting ready
How to do it...
How it works...
Robust clustering
Getting ready
How to do it...
How it works...
Time Series Analysis
Introduction
The general ARIMA model 
Getting ready
How to do it...
How it works...
Seasonality and SARIMAX models 
Getting ready
How to do it...
There's more...
Choosing the best model with the forecast package 
Getting ready
How to do it...  
How it works... 
Vector autoregressions (VARs)  
Getting ready
How to do it...  
How it works... 
Facebook's automatic Prophet forecasting  
Getting ready
How to do it...  
How it works...
There's more...
Modeling count temporal data 
Getting ready
How to do it... 
There's more...
Imputing missing values in time series  
Getting ready
How to do it...
How it works... 
There's more... 
Anomaly detection 
Getting ready
How to do it... 
How it works... 
There's more... 
Spectral decomposition of time series 
Getting ready
How to do it... 
How it works... 
Mixed Effects Models
Introduction
The standard model and ANOVA 
Getting ready
How to do it...
How it works... 
Some useful plots for mixed effects models 
Getting ready
How to do it... 
There's more... 
Nonlinear mixed effects models 
Getting ready
How to do it... 
How it works... 
There's more... 
Crossed and nested designs 
Crossed design 
Nested design 
Getting ready 
How to do it... 
How it works...
Robust mixed effects models with robustlmm 
Getting ready
How to do it... 
How it works... 
Choosing the best linear mixed model
Getting ready 
How to do it... 
How it works... 
Mixed generalized linear models 
Getting ready
How to do it... 
How it works... 
There's more...
Predictive Models Using the Caret Package
Introduction
Data splitting and general model fitting
Getting ready
How to do it...
How it works...
There's more...
See also
Preprocessing
Getting ready
How to do it...
How it works...
Variable importance and feature selection
Getting ready
How to do it...
How it works...
Model tuning
Getting ready
How to do it...
How it works...
Classification in caret and ROC curves
Getting ready
How to do it...
How it works...
Gradient boosting and class imbalance
Getting ready
How to do it...
How it works...
Lasso, ridge, and elasticnet in caret
Getting ready
How to do it...
How it works...
Logic regression
Getting ready
How to do it...
How it works...
Bayesian Networks and Hidden Markov Models
Introduction
A discrete Bayesian network via bnlearn
Getting ready
How to do it...
How it works...
There's more...
See also
Conditional independence tests
Getting ready
How to do it...
How it works...
There's more...
Continuous and hybrid Bayesian networks via bnlearn
Getting ready
How to do it...
How it works...
See also
Interactive visualization of BNs with the bnviewer package
Getting ready
How to do it...
How it works...
An introductory hidden Markov model
Getting ready
How to do it...
How it works...
There's more...
Regime switching in financial data via HMM
Getting ready
How to do it...
How it works...
There's more...
Other Books You May Enjoy
Leave a review - let other readers know what you think
R is a popular programming language for developing statistical software. This book will be a useful guide to solving common and not-so-common challenges in statistics. With this book, you'll be equipped to confidently perform essential statistical procedures across your organization with the help of cutting-edge statistical tools. You'll start by implementing data modeling, data analysis, and machine learning to solve real-world problems. You'll then understand how to work with nonparametric methods, mixed effects models, and hidden Markov models. This book contains recipes that will guide you in performing univariate and multivariate hypothesis tests, applying several regression techniques, and using robust techniques to minimize the impact of outliers in data. You'll also learn how to use the caret package for performing machine learning in R. Furthermore, this book will help you understand how to interpret charts and plots to get insights for better decision making. By the end of this book, you will be able to apply your skills to statistical computations using R 3.5. You will also become well-versed with a wide array of statistical techniques in R that are extensively used in the data science industry.
If you are a quantitative researcher, statistician, data analyst, or data scientist looking to tackle common and not-so-common challenges in statistics, then this book is what you need! A solid understanding of R programming and a basic understanding of statistics and linear algebra are required to follow along with the recipes.
Chapter 1, Getting Started with R and Statistics, reviews a variety of techniques in R for performing data processing, data analysis, and plotting. We will also explain how to work with some basic statistical techniques, such as sampling, maximum likelihood estimation, and random number generation. In addition, we will present some useful coding techniques, such as C++ functions using Rcpp, and R6Classes. The former will allow us to add high-performance compiled code, whereas the latter will allow us to perform object-oriented programming in R.
Chapter 2, Univariate and Multivariate Tests for Equality of Means, explains how to answer the most basic statistical question: do two (or possibly more) populations have the same mean? This arises when we want to evaluate whether a certain treatment or policy is effective compared to a baseline effect. This can naturally be extended to multiple groups, and the technique used for this is called Analysis of Variance (ANOVA). ANOVA can itself be extended to accommodate multiple effects; for example, testing whether the background color of a website and the font style drive sales up. This is known as two-way ANOVA, and leads to additional complications: not only do we have multiple effects to estimate, but we could also have interaction effects between them (for example, a certain background color could be effective when used in conjunction with a specific font type). ANOVA can also be extended in other dimensions, such as adding random effects (effects that originate from a large population and for which we don't want to estimate individual parameters) or repeated measures for each observation.
A different problem arises when we have multiple variables, instead of a single one, that we want to compare across two or more groups. In this case, we are generalizing the t-test and ANOVA to a multi-dimensional case: for the former (two groups), the technique that we use is called Hotelling's t-test, and for the latter (more than two groups), the technique is MANOVA (multivariate ANOVA). We will review how to use all these techniques in R.
Chapter 3, Linear Regression, deals with the most important tool in statistics. It can be used in almost any situation where we want to predict a numeric variable in terms of lots of independent ones. As its name implies, the assumption is that there is a linear relationship between the covariates and the target. In this chapter, we will review how to formulate these models with a special focus on ordinary least squares (the most widely used algorithm for linear regression).
Chapter 4, Bayesian Regression, explains how to work with regression in a Bayesian context. Hitherto, we have assumed that there are some fixed parameters behind the data generation process (for t-tests, we assume that there are fixed means for each group), and because of sample variability, we will observe minor deviations from them. The Bayesian approach is radically different, and founded on a different methodological and epistemological foundation. The idea is that coefficients are not fixed quantities that we want to draw inferences upon, but random variables themselves.
The idea is that given a prior density (prior belief that we have) for each coefficient, we want to augment these priors using the data, in order to arrive at a posterior density. For example, if we think a person always arrives on time (this would be a prior), and we observe that this person arrived late on 8 out of 10 occasions, we should update our initial expectation accordingly. Unfortunately, Bayesian models do not generate closed formed expressions (in most practical cases), so they can't be solved easily. We will need to use sophisticated techniques to estimate these posterior densities: the tool that is used most frequently for this purpose is MCMC (Markov chain Monte Carlo). We will review how to formulate models using the best packages available: JAGS and STAN.
Chapter 5, Nonparametric Methods, explains how classical methods rely on the assumption that there is an underlying distribution (usually a Gaussian one), and derive tests for each case. For instance, the underlying assumption in t-tests is that the data originates from two Gaussian populations with the same variance. In general, these assumptions do make sense, and even when they are not met, the violations become less relevant in large samples (for example, the t-test works well for large samples even when the normality assumption is violated). But what can we do when we are working with small samples, or in cases where the distributional assumption clearly does not hold? Nonparametric methods are designed to work with no distributional assumptions by using a series of smart tricks that depend on each particular case. When the data does follow the assumed distribution (for example, normality for t-tests), they work almost as well as the parametric ones, and when the data does not follow that distribution, they still work anyway. We will use a variety of nonparametric tools for regression, ANOVA, and many more.
Chapter 6, Robust Methods, explains why classical methods don't work well in the presence of outliers. On the other hand, robust methods are designed to intelligently flag abnormal observations, and estimate the appropriate coefficients in the presence of contamination. In this chapter, we will review some of the most frequently used robust techniques for regression, classification, ANOVA, and clustering.
Chapter 7, Time Series Analysis, describes how to work with time series (sequences of observations indexed by time). Although there are several ways of modeling them, the most widely used framework is called ARIMA. The idea is to decompose the series into the sum of deterministic and stochastic components in such a way that the past is used to predict the future of the series. It has been established that these techniques work really well with actual data but, unfortunately, they do require a lot of manual work. In this chapter, we will present several ARIMA techniques, demonstrating how to extend them to multivariate data, how to impute missing values on the series, how to detect outliers, and how to use several automatic packages that build the best model for us.
Chapter 8, Mixed Effects Models, introduces mixed effects models. These models arise when we mix fixed and random effects. Fixed effects (the ones we have used so far except for Chapter 4, Bayesian Regression) are treated as fixed parameters that are estimated. For example, if we model the sales of a product in terms of a particular month, each month will have a distinct parameter (this would be a fixed effect). On the other hand, if we were measuring whether a drug is useful for certain patients, and we had multiple observations per patient, we might want to keep a patient effect but not a coefficient for each patient. If we had 2,000 patients, those coefficients would be unmanageable and at the same time, would be introducing a lot of imprecision to our model. A neater approach would be to treat the patient effect as random: we would assume that each patient receives a random shock, and all observations belonging to the same patient will be correlated.
In this chapter, we will work with these models using the lme4 package (in particular, its lmer function), and we will extend them to nonlinear mixed effects models (when the response is nonlinear). The main problem for these models (both linear and nonlinear) is that the degrees of freedom are unknown, rendering the usual tests useless.
Chapter 9, Predictive Models Using the Caret Package, describes how to use the caret package, which is the fundamental workhorse for predictive modeling in R (some of these models have already been presented in previous chapters). It provides a consistent syntax and a unified approach for building a variety of models. In addition, it has great tools for performing preprocessing and feature selection. In this chapter, we present several models in caret, such as random forests, gradient boosting, and LASSO.
Chapter 10, Bayesian Networks and Hidden Markov Models, describes how, in some cases, we might want to model a network of relationships in such a way that we can understand how the variables are connected. For example, the office location might make employees happier, and also make them arrive earlier to work: the two combined effects might make them perform better. If they perform better, they will receive better bonuses; actually, the bonuses will be dependent on those two variables directly, and also on the office location indirectly. Bayesian networks allow us to perform complex network modeling, and the main tool used for this is the bnlearn package. Another advanced statistical tool is the hidden Markov model, which allows us to estimate the state of unobserved variables using very complex computational machinery. In this chapter, we will work through two examples using hidden Markov models.
Users should have some familiarity with statistics and programming. Some general knowledge of probability, regression, and data analysis is recommended.
R is required for this book, and RStudio is highly recommended. All the packages used throughout this book can be installed following the instructions for each recipe.
You can download the example code files for this book from your account at www.packt.com. If you purchased this book elsewhere, you can visit www.packt.com/support and register to have the files emailed directly to you.
You can download the code files by following these steps:
1. Log in or register at www.packt.com.
2. Select the SUPPORT tab.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box and follow the onscreen instructions.
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
WinRAR/7-Zip for Windows
Zipeg/iZip/UnRarX for Mac
7-Zip/PeaZip for Linux
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/R_Statistics_Cookbook. In case there's an update to the code, it will be updated on the existing GitHub repository.
We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: https://www.packtpub.com/sites/default/files/downloads/9781789802566_ColorImages.pdf.
There are a number of text conventions used throughout this book.
CodeInText: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "Import the ggplot2 and reshape libraries."
A block of code is set as follows:
library(bbmle)
N <- 1000
xx <- rgamma(N, shape = 20, rate = 2)
Any command-line input or output is written as follows:
> install.packages("tscount")
Bold: Indicates a new term, an important word, or words that you see on screen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: "The cumulative density function (CDF) returns the cumulative probability mass for each value of X."
In this book, you will find several headings that appear frequently (Getting ready, How to do it..., How it works..., There's more..., and See also).
To give clear instructions on how to complete a recipe, we use these sections as follows:
This section tells you what to expect in the recipe and describes how to set up any software or any preliminary settings required for the recipe.
This section contains the steps required to follow the recipe.
This section usually consists of a detailed explanation of what happened in the previous section.
This section consists of additional information about the recipe in order to make you more knowledgeable about the recipe.
This section provides helpful links to other useful information for the recipe.
Feedback from our readers is always welcome.
General feedback: If you have questions about any aspect of this book, mention the book title in the subject of your message and email us at [email protected].
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packt.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details.
Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in, and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.
Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!
For more information about Packt, please visit packt.com.
In this chapter, we will cover the following recipes:
Maximum likelihood estimation
Calculating densities, quantiles, and CDFs
Creating barplots using ggplot
Generating random numbers from multiple distributions
Complex data processing with dplyr
3D visualization with the plot3d package
Formatting tabular data with the formattable package
Simple random sampling
Creating diagrams via the DiagrammeR package
C++ in R via the Rcpp package
Interactive plots with the ggplot GUI package
Animations with the gganimate package
Using R6 classes
Modeling sequences with the TraMineR package
Clustering sequences with the TraMineR package
Displaying geographical data with the leaflet package
In this chapter, we will introduce a wide array of topics regarding statistics and data analysis in R. We will use quite a diverse set of packages, most of which have been released over recent years.
We'll start by generating random numbers, fitting distributions to data, and using several packages to plot data. We will then move on to sampling, creating diagrams with the DiagrammeR package, and analyzing sequence data with the TraMineR package. We also present several techniques that are not strictly related to statistics, but are important for dealing with advanced methods in R: we introduce the Rcpp package (used for embedding highly efficient C++ code into your R scripts) and the R6 package (used for operating with R6 classes, allowing you to code using an object-oriented approach in R).
We will use R and a number of its packages, which can be installed via the install.packages() command; we will indicate which ones are necessary for each recipe in the corresponding Getting ready section.
Suppose we observe a hundred roulette spins, and we get red 30 times and black 70 times. We can start by assuming that the probability of getting red is 0.5 (and black is obviously 0.5). This is certainly not a very good idea, because if that were the case, we should have seen red nearly 50 times and black nearly 50 times, but we did not. It is thus evident that a more reasonable assumption would have been a probability of 0.3 for red (and thus 0.7 for black).
The principle of maximum likelihood establishes that, given the data, we can formulate a model and tweak its parameters to maximize the probability (likelihood) of having observed what we did observe. Additionally, maximum likelihood allows us to calculate the precision (standard error) of each estimated coefficient easily. These standard errors are obtained from the curvature of the log-likelihood, that is, from its second-order derivatives with respect to each parameter.
The likelihood is essentially a probability composed of the multiplication of several probabilities. Multiplying lots of probabilities is never a good idea, because if the probabilities are small, we would very likely end up with a very small number. If that number is too small, then the computer won't be able to represent it accurately. Therefore, what we end up using is the log-likelihood, which is the sum of the logarithms of those probabilities.
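To see the problem concretely, here is a tiny sketch (the numbers are purely illustrative and not part of any recipe):

p <- rep(0.01, 200)
prod(p)      # the true product, 1e-400, underflows double precision and prints 0
sum(log(p))  # -921.034; the log-likelihood is perfectly representable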
In many situations, we also want to know if the coefficients are statistically different from zero. Imagine we have a sample of growth rates for many companies for a particular year, and we want to use the average as an indicator of whether the economy is growing or not. In other words, we want to test whether the mean is equal to zero or not. We could fit that distribution of growth rates to a Gaussian distribution (which has two parameters, μ and σ), and test whether the estimated μ is statistically equal to zero (in a Gaussian distribution, the mean is μ). When doing hypothesis testing, we need to specify a null hypothesis and an alternative one. For this case, the null hypothesis is that this parameter is equal to zero. Intuition would tell us that if an estimated parameter is large, we can reject the null hypothesis. The problem is that we need to define what large is. This is why we don't use the estimated coefficients, but a statistic called the Z value: this is defined as the value that we observed divided by the standard error. It can be proven that these Z values are distributed according to a Gaussian distribution.
So, once we have the Z value statistic, how can we reject or not reject the null hypothesis? Assuming that the null hypothesis is true (that the coefficient is equal to zero), we can compute the probability that we get a test statistic as large or larger than the one we got (these are known as p-values). Remember that we assume that the coefficients have fixed values, but we will observe random deviations from them in our samples (we actually have one sample). If the probability of finding them to be as large as the ones that we observed is small, assuming that the true ones are zero, then that implies that luck alone can't explain the coefficients that we got. The final conclusion in that case is to reject the null hypothesis and conclude that the coefficient is different from zero.
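The following minimal sketch shows this computation, assuming a hypothetical estimate and standard error (both numbers are invented for illustration):

estimate <- 0.8                      # hypothetical estimated mean growth rate
std_error <- 0.3                     # hypothetical standard error of that estimate
z_value <- estimate / std_error      # the observed value divided by its standard error
p_value <- 2 * pnorm(-abs(z_value))  # two-sided p-value under the null of a zero mean
p_value                              # roughly 0.008, so we would reject the null at the 5% level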
The bbmle package can be installed using the install.packages("bbmle") function in R.
The LL function wraps the (negative) log-likelihood computation, and is called repeatedly by the mle2 function, which uses a derivative-based algorithm to find the maximum of the log-likelihood.
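The overall pattern looks roughly as follows; this is only a sketch that reuses the simulated gamma data shown earlier, and the recipe's actual code may differ:

library(bbmle)
N <- 1000
xx <- rgamma(N, shape = 20, rate = 2)
# the negative log-likelihood, which mle2() minimizes over shape and rate
LL <- function(shape, rate) {
  -sum(dgamma(xx, shape = shape, rate = rate, log = TRUE))
}
# rough starting values; the optimizer may warn if it wanders into invalid values
fit <- mle2(LL, start = list(shape = 10, rate = 1))
summary(fit)  # estimates near shape = 20 and rate = 2, with standard errors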
Maximum likelihood estimators converge in probability to the true values (are consistent) as long as certain regularity conditions hold (see https://en.wikipedia.org/wiki/Maximum_likelihood_estimation).
R provides a vast number of functions for working with statistical distributions, which can be either discrete or continuous. These functions are important because, in statistics, we generally need to assume that the data follows some distribution.
Let's assume we have a variable X distributed according to a specific distribution. The density function maps every value in the domain of the variable's distribution to its probability density. The cumulative density function (CDF) returns the cumulative probability mass for each value of X. The quantile function expects a probability p (between 0 and 1) and returns the value of X that has a probability mass of p to its left. For most distributions, we can use specific R functions to calculate these. On the other hand, if we want to generate random numbers according to a distribution, we can use R's random number generators (RNGs).
No specific package is needed for this recipe.
Most distributions in R have densities, cumulative densities, quantiles, and RNGs. They are generally called in R using the same approach (d for densities, q for quantiles, r for random numbers, and p for the cumulative density function) combined with the distribution name.
For example, qnorm returns the quantile function for the normal (Gaussian) distribution, and qchisq returns the quantile function for the chi-squared distribution. pnorm returns the cumulative distribution function for a Gaussian distribution; pt returns it for a Student's t-distribution.
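As a quick illustration for the standard Gaussian (values rounded):

dnorm(0)      # density at x = 0, approximately 0.399
pnorm(2)      # cumulative probability up to x = 2, approximately 0.977
qnorm(0.977)  # the x value with 97.7% of the mass to its left, just under 2
rnorm(5)      # five random draws from the standard Gaussian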
As can be seen, when we ask for the 97.7% quantile, we get 1.99, which coincides with the accumulated probability we get when we call pnorm() for x=2.
We can use the same approach for other distributions. For example, we can get the area to the left of x=3 for a chi-squared distribution with 33 degrees of freedom:
print(paste("Area to the left of x=3",pchisq(3,33)))
Running the preceding code prints an extremely small probability, as expected: x=3 lies deep in the left tail of a chi-squared distribution with 33 degrees of freedom (whose mean is 33).
The ggplot2 package has become the dominant R package for creating serious plots, mainly due to its beautiful aesthetics. ggplot2 allows the user to define plots in a sequential (or additive) way, and this great syntax has contributed to its enormous success. As you would expect, the package can handle a wide variety of plots.
In order to run this example, you will need the ggplot2 and the reshape packages. Both can be installed using the install.packages() command.
In order to build a stacked plot, we need to supply three arguments to the aes() function: the x variable is the x axis, y is the bar height, and fill is the color. The geom_bar function specifies the type of bar that will be used. The stat = "identity" value tells ggplot that we don't want to apply any transformation and should leave the data as it is. We will use the reshape package for transforming the data into the format that we need.
The result has one bar for each company, with two colors. The red color corresponds to the Adjusted Sales and the green color corresponds to the Unadjusted Sales.
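A minimal sketch of this construction follows; the data frame here is hypothetical and merely stands in for the company sales data used in the recipe:

library(ggplot2)
library(reshape)
sales <- data.frame(company = c("A", "B", "C"),
                    Adjusted_Sales = c(100, 80, 120),
                    Unadjusted_Sales = c(110, 95, 130))
# melt() reshapes the data into long format: one row per company and measure
molten <- melt(sales, id.vars = "company")
# x = bar position, y = bar height, fill = color; stat = "identity" keeps the raw values
ggplot(molten, aes(x = company, y = value, fill = variable)) +
  geom_bar(stat = "identity")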
An excellent ggplot2 tutorial can be found at http://r-statistics.co/Complete-Ggplot2-Tutorial-Part2-Customizing-Theme-With-R-Code.html.
R includes routines to generate random numbers from many distributions. Different distributions require different algorithms, but in essence, all random number generation routines rely on a uniform random number generator that produces output in the interval (0, 1), plus some procedure that transforms this number according to the density that we need.
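The inverse-transform idea can be sketched in a few lines; the exponential distribution is used here only because its quantile function is simple (this is not code from any particular recipe):

set.seed(123)
u <- runif(10000)       # uniform draws on (0, 1)
x <- qexp(u, rate = 2)  # pushing them through the inverse CDF yields exponential draws
mean(x)                 # close to the theoretical mean of 1/rate = 0.5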
