R Statistics Cookbook

Francisco Juretig
Description

Solve real-world statistical problems using the most popular R packages and techniques

Key Features

  • Learn how to apply statistical methods to your everyday research with handy recipes
  • Foster your analytical skills and interpret research across industries and business verticals
  • Perform t-tests, chi-squared tests, and regression analysis using modern statistical techniques

Book Description

R is a popular programming language for developing statistical software. This book will be a useful guide to solving common and not-so-common challenges in statistics. With this book, you'll be equipped to confidently perform essential statistical procedures across your organization with the help of cutting-edge statistical tools.

You'll start by implementing data modeling, data analysis, and machine learning to solve real-world problems. You'll then understand how to work with nonparametric methods, mixed effects models, and hidden Markov models. This book contains recipes that will guide you through performing univariate and multivariate hypothesis tests, applying several regression techniques, and using robust techniques to minimize the impact of outliers in data. You'll also learn how to use the caret package for performing machine learning in R. Furthermore, this book will help you understand how to interpret charts and plots to get insights for better decision making.

By the end of this book, you will be able to apply your skills to statistical computations using R 3.5. You will also become well-versed with a wide array of statistical techniques in R that are extensively used in the data science industry.

What you will learn

  • Become well versed with recipes that will help you interpret plots with R
  • Formulate advanced statistical models in R and understand the concepts behind them
  • Perform Bayesian regression to predict models and input missing data
  • Use time series analysis for modelling and forecasting temporal data
  • Implement a range of regression techniques for efficient data modelling
  • Get to grips with robust statistics and hidden Markov models
  • Explore ANOVA (Analysis of Variance) and perform hypothesis testing

Who this book is for

If you are a quantitative researcher, statistician, data analyst, or data scientist looking to tackle various challenges in statistics, this book is what you need! Proficiency in R programming and basic knowledge of linear algebra are necessary to follow along with the recipes covered in this book.




R Statistics Cookbook

Over 100 recipes for performing complex statistical operations with R 3.5

Francisco Juretig

BIRMINGHAM - MUMBAI

R Statistics Cookbook

Copyright © 2019 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Commissioning Editor: Pravin Dhandre
Acquisition Editor: Devika Battike
Content Development Editor: Athikho Sapuni Rishana
Technical Editor: Utkarsha S. Kadam
Copy Editor: Safis Editing
Project Coordinator: Kirti Pisat
Proofreader: Safis Editing
Indexer: Priyanka Dhadke
Graphics: Jisha Chirayil
Production Coordinator: Arvindkumar Gupta

First published: March 2019

Production reference: 1280319

Published by Packt Publishing Ltd., Livery Place, 35 Livery Street, Birmingham B3 2PB, UK.

ISBN 978-1-78980-256-6

www.packtpub.com

mapt.io

Mapt is an online digital library that gives you full access to over 5,000 books and videos, as well as industry-leading tools to help you plan your personal development and advance your career. For more information, please visit our website.

Why subscribe?

Spend less time learning and more time coding with practical eBooks and Videos from over 4,000 industry professionals

Improve your learning with Skill Plans built especially for you

Get a free eBook or video every month

Mapt is fully searchable

Copy and paste, print, and bookmark content

Packt.com

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.packt.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.

At www.packt.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks. 

Contributors

About the author

Francisco Juretig has worked for over a decade in a variety of industries such as retail, gambling and finance deploying data-science solutions. He has written several R packages, and is a frequent contributor to the open source community.


About the reviewer

Davor Lozic is a senior software engineer interested in a variety of subjects, in particular computer security, algorithms, and data structures. He manages teams of more than 15 engineers and teaches database systems as a professor. You can contact him at [email protected]. He likes cats! If you want to talk about any aspect of technology, or if you have funny pictures of cats, feel free to contact him.

Packt is searching for authors like you

If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.

Table of Contents

Title Page

Copyright and Credits

R Statistics Cookbook

About Packt

Why subscribe?

Packt.com

Contributors

About the author

About the reviewer

Packt is searching for authors like you

Preface

Who this book is for

What this book covers

To get the most out of this book

Download the example code files

Download the color images

Conventions used

Sections

Getting ready

How to do it…

How it works…

There's more…

See also

Get in touch

Reviews

Getting Started with R and Statistics

Introduction

Technical requirements

Maximum likelihood estimation

Getting ready

How to do it...

How it works...

There's more...

See also

Calculating densities, quantiles, and CDFs

Getting ready

How to do it...

How it works...

There's more...

Creating barplots using ggplot

Getting ready

How to do it...

How it works...

There's more...

See also

Generating random numbers from multiple distributions

Getting ready

How to do it...

How it works...

There's more...

Complex data processing with dplyr

Getting ready

How to do it...

How it works...

There's more...

See also

3D visualization with the plot3d package

Getting ready

How to do it...

How it works...

Formatting tabular data with the formattable package

Getting ready

How to do it...

How it works...

There's more...

Simple random sampling

Getting ready

How to do it...

How it works...

Creating diagrams via the DiagrammeR package

Getting ready

How to do it...

How it works...

See also

C++ in R via the Rcpp package

Getting ready

How to do it...

How it works...

See also

Interactive plots with the ggplot GUI package

Getting ready

How to do it...

How it works...

There's more...

Animations with the gganimate package

Getting ready

How to do it...

How it works...

See also

Using R6 classes

Getting ready

How to do it...

How it works...

There's more...

Modeling sequences with the TraMineR package

Getting ready

How to do it...

How it works...

There's more...

Clustering sequences with the TraMineR package

Getting ready

How to do it...

How it works...

There's more...

Displaying geographical data with the leaflet package

Getting ready

How to do it...

How it works...

Univariate and Multivariate Tests for Equality of Means

Introduction

The univariate t-test

Getting ready

How to do it...

How it works...

There's more...

The Fisher-Behrens problem

How to do it...

How it works...

There's more...

Paired t-test

How to do it...

How it works...

There's more...

Calculating ANOVA sum of squares and F tests

How to do it...

Two-way ANOVA

How to do it...

How it works...

There's more...

Type I, Type II, and Type III sum of squares

Type I

Type II

Type III

Getting ready

How to do it...

How it works...

Random effects

Getting ready

How to do it...

How it works...

There's more...

Repeated measures

Getting ready

How to do it...

How it works...

There's more...

Multivariate t-test

Getting ready

How to do it...

How it works...

There's more...

MANOVA

Getting ready

How to do it...

How it works...

There's more...

Linear Regression

Introduction

Computing ordinary least squares estimates 

How to do it...

How it works...

Reporting results with the sjPlot package 

Getting ready 

How to do it...

How it works...

There's more...

Finding correlation between the features 

Getting ready

How to do it... 

Testing hypothesis 

Getting ready 

How to do it... 

How it works... 

Testing homoscedasticity 

Getting ready 

How to do it... 

How it works...

Implementing sandwich estimators 

Getting ready 

How to do it... 

How it works...

Variable selection 

Getting ready 

How to do it... 

How it works... 

Ridge regression 

Getting ready 

How to do it... 

How it works... 

Working with LASSO 

Getting ready 

How to do it...

How it works...

There's more...

Leverage, residuals, and influence 

Getting ready 

How to do it...

How it works... 

Bayesian Regression

Introduction

Getting the posterior density in STAN 

Getting ready

How to do it...

How it works...

Formulating a linear regression model

Getting ready

How to do it...

How it works...

There's more...

Assigning the priors

Defining the support

How to decide the parameters for a prior

Getting ready

How to do it...

How it works...

Doing MCMC the manual way

Getting ready

How to do it...

How it works...

Evaluating convergence with CODA

One or multiple chains?

Getting ready

How to do it...

How it works...

There's more...

Bayesian variable selection

Getting ready

How to do it...

How it works...

There's more...

See also

Using a model for prediction

Getting ready

How to do it...

How it works...

GLMs in JAGS

Getting ready

How to do it...

How it works...

Nonparametric Methods

Introduction

The Mann-Whitney test

How to do it...

How it works...

There's more...

Estimating nonparametric ANOVA

Getting ready

How to do it...

How it works...

The Spearman's rank correlation test

How to do it...

How it works...

There's more...

LOESS regression

Getting ready

How to do it...

How it works...

There's more...

Finding the best transformations via the acepack package

Getting ready

How to do it...

How it works...

There's more...

Nonparametric multivariate tests using the npmv package

Getting ready

How to do it...

How it works...

Semiparametric regression with the SemiPar package

Getting ready

How to do it...

How it works...

There's more...

Robust Methods

Introduction

Robust linear regression

Getting ready

How to do it...

How it works...

Estimating robust covariance matrices

Getting ready

How to do it...

How it works...

Robust logistic regression

Getting ready

How to do it...

How it works...

Robust ANOVA using the robust package

Getting ready

How to do it...

How it works...

Robust principal components

Getting ready

How to do it...

How it works...

Robust Gaussian mixture models with the qclust package

Getting ready

How to do it...

How it works...

Robust clustering

Getting ready

How to do it...

How it works...

Time Series Analysis

Introduction

The general ARIMA model 

Getting ready

How to do it...

How it works...

Seasonality and SARIMAX models 

Getting ready

How to do it...

There's more...

Choosing the best model with the forecast package 

Getting ready

How to do it...  

How it works... 

Vector autoregressions (VARs)  

Getting ready

How to do it...  

How it works... 

Facebook's automatic Prophet forecasting  

Getting ready

How to do it...  

How it works...

There's more...

Modeling count temporal data 

Getting ready

How to do it... 

There's more...

Imputing missing values in time series  

Getting ready

How to do it...

How it works... 

There's more... 

Anomaly detection 

Getting ready

How to do it... 

How it works... 

There's more... 

Spectral decomposition of time series 

Getting ready

How to do it... 

How it works... 

Mixed Effects Models

Introduction

The standard model and ANOVA 

Getting ready

How to do it...

How it works... 

Some useful plots for mixed effects models 

Getting ready

How to do it... 

There's more... 

Nonlinear mixed effects models 

Getting ready

How to do it... 

How it works... 

There's more... 

Crossed and nested designs 

Crossed design 

Nested design 

Getting ready 

How to do it... 

How it works...

Robust mixed effects models with robustlmm 

Getting ready

How to do it... 

How it works... 

Choosing the best linear mixed model

Getting ready 

How to do it... 

How it works... 

Mixed generalized linear models 

Getting ready

How to do it... 

How it works... 

There's more...

Predictive Models Using the Caret Package

Introduction

Data splitting and general model fitting

Getting ready

How to do it...

How it works...

There's more...

See also

Preprocessing

Getting ready

How to do it...

How it works...

Variable importance and feature selection

Getting ready

How to do it...

How it works...

Model tuning

Getting ready

How to do it...

How it works...

Classification in caret and ROC curves

Getting ready

How to do it...

How it works...

Gradient boosting and class imbalance

Getting ready

How to do it...

How it works...

Lasso, ridge, and elasticnet in caret

Getting ready

How to do it...

How it works...

Logic regression

Getting ready

How to do it...

How it works...

Bayesian Networks and Hidden Markov Models

Introduction

A discrete Bayesian network via bnlearn

Getting ready

How to do it...

How it works...

There's more...

See also

Conditional independence tests

Getting ready

How to do it...

How it works...

There's more...

Continuous and hybrid Bayesian networks via bnlearn

Getting ready

How to do it...

How it works...

See also

Interactive visualization of BNs with the bnviewer package

Getting ready

How to do it...

How it works...

An introductory hidden Markov model

Getting ready

How to do it...

How it works...

There's more...

Regime switching in financial data via HMM

Getting ready

How to do it...

How it works...

There's more...

Other Books You May Enjoy

Leave a review - let other readers know what you think

Preface

R is a popular programming language for developing statistical software. This book will be a useful guide to solving common and not-so-common challenges in statistics. With this book, you'll be equipped to confidently perform essential statistical procedures across your organization with the help of cutting-edge statistical tools. You'll start by implementing data modeling, data analysis, and machine learning to solve real-world problems. You'll then understand how to work with nonparametric methods, mixed effects models, and hidden Markov models. This book contains recipes that will guide you through performing univariate and multivariate hypothesis tests, applying several regression techniques, and using robust techniques to minimize the impact of outliers in data. You'll also learn how to use the caret package for performing machine learning in R. Furthermore, this book will help you understand how to interpret charts and plots to get insights for better decision making. By the end of this book, you will be able to apply your skills to statistical computations using R 3.5. You will also become well-versed with a wide array of statistical techniques in R that are extensively used in the data science industry.

Who this book is for

If you are a quantitative researcher, statistician, data analyst, or data scientist looking to tackle common and not-so-common challenges in statistics, then this book is what you need! A solid understanding of R programming and a basic understanding of statistics and linear algebra are required to follow along with the recipes.

What this book covers

Chapter 1, Getting Started with R and Statistics, reviews a variety of techniques in R for performing data processing, data analysis, and plotting. We will also explain how to work with some basic statistical techniques, such as sampling, maximum likelihood estimation, and random number generation. In addition, we will present some useful coding techniques, such as writing C++ functions using Rcpp, and using R6 classes. The former will allow us to add high-performance compiled code, whereas the latter will allow us to perform object-oriented programming in R.

Chapter 2, Univariate and Multivariate Tests for Equality of Means, explains how to answer the most basic statistical question: do two (or possibly more) populations have the same mean? This arises when we want to evaluate whether a certain treatment/policy is effective compared to a baseline effect. This can naturally be extended to multiple groups, and the technique used for this is called Analysis of Variance (ANOVA). ANOVA can itself be extended to accommodate multiple effects; for example, testing whether the background color of a website and the font style drive sales up. This is known as two-way ANOVA, and leads to additional complications: not only do we have multiple effects to estimate, but we could also have interaction effects between them (for example, a certain background color could be effective when used in conjunction with a specific font type). ANOVA can also be extended in other dimensions, such as adding random effects (effects that originate from a large population and for which we don't want to estimate a parameter for each level), or repeated measures for each observation.

A different problem arises when we have multiple variables, instead of a single one, that we want to compare across two or more groups. In this case, we are generalizing the t-test and ANOVA to a multi-dimensional case: for the former (two groups), the technique that we use is called Hotelling's t-test, and for the latter (more than two groups), the technique is MANOVA (multivariate ANOVA). We will review how to use all these techniques in R.
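To make the two-way website example concrete, a model of this kind can be sketched in R's formula notation as follows (the data frame and variable names here are hypothetical):

fit <- aov(sales ~ background * font, data = website_data)  # website_data is hypothetical
summary(fit)  # main effects for background and font, plus their interaction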

Chapter 3, Linear Regression, deals with the most important tool in statistics. It can be used in almost any situation where we want to predict a numeric variable in terms of many independent ones. As its name implies, the assumption is that there is a linear relationship between the covariates and the target. In this chapter, we will review how to formulate these models, with a special focus on ordinary least squares (the most widely used algorithm for linear regression).
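As a minimal illustration of the kind of model this chapter builds, an ordinary least squares fit on a built-in dataset looks like this:

fit <- lm(mpg ~ wt + hp, data = mtcars)  # regress fuel efficiency on weight and horsepower
summary(fit)                             # coefficients, standard errors, and fit statistics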

Chapter 4, Bayesian Regression, explains how to work with regression in a Bayesian context. Hitherto, we have assumed that there are some fixed parameters behind the data generation process (for t-tests, we assume that there are fixed means for each group), and that, because of sample variability, we will observe minor deviations from them. The Bayesian approach is radically different, and rests on different methodological and epistemological foundations. The idea is that coefficients are not fixed quantities that we want to draw inferences upon, but random variables themselves.

The idea is that given a prior density (the prior belief that we have) for each coefficient, we want to update these priors using the data, in order to arrive at a posterior density. For example, if we think a person always arrives on time (this would be a prior), and we observe that this person arrived late on 8 out of 10 occasions, we should update our initial expectation accordingly. Unfortunately, Bayesian models do not generate closed-form expressions (in most practical cases), so they can't be solved easily. We will need to use sophisticated techniques to estimate these posterior densities: the tool that is used most frequently for this purpose is MCMC (Markov chain Monte Carlo). We will review how to formulate models using the best packages available: JAGS and STAN.
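The arriving-late example can be made concrete with a conjugate Beta-Binomial update, which needs no MCMC at all; the prior parameters below are purely illustrative:

a <- 9; b <- 1              # Beta(9, 1) prior: we believe the person is almost always on time
on_time <- 2; late <- 8     # observed: on time on only 2 out of 10 occasions
post_a <- a + on_time       # conjugacy: posterior is Beta(a + successes, b + failures)
post_b <- b + late
post_a / (post_a + post_b)  # posterior mean: prior mean of 0.9 drops to 0.55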

Chapter 5, Nonparametric Methods, explains how classical methods rely on the assumption that there is an underlying distribution (usually a Gaussian one), and derive tests for each case. For instance, the underlying assumption in t-tests is that the data originates from two Gaussian populations with the same variance. In general, these assumptions do make sense, and even when they are not met, in large samples the violations become less relevant (for example, the t-test works well in large samples even when the normality assumption is violated). But what can we do when we are working with small samples, or in cases where the distributional assumptions clearly fail? Nonparametric methods are designed to work with no distributional assumptions, using a series of smart tricks that depend on each particular case. When the data does follow the distribution that the parametric test needs (for example, normality for t-tests), they work almost as well as the parametric methods, and when the data does not follow that distribution, they still work anyway. We will use a variety of nonparametric tools for regression, ANOVA, and many more.
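For instance, the Mann-Whitney test covered in this chapter compares two samples without assuming normality; a quick sketch with simulated data:

set.seed(1)
x <- rexp(20)              # two small, clearly non-Gaussian samples
y <- rexp(20, rate = 0.5)
wilcox.test(x, y)          # Mann-Whitney: no distributional assumption needed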

Chapter 6, Robust Methods, explains why classical methods don't work well in the presence of outliers. Robust methods, on the other hand, are designed to intelligently flag abnormal observations and estimate the appropriate coefficients in the presence of contamination. In this chapter, we will review some of the most frequently used robust techniques for regression, classification, ANOVA, and clustering.
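As a taste of what robust estimation looks like (using MASS::rlm here as a stand-in; the chapter itself relies on the robust package):

library(MASS)
fit <- rlm(stack.loss ~ ., data = stackloss)  # M-estimation downweights outlying observations
summary(fit)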

Chapter 7, Time Series Analysis, describes how to work with time series (sequences of observations indexed by time). Although there are several ways of modeling them, the most widely used framework is called ARIMA. The idea is to decompose the series into the sum of deterministic and stochastic components in such a way that the past is used to predict the future of the series. These techniques are well established and work really well with actual data but, unfortunately, they require a lot of manual work. In this chapter, we will present several ARIMA techniques, demonstrating how to extend them to multivariate data, how to impute missing values in the series, how to detect outliers, and how to use several automatic packages that build the best model for us.
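One of the automatic tools mentioned above is the forecast package; a minimal sketch on a built-in series:

library(forecast)
fit <- auto.arima(AirPassengers)  # searches over ARIMA orders automatically
forecast(fit, h = 12)             # forecasts the next 12 months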

Chapter 8, Mixed Effects Models, introduces mixed effects models. These models arise when we mix fixed and random effects. Fixed effects (the ones we have used so far, except for Chapter 4, Bayesian Regression) are treated as fixed parameters that are estimated. For example, if we model the sales of a product in terms of a particular month, each month will have a distinct parameter (this would be a fixed effect). On the other hand, if we were measuring whether a drug is useful for certain patients, and we had multiple observations per patient, we might want to keep a patient effect but not a coefficient for each patient. If we had 2,000 patients, those coefficients would be unmanageable and, at the same time, would introduce a lot of imprecision into our model. A neater approach is to treat the patient effect as random: we assume that each patient receives a random shock, and all observations belonging to the same patient will be correlated.

In this chapter, we will work with these models using the lme4 package and its lmer function, and we will extend them to nonlinear mixed effects models (for when the response is nonlinear). The main problem with these models (both linear and nonlinear) is that the degrees of freedom are unknown, rendering the usual tests useless.
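A sketch of the patient example in lme4 syntax (the data frame and column names are hypothetical):

library(lme4)
# Random intercept per patient: observations from the same patient are correlated
fit <- lmer(response ~ drug + (1 | patient), data = patients)  # patients is hypothetical
summary(fit)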

Chapter 9, Predictive Models Using the Caret Package, describes how to use the caret package, which is the fundamental workhorse for building predictive models in R (some of these models have already been presented in previous chapters). It provides a consistent syntax and a unified approach for building a variety of models. In addition, it has great tools for performing preprocessing and feature selection. In this chapter, we present several models in caret, such as random forests, gradient boosting, and LASSO.
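A minimal sketch of caret's unified interface, splitting a built-in dataset and fitting a random forest (method = "rf" assumes the randomForest package is installed):

library(caret)
idx <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
training <- iris[idx, ]
testing  <- iris[-idx, ]
fit <- train(Species ~ ., data = training, method = "rf")  # the same call works for many models
predict(fit, newdata = testing)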

Chapter 10, Bayesian Networks and Hidden Markov Models, describes how, in some cases, we might want to model a network of relationships in such a way that we can understand how the variables are connected. For example, the office location might make employees happier, and also make them arrive earlier to work: the two combined effects might make them perform better. If they perform better, they will receive better bonuses; actually, the bonuses will be dependent on those two variables directly, and also on the office location indirectly. Bayesian networks allow us to perform complex network modeling, and the main tool used for this is the bnlearn package. Another advanced statistical tool is hidden Markov models: they allow us to estimate the state of unobserved variables by using a very complex computational machinery. In this chapter, we will work through two examples using hidden Markov models.
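As a glimpse of the bnlearn workflow (using its bundled learning.test dataset):

library(bnlearn)
dag <- hc(learning.test)           # learn the network structure via hill climbing
fit <- bn.fit(dag, learning.test)  # estimate the conditional probability tables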

To get the most out of this book

Users should have some familiarity with statistics and programming. Some general knowledge of probability, regression, and data analysis is recommended. 

R is required for this book, and RStudio is highly recommended. All the packages used throughout this book can be installed following the instructions for each recipe.

Download the example code files

You can download the example code files for this book from your account at www.packt.com. If you purchased this book elsewhere, you can visit www.packt.com/support and register to have the files emailed directly to you.

You can download the code files by following these steps:

1. Log in or register at www.packt.com.

2. Select the SUPPORT tab.

3. Click on Code Downloads & Errata.

4. Enter the name of the book in the Search box and follow the onscreen instructions.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

WinRAR/7-Zip for Windows

Zipeg/iZip/UnRarX for Mac

7-Zip/PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/R_Statistics_Cookbook. In case there's an update to the code, it will be updated on the existing GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Download the color images

We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: https://www.packtpub.com/sites/default/files/downloads/9781789802566_ColorImages.pdf.

Conventions used

There are a number of text conventions used throughout this book.

CodeInText: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "Import the ggplot2 and reshape libraries."

A block of code is set as follows:

library(bbmle)                          # provides the mle2 function
N <- 1000                               # sample size
xx <- rgamma(N, shape = 20, rate = 2)   # simulate gamma-distributed data

Any command-line input or output is written as follows:

> install.packages("tscount")

Bold: Indicates a new term, an important word, or words that you see on screen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: "The cumulative density function (CDF) returns the cumulative probability mass for each value of X."

Warnings or important notes appear like this.
Tips and tricks appear like this.

Sections

In this book, you will find several headings that appear frequently (Getting ready, How to do it..., How it works..., There's more..., and See also).

To give clear instructions on how to complete a recipe, use these sections as follows:

Getting ready

This section tells you what to expect in the recipe and describes how to set up any software or any preliminary settings required for the recipe.

How to do it…

This section contains the steps required to follow the recipe.

How it works…

This section usually consists of a detailed explanation of what happened in the previous section.

There's more…

This section consists of additional information about the recipe in order to make you more knowledgeable about the recipe.

See also

This section provides helpful links to other useful information for the recipe.

Get in touch

Feedback from our readers is always welcome.

General feedback: If you have questions about any aspect of this book, mention the book title in the subject of your message and email us at [email protected].

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packt.com/submit-errata, select your book, click on the Errata Submission Form link, and enter the details.

Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in, and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Reviews

Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!

For more information about Packt, please visit packt.com.

Getting Started with R and Statistics

In this chapter, we will cover the following recipes:

Maximum likelihood estimation

Calculating densities, quantiles, and CDFs

Creating barplots using ggplot

Generating random numbers from multiple distributions

Complex data processing with dplyr

3D visualization with the plot3d package

Formatting tabular data with the formattable package

Simple random sampling

Creating diagrams via the DiagrammeR package

C++ in R via the Rcpp package

Interactive plots with the ggplot GUI package

Animations with the gganimate package

Using R6 classes

Modeling sequences with the TraMineR package

Clustering sequences with the TraMineR package

Displaying geographical data with the leaflet package

Introduction

In this chapter, we will introduce a wide array of topics regarding statistics and data analysis in R. We will use quite a diverse set of packages, most of which have been released over recent years.

We'll start by generating random numbers, fitting distributions to data, and using several packages to plot data. We will then move on to sampling, creating diagrams with the DiagrammeR package, and analyzing sequence data with the TraMineR package. We also present several techniques that are not strictly statistical, but important for dealing with advanced methods in R: we introduce the Rcpp package (used for embedding highly efficient C++ code into your R scripts) and the R6 package (used for operating with R6 classes, allowing you to code using an object-oriented approach in R).
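To give a flavor of the Rcpp approach ahead of the dedicated recipe, a small C++ function can be compiled and called from R in a couple of lines (this requires a working C++ toolchain):

library(Rcpp)
cppFunction('int timesTwo(int x) { return 2 * x; }')  # compiles and exposes the C++ function
timesTwo(21)  # returns 42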

Technical requirements

We will use R and its packages, which can be installed via the install.packages() command; we will indicate which ones are necessary for each recipe in the corresponding Getting ready section.

Maximum likelihood estimation

Suppose we observe a hundred roulette spins, and we get red 30 times and black 70 times. We can start by assuming that the probability of getting red is 0.5 (and black, obviously, 0.5). This is certainly not a very good assumption, because if that were the case, we should have seen red nearly 50 times and black nearly 50 times, but we did not. It is thus evident that a more reasonable assumption would have been a probability of 0.3 for red (and thus 0.7 for black).

The principle of maximum likelihood establishes that, given the data, we can formulate a model and tweak its parameters to maximize the probability (likelihood) of having observed what we did observe. Additionally, maximum likelihood allows us to calculate the precision (standard error) of each estimated coefficient easily. These standard errors are obtained from the curvature of the log-likelihood, which is given by its second-order derivatives with respect to each parameter.

The likelihood is essentially the product of several probabilities. Multiplying lots of probabilities is never a good idea, because if the probabilities are small, we would very likely end up with a number so small that the computer cannot represent it accurately. Therefore, what we end up using is the log-likelihood, which is the sum of the logarithms of those probabilities.

In many situations, we also want to know if the coefficients are statistically different from zero. Imagine we have a sample of growth rates for many companies for a particular year, and we want to use the average as an indicator of whether the economy is growing or not. In other words, we want to test whether the mean is equal to zero or not. We could fit that distribution of growth rates to a Gaussian distribution (which has two parameters, μ and σ), and test whether the estimated μ is statistically equal to zero; in a Gaussian distribution, the mean is μ. When doing hypothesis testing, we need to specify a null hypothesis and an alternative one. For this case, the null hypothesis is that this parameter is equal to zero. Intuition would tell us that if an estimated parameter is large, we can reject the null hypothesis. The problem is that we need to define what large is. This is why we don't use the estimated coefficients directly, but a statistic called the Z value, defined as the estimated value divided by its standard error. It can be proven that these Z values are distributed according to a Gaussian distribution.

So, once we have the Z value statistic, how can we reject or not reject the null hypothesis? Assuming that the null hypothesis is true (that the coefficient is equal to zero), we can compute the probability that we get a test statistic as large as, or larger than, the one we got (this probability is known as the p-value). Remember that we assume that the coefficients have fixed values, and we will observe random deviations from them in our samples (we actually have one sample). If the probability of finding coefficients as large as the ones we observed is small, assuming that the true ones are zero, then luck alone can't explain the coefficients that we got. The final conclusion in that case is to reject the null hypothesis and conclude that the coefficient is different from zero.
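As a quick illustration of this logic (the coefficient and standard error below are made up):

coef_hat <- 0.8        # hypothetical estimated coefficient
se_hat <- 0.3          # hypothetical standard error
z <- coef_hat / se_hat # Z value: estimate divided by its standard error
2 * pnorm(-abs(z))     # two-sided p-value under the null hypothesis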

Getting ready

The bbmle package can be installed using the install.packages("bbmle") function in R.

How it works...

The LL function wraps the log-likelihood computation and is called repeatedly by the mle2 function, which uses a derivative-based algorithm to find the maximum of the log-likelihood.
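A minimal sketch of this pattern, assuming Gaussian data (the simulated sample and starting values here are illustrative):

library(bbmle)
set.seed(10)
x <- rnorm(1000, mean = 0.5, sd = 2)  # simulated growth rates

LL <- function(mu, sigma) {
  -sum(dnorm(x, mean = mu, sd = sigma, log = TRUE))  # negative log-likelihood
}

fit <- mle2(LL, start = list(mu = 0, sigma = 1))
summary(fit)  # estimates, standard errors, and Z values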

See also

Maximum likelihood estimators converge in probability to the true values (are consistent) as long as certain regularity conditions hold (see https://en.wikipedia.org/wiki/Maximum_likelihood_estimation).

Calculating densities, quantiles, and CDFs

R provides a vast number of functions for working with statistical distributions, which can be either discrete or continuous. These functions are important because, in statistics, we generally need to assume that the data is distributed according to some distribution.

Let's assume we have a variable X distributed according to a specific distribution. The density function maps every value in the domain of the distribution of the variable to a probability (or probability density). The cumulative density function (CDF) returns the cumulative probability mass for each value of X. The quantile function expects a probability p (between 0 and 1) and returns the value of X that has a probability mass of p to its left. For most distributions, we can use specific R functions to calculate these. On the other hand, if we want to generate random numbers according to a distribution, we can use R's random number generators (RNGs).

Getting ready

No specific package is needed for this recipe.

How it works...

Most distributions in R have densities, cumulative densities, quantiles, and RNGs. They are generally called in R using the same approach (d for densities, q for quantiles, r for random numbers, and p for the cumulative density function) combined with the distribution name.

For example, qnorm returns the quantile function for a Gaussian (normal) distribution, and qchisq returns the quantile function for the chi-squared distribution. pnorm returns the cumulative distribution function for a Gaussian distribution; pt returns it for a Student's t-distribution.

For example, when we ask qnorm for the 97.7% quantile, we get approximately 1.99, which coincides with the accumulated probability we get when we call pnorm() for x=2.
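This relationship can be verified directly:

qnorm(0.977)  # roughly 1.99: the value with 97.7% of the mass to its left
pnorm(2)      # roughly 0.977: the accumulated probability up to x = 2
dnorm(0)      # density of the standard Gaussian at zero
rnorm(5)      # five random draws from the standard Gaussian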

There's more...

We can use the same approach for other distributions. For example, we can get the area to the left of x=3 for a chi-squared distribution with 33 degrees of freedom:

print(paste("Area to the left of x=3", pchisq(3, 33)))

After running the preceding code, the area is printed to the console; it is a very small number, since x=3 lies far in the left tail of a chi-squared distribution with 33 degrees of freedom.

Creating barplots using ggplot

The ggplot2 package has become the dominant R package for creating serious plots, mainly due to its beautiful aesthetics. It allows the user to define plots in a sequential (or additive) way, and this great syntax has contributed to its enormous success. As you would expect, the package can handle a wide variety of plots.

Getting ready

In order to run this example, you will need the ggplot2 and the reshape packages. Both can be installed using the install.packages() command.

How it works...

In order to build a stacked plot, we need to supply three arguments to the aes() function: the x variable is the x axis, y is the bar height, and fill is the color. The geom_bar function specifies the type of bar that will be used, and the stat = "identity" value tells ggplot that we don't want to apply any transformation, but leave the data as it is. We will use the reshape package for transforming the data into the format that we need.

The result has one bar for each company, with two colors. The red color corresponds to the Adjusted Sales and the green color corresponds to the Unadjusted Sales.
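A compact sketch of such a stacked barplot (the sales figures here are invented):

library(ggplot2)
library(reshape)

sales <- data.frame(Company = c("A", "B", "C"),
                    Adjusted_Sales = c(10, 14, 9),
                    Unadjusted_Sales = c(12, 15, 11))
molten <- melt(sales, id.vars = "Company")  # long format: one row per company/measure pair

ggplot(molten, aes(x = Company, y = value, fill = variable)) +
  geom_bar(stat = "identity")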

See also

An excellent ggplot2 tutorial can be found at http://r-statistics.co/Complete-Ggplot2-Tutorial-Part2-Customizing-Theme-With-R-Code.html.

Generating random numbers from multiple distributions

R includes routines to generate random numbers from many distributions, and different distributions require different algorithms. In essence, all random number generation routines rely on a uniform random number generator that produces an output on (0,1), plus some procedure that transforms this number into a draw from the density that we need.
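The inverse-CDF idea behind this can be demonstrated directly: pushing uniform draws through qnorm() yields Gaussian draws:

set.seed(123)
u <- runif(10000)  # uniform numbers on (0, 1)
x <- qnorm(u)      # inverse-CDF transform turns them into standard Gaussian draws
c(mean(x), sd(x))  # close to 0 and 1, as expected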