38,39 €
Over 100 hands-on recipes to effectively solve real-world data problems using the most popular R packages and techniques
This book is for those who are already familiar with the basic operation of R, but want to learn how to efficiently and effectively analyze real-world data problems using practical R packages.
This cookbook offers a range of data analysis samples in simple and straightforward R code, providing step-by-step resources and time-saving methods to help you solve data problems efficiently.
The first section deals with how to create R functions to avoid the unnecessary duplication of code. You will learn how to prepare, process, and perform sophisticated ETL for heterogeneous data sources with R packages. An example of data manipulation is provided, illustrating how to use the “dplyr” and “data.table” packages to efficiently process larger data structures. We also focus on “ggplot2” and show you how to create advanced figures for data exploration.
In addition, you will learn how to build an interactive report using the “ggvis” package. Later chapters offer insight into time series analysis on financial data, while there is detailed information on the hot topic of machine learning, including data classification, regression, clustering, association rule mining, and dimension reduction.
By the end of this book, you will understand how to resolve issues and will be able to comfortably offer solutions to problems encountered while performing data analysis.
This easy-to-follow guide is full of hands-on examples of data analysis with R. Each topic is fully explained beginning with the core concept, followed by step-by-step practical examples, and concluding with detailed explanations of each concept used.
Sie lesen das E-Book in den Legimi-Apps auf:
Seitenzahl: 455
Veröffentlichungsjahr: 2016
Copyright © 2016 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: July 2016
Production reference: 1270716
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78439-081-5
www.packtpub.com
Author
Yu-Wei, Chiu (David Chiu)
Reviewer
Prabhanjan Tattar
Commissioning Editor
Veena Pagare
Acquisition Editor
Tushar Gupta
Content Development Editor
Pooja Mhapsekar
Technical Editor
Madhunikita Sunil Chindarkar
Copy Editor
Priyanka Ravi
Project Coordinator
Suzanne Coutinho
Proofreader
Safis Editing
Indexer
Tejal Daruwale Soni
Graphics
Jason Monteiro
Production Coordinator
Aparna Bhagat
Cover Work
Aparna Bhagat
Yu-Wei, Chiu (David Chiu) is the founder of LargitData (www.LargitData.com), a startup company that mainly focuses on providing big data and machine learning products. He has previously worked for Trend Micro as a software engineer, where he was responsible for building big data platforms for business intelligence and customer relationship management systems. In addition to being a start-up entrepreneur and data scientist, he specializes in using Spark and Hadoop to process big data and apply data mining techniques for data analysis. Yu-Wei is also a professional lecturer and has delivered lectures on big data and machine learning in R and Python, and given tech talks at a variety of conferences.
In 2015, Yu-Wei wrote Machine Learning with R Cookbook, Packt Publishing. In 2013, Yu-Wei reviewed Bioinformatics with R Cookbook, Packt Publishing. For more information, visit his personal website at www.ywchiu.com.
I have immense gratitude for my family and friends for supporting and encouraging me to complete this book. I would like to sincerely thank my mother, Ming-Yang Huang (Miranda Huang); my mentor, Man-Kwan Shan; the proofreader of this book, Brendan Fisher; members of LargitData; Data Science Program (DSP); and other friends who have offered their support.
Prabhanjan Tattar is currently working as a senior data scientist at Fractal Analytics, Inc. He has 8 years of experience as a statistical analyst. Survival analysis and statistical inference are his main areas of research and interest, and he has published several research papers in peer-reviewed journals, as well as authoring two books on R: R Statistical Application Development by Example, Packt Publishing, and A Course in Statistics with R, Wiley. The R packages gpk, RSADBE, and ACSWR are also maintained by him.
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at <[email protected]> for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.
https://www2.packtpub.com/books/subscription/packtlib
Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can search, access, and read Packt's entire library of books.
Big data, the Internet of Things, and artificial intelligence have become the hottest technology buzzwords in recent years. Although there are many different terms used to define these technologies, the common concept is that they're all driven by data. Simply having data is not enough; being able to unlock its value is essential. Therefore, data scientists have begun to focus on how to gain insights from raw data.
Data science has become one of the most popular subjects among academic and industry groups. However, as data science is a very broad discipline, learning how to master it can be challenging. A beginner must learn how to prepare, process, aggregate, and visualize data. More advanced techniques involve machine learning, mining various data formats (text, image, and video), and, most importantly, using data to generate business value. The role of a data scientist is challenging and requires a great deal of effort. A successful data scientist requires a useful tool to help solve day-to-day problems.
In this field, the most widely used tool by data scientists is the R language, which is open source and free. Being a machine language, it provides many data processes, learning packages, and visualization functions, allowing users to analyze data on the fly. R helps users quickly perform analysis and execute machine learning algorithms on their dataset without knowing every detail of the sophisticated mathematical models.
R for Data Science Cookbook takes a practical approach to teaching you how to put data science into practice with R. The book has 12 chapters, each of which is introduced by breaking down the topic into several simple recipes. Through the step-by-step instructions in each recipe, you can apply what you have learned from the book by using a variety of packages in R.
The first section of this book deals with how to create R functions to avoid unnecessary duplication of code. You will learn how to prepare, process, and perform sophisticated ETL operations for heterogeneous data sources with R packages. An example of data manipulation is provided that illustrates how to use the dplyr and data.table packages to process larger data structures efficiently, while there is a section focusing on ggplot2 that covers how to create advanced figures for data exploration. Also, you will learn how to build an interactive report using the ggvis package.
This book also explains how to use data mining to discover items that are frequently purchased together. Later chapters offer insight into time series analysis on financial data, while there is detailed information on the hot topic of machine learning, including data classification, regression, clustering, and dimension reduction.
With R for Data Science Cookbook in hand, I can assure you that you will find data science has never been easier.
Chapter 1, Functions in R, describes how to create R functions. This chapter covers the basic composition, environment, and argument matching of an R function. Furthermore, we will look at advanced topics such as closure, functional programming, and how to properly handle errors.
Chapter 2, Data Extracting, Transforming, and Loading, teaches you how to read structured and unstructured data with R. The chapter begins by collecting data from text files. Subsequently, we will look at how to connect R to a database. Lastly, you will learn how to write a web scraper to crawl through unstructured data from a web page or social media site.
Chapter 3, Data Preprocessing and Preparation, introduces you to preparing data ready for analysis. In this chapter, we will cover the data preprocess steps, such as type conversion, adding, filtering, dropping, merging, reshaping, and missing-value imputation, with some basic R functions.
Chapter 4, Data Manipulation, demonstrates how to manipulate data in an efficient and effective manner with the advanced R packages data.table and dplyr. The data.table package exposes you to the possibility of quickly loading and aggregating large amounts of data. The dplyr package provides the ability to manipulate data in SQL-like syntax.
Chapter 5, Visualizing Data with ggplot2, explores using ggplot2 to visualize data. This chapter begins by introducing the basic building blocks of ggplot2. Next, we will cover advanced topics on how to create a more sophisticated graph with ggplot2 functions. Lastly, we will describe how to build a map with ggmap.
Chapter 6, Making Interactive Reports, reveals how to create a professional report with R. In the beginning, the chapter discusses how to write R markdown syntax and embed R code chunks. We will also explore how to add interactive charts to the report with ggvis. Finally, we will look at how to create and publish an R Shiny report.
Chapter 7, Simulation from Probability Distributions, begins with an emphasis on sampling data from different probability distributions. As a concrete example, we will look at how to simulate a stochastic trading process with a probability function.
Chapter 8, Statistical Inference in R, begins with a discussion on point estimation and confidence intervals. Subsequently, you will be introduced to parametric and non-parametric testing methods. Lastly, we will look at how one can use ANOVA to analyze whether the salary basis of an engineer differs based on his job title and location.
Chapter 9, Rule and Pattern Mining with R, exposes you to the common methods used to discover associated items and underlying frequency patterns from transaction data. In this chapter, we use a real-world blog as example data so that you can learn how to perform rule and pattern mining on real-world data.
Chapter 10, Time Series Mining with R, begins by introducing you to creating and manipulating time series from a finance dataset. Subsequently, we will learn how to forecast time series with HoltWinters and ARIMA. For a more concrete example, this chapter reveals how to predict stock prices with ARIMA.
Chapter 11, Supervised Machine Learning, teaches you how to build a model that makes predictions based on labeled training data. You will learn how to use regression models to make sense of numeric relationships and apply a fitted model to data for continuous value prediction. For classification, you will learn how to fit data into a tree-based classifier.
Chapter 12, Unsupervised Machine Learning, introduces you to revealing the hidden structure of unlabeled data. Firstly, we will look at how to group similarly located hotels together with the clustering method. Subsequently, we will learn how to select and extract features on the economy freedom dataset with PCA.
To follow the book's examples, you will need a computer with access to the Internet and the ability to install the R environment. You can download R from http://www.cran.r-project.org/. Detailed installation instructions are available in the first chapter.
The examples provided in this book were coded and tested with R version 3.2.4 on Microsoft Windows. The examples are likely to work with any recent version of R installed on either Mac OS X or a Unix-like OS.
R for Data Science Cookbook is intended for those who are already familiar with the basic operation of R, but want to learn how to efficiently and effectively analyze real-world data problems using practical R packages.
In this book, you will find several headings that appear frequently (Getting ready, How to do it, How it works, There's more, and See also).
To give clear instructions on how to complete a recipe, we use these sections as follows:
This section tells you what to expect in the recipe, and describes how to set up any software or any preliminary settings required for the recipe.
This section contains the steps required to follow the recipe.
This section usually consists of a detailed explanation of what happened in the previous section.
This section consists of additional information about the recipe in order to make the reader more knowledgeable about the recipe.
This section provides helpful links to other useful information for the recipe.
In this book, you will find a number of styles of text that distinguish between different kinds of information. Here are some examples of these styles, and an explanation of their meaning.
Package and function names are shown as follows: "You can then install and load the package RCurl."
A block of code is set as follows:
Any URL is written as follows:
http://data.worldbank.org/topic/economy-and-growth
Variable name, argument name, new terms and important words are shown in bold. Words that you see on the screen, in menus or dialog boxes for example, appear in the text like this: "In R, a missing value is noted with the symbol NA (not available), and an impossible value is NaN (not a number)."
Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.
To send us general feedback, simply e-mail <[email protected]>, and mention the book's title in the subject of your message.
If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.
Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.
You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
You can download the code files by following these steps:
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/R-for-Data-Science-Cookbook. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from https://www.packtpub.com/sites/default/files/downloads/RforDataScienceCookbook_ColorImages.pdf.
Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.
To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.
Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.
Please contact us at <[email protected]> with a link to the suspected pirated material.
We appreciate your help in protecting our authors and our ability to bring you valuable content.
If you have a problem with any aspect of this book, you can contact us at <[email protected]>, and we will do our best to address the problem.
This chapter covers the following topics:
R is the mainstream programming language of choice for data scientists. According to polls conducted by KDnuggets, a leading data analysis website, R ranked as the most popular language for analytics, data mining, and data science in the three most recent surveys (2012 to 2014). For many data scientists, R is more than just a programming language because the software also provides an interactive environment that can perform all types of data analysis.
R has many advantages in data manipulation and analysis, and the three most well-known are as follows:
R users will most likely agree that these advantages make complicated data analysis easier and more approachable. Notably, R also allows us to take the role of just a basic user or a developer. For an R user, we only need to know how a function works without requiring detailed knowledge of how it is implemented. Similarly to SPSS, we can perform various types of data analysis through R's interactive shell. On the other hand, as an R developer, we can write their function to create a new model, or they can even wrap implemented functions into a package.
Instead of explaining how to write an R program from scratch, the aim of this book is to cover how to become a developer in R. The main purpose of this chapter is to show users how to define their function to accelerate the analysis procedure. Starting with creating a function, this chapter covers the environment of R, and it explains how to create matching arguments. There is also content on how to perform functional programming in R, how to create advanced functions, such as infix operator and replacement, and how to handle errors and debug functions.
The R language is a collection of functions; a user can apply built-in functions from various packages to their project, or they can define a function for a particular purpose. In this recipe, we will show you how to create an R function.
If you are new to the R language, you can find a detailed introduction, language history, and functionality on the official R site (http://www.r-project.org/). When you are ready to download and install R, please connect to the comprehensive R archive network (http://cran.r-project.org/).
Perform the following steps in order to create your first R function:
Or, you can define your function without a return statement:
R functions are a block of organized and reusable statements, which makes programming less repetitive by allowing you to reuse code. Additionally, by modularizing statements within a function, your R code will become more readable and maintainable.
By following these steps, you can now create two addnum and addnum2 R functions, and you can successfully add two input arguments with either function. In R, the function usually takes the following form:
FunctionName is the name of the function, and arg1 and arg2 are arguments. Inside the curly braces, we can see the function body, where a body is a collection of a valid statement, expression, or assignment. At the bottom of the function, we can find the return statement, which passes expression back to the caller and exits the function.
The addnum function is in standard function syntax, which contains both body and return statement. However, you do not necessarily need to put a return statement at the end of the function. Similar to the addnum2 function, the function itself will return the last expression back to the caller.
If you want to view the composition of the function, simply type the function name on the interactive shell. You can also examine the body and formal arguments of the function further using the body and formal functions. Alternatively, you can use the args function to obtain the argument list of the function.
If you want to see the documentation of a function in R, you can use the help function or simply type ? in front of the function name. For example, if you want to examine the documentation of the sum function, you would do the following:
Besides the function name, body, and formal arguments, the environment is another basic component of a function. In a nutshell, the environment is where R manages and stores different types of variables. Besides the global environment, each function activates its environment whenever a new function is created. In this recipe, we will show you how the environment of each function works.
Ensure that you completed the previous recipes by installing R on your operating system.
Perform the following steps to work with the environment:
We can regard an R environment as a place to store and manage variables. That is, whenever we create an object or a function in R, we add an entry to the environment. By default, the top-level environment is the R_GlobalEnv global environment, and we can determine the current environment using the environment function. Then, we can use either .GlobalEnv or globalenv to print the global environment, and we can compare the environment with the identical function.
Besides the global environment, we can actually create our environment and assign variables into the new environment. In the example, we created the myenv environment and then assigned x <- 3 to myenv by placing a dollar sign after the environment name. This allows us to use the ls function to list all variables in myenv and global environment. At this point, we find x in myenv, but we can only find myenv in the global environment.
Moving on, we can determine the environment of a function. By creating a function called addnum, we can use environment to get its environment. As we created the function under global environment, the function obviously belongs to the global environment. On the other hand, when we get the environment of the lm function, we get the package name instead. That means that the lm function is in the namespace of the stat package.
Furthermore, we can print out the current environment inside a function. By invoking the addnum2 function, we can determine that the environment function outputs a different environment name from the global environment. That is, when we create a function, we also create a new environment for the global environment and link a pointer to its parent environment. To further examine this characteristic, we create another addnum3 function with a func1 nested function inside. At this point, we can print out the environment inside func1 and addnum3, and it is possible that they have completely different environments.
To get the parent environment, we can use theparent.env function. In the following example, we can see that the parent environment of parentenv is R_GlobalEnv:
Lexical scoping, also known as static binding, determines how a value binds to a free variable in a function. This is a key feature that originated from thescheme functional programming language, and it makes R different from S. In the following recipe, we will show you how lexical scoping works in R.
Ensure that you completed the previous recipes by installing R on your operating system.
Perform the following steps to understand how the scoping rule works:
There are two different types of variable binding methods: one is lexical binding, and the other is dynamic binding. Lexical binding is also called static binding in which every binding scope manages variable names and values in the lexical environment. That is, if a variable is lexically bound, it will search the binding of the nearest lexical environment. In contrast to this, dynamic binding keeps all variables and values in the global state. That is, if a variable is dynamically bound, it will bind to the most recently created variable.
To demonstrate how lexical binding works, we first create an x variable and assign 5 to x in the global environment. Then, we can create a function named tmpfunc. The function outputs x + 3 as the return value. Even though we do not assign any value to x within the tmpfunc function, x can still find the value of x as 5 in the global environment.
Next, we create another function named parentfunc. In this function, we assign x to 3 and create a childfunc nested function (a function defined within a function). At the bottom of the parentfunc body, we invoke childfunc as the function return. Here, we find that the function uses the x defined in parentfunc instead of the one defined outside parentfunc. This is because R searches the global environment for a matched symbol name, and then subsequently searches the namespace of packages on the search list.
Moving on, let's take a look at what will return if we create an x variable as a string in the global state and assign an x local variable to 5 within the function. When we invoke the localassign function, we discover that the function returns 5 instead of the string value. On the other hand, if we print out the value of x, we still see string in return. While the local variable and global variable have the same name, the assignment of the function does not alter the value of x in global state. If you need to revise the value of x in the global state, you can use the <<- notation instead.
In order to examine the search list (or path) of R, you can type search() to list the search list:
R functions evaluate arguments lazily; the arguments are evaluated as they are needed. Thus, lazy evaluation reduces the time needed for computation. In the following recipe, we will demonstrate how lazy evaluation works.
Ensure that you completed the previous recipes by installing R on your operating system.
Perform the following steps to see how lazy evaluation works:
R performs a lazy evaluation to evaluate an expression if its value is needed. This type of evaluation strategy has the following three advantages:
