This step-by-step guide demonstrates how to build simple-to-advanced applications through examples in R using modern tools.
This book is for aspiring data science professionals or statisticians who would like to learn about the R programming language in a practical manner. Basic programming knowledge is assumed.
R is a high-level statistical language and is widely used among statisticians and data miners to develop analytical applications. Often, data analysts with great analytical skills lack solid programming knowledge and are unfamiliar with the correct ways to use R. Based on R version 3.4, this book will help you develop strong fundamentals when working with R by taking you through a series of full representative examples, giving you a holistic view of R.
We begin with the basic installation and configuration of the R environment. As you progress through the exercises, you'll become thoroughly acquainted with R's features and its packages. With this book, you will learn about the basic concepts of R programming, work efficiently with graphs, create publication-ready and interactive 3D graphs, and gain a better understanding of the data at hand. The detailed step-by-step instructions will enable you to get a clean set of data, produce good visualizations, and create reports for the results. It also teaches you various methods to perform code profiling and performance enhancement with good programming practices, delegation, and parallelization.
By the end of this book, you will know how to efficiently work with data, create quality visualizations and reports, and develop code that is modular, expressive, and maintainable.
This is an easy-to-understand guide filled with real-world examples, giving you a holistic view of R and practical, hands-on experience.
Page count: 645
Year of publication: 2017
BIRMINGHAM - MUMBAI
Copyright © 2017 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: December 2017
Production reference: 1201217
ISBN 978-1-78829-254-2
www.packtpub.com
Author: Omar Trejo Navarro
Copy Editor: Pranjali Chury
Reviewer: Peter C. Figliozzi
Project Coordinator: Vaidehi Sawant
Commissioning Editor: Merint Mathew
Proofreader: Safis Editing
Acquisition Editor: Karan Sadawana
Indexer: Tejal Daruwale Soni
Content Development Editor: Rohit Kumar Singh
Graphics: Jason Monteiro
Technical Editor: Ruvika Rao
Production Coordinator: Shraddha Falebhai
Omar Trejo Navarro is a data consultant. He co-founded Datata (datata.mx), is actively working on CVEST (cvest.tech), and maintains a personal website (otrenav.com). He is an applied mathematics and economics double major from ITAM (itam.mx) in Mexico City, where he continues to work as a research assistant. He does software development with a focus on data platforms, data science, and web applications. He has worked with clients from all over the world, and is a keen supporter of open source, open data, and open science in general. He can be reached through his personal website (otrenav.com).
Peter C. Figliozzi, PhD, is a professional data scientist and software developer. He works on problems in many areas, including anomaly detection, automated trading, and fraud prevention. Peter uses R through RStudio for ad hoc analysis, modeling, and visualization.
For support files and downloads related to your book, please visit www.PacktPub.com.
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.
https://www.packtpub.com/mapt
Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.
Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser
Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial process. To help us improve, please leave us an honest review on this book's Amazon page at https://www.amazon.com/dp/1788292545.
If you'd like to join our team of regular reviewers, you can email us at [email protected]. We award our regular reviewers with free eBooks and videos in exchange for their valuable feedback. Help us be relentless in improving our products!
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Downloading the color images of this book
Errata
Piracy
Questions
Introduction to R
What R is and what it isn't
The inspiration for R – the S language
R is a high quality statistical computing system
R is a flexible programming language
R is free, as in freedom and as in free beer
What R is not good for
Comparing R with other software
The interpreter and the console
Tools to work efficiently with R
Pick an IDE or a powerful editor
The send to console functionality
The efficient write-execute loop
Executing R code in non-interactive sessions
How to use this book
Tracking state with symbols and variables
Working with data types and data structures
Numerics
Special values
Characters
Logicals
Vectors
Factors
Matrices
Lists
Data frames
Divide and conquer with functions
Optional arguments
Functions as arguments
Operators are functions
Coercion
Complex logic with control structures
If… else conditionals
For loops
While loops
The examples in this book
Summary
Understanding Votes with Descriptive Statistics
This chapter's required packages
The Brexit votes example
Cleaning and setting up the data
Summarizing the data into a data frame
Getting intuition with graphs and correlations
Visualizing variable distributions
Using matrix scatter plots for a quick overview
Getting a better look with detailed scatter plots
Understanding interactions with correlations
Creating a new dataset with what we've learned
Building new variables with principal components
Putting it all together into high-quality code
Planning before programming
Understanding the fundamentals of high-quality code
Programming by visualizing the big picture
Summary
Predicting Votes with Linear Models
Required packages
Setting up the data
Training and testing datasets
Predicting votes with linear models
Checking model assumptions
Checking linearity with scatter plots
Checking normality with histograms and quantile-quantile plots
Checking homoscedasticity with residual plots
Checking no collinearity with correlations
Measuring accuracy with score functions
Programmatically finding the best model
Generating model combinations
Predicting votes from wards with unknown data
Summary
Simulating Sales Data and Working with Databases
Required packages
Designing our data tables
The basic variables
Simplifying assumptions
Potential pitfalls
The too-much-empty-space problem
The too-much-repeated-data problem
Simulating the sales data
Simulating numeric data according to distribution assumptions
Simulating categorical values using factors
Simulating dates within a range
Simulating numbers under shared restrictions
Simulating strings for complex identifiers
Putting everything together
Simulating the client data
Simulating the client messages data
Working with relational databases
Summary
Communicating Sales with Visualizations
Required packages
Extending our data with profit metrics
Building blocks for reusable high-quality graphs
Starting with simple applications for bar graphs
Adding a third dimension with colors
Graphing top performers with bar graphs
Graphing disaggregated data with boxplots
Scatter plots with joint and marginal distributions
Pricing and profitability by protein source and continent
Client birth dates, gender, and ratings
Developing our own graph type – radar graphs
Exploring with interactive 3D scatter plots
Looking at dynamic data with time-series
Looking at geographical data with static maps
Navigating geographical data with interactive maps
Maps you can navigate and zoom-in to
High-tech-looking interactive globe
Summary
Understanding Reviews with Text Analysis
This chapter's required packages
What is text analysis and how does it work?
Preparing, training, and testing data
Building the corpus with tokenization and data cleaning
Document feature matrices
Training models with cross validation
Training our first predictive model
Improving speed with parallelization
Computing predictive accuracy and confusion matrices
Improving our results with TF-IDF
Adding flexibility with N-grams
Reducing dimensionality with SVD
Extending our analysis with cosine similarity
Digging deeper with sentiment analysis
Testing our predictive model with unseen data
Retrieving text data from Twitter
Summary
Developing Automatic Presentations
Required packages
Why invest in automation?
Literate programming as a content creation methodology
Reproducibility as a benefit of literate programming
The basic tools for an automation pipeline
A gentle introduction to Markdown
Text
Headers
Header Level 1
Header Level 2
Header Level 3
Header Level 4
Lists
Tables
Links
Images
Quotes
Code
Mathematics
Extending Markdown with R Markdown
Code chunks
Tables
Graphs
Chunk options
Global chunk options
Caching
Producing the final output with knitr
Developing graphs and analysis as we normally would
Building our presentation with R Markdown
Summary
Object-Oriented System to Track Cryptocurrencies
This chapter's required packages
The cryptocurrencies example
A brief introduction to object-oriented programming
The purpose of object-oriented programming
Important concepts behind object-oriented languages
Encapsulation
Polymorphism
Hierarchies
Classes and constructors
Public and private methods
Interfaces, factories, and patterns in general
Introducing three object models in R – S3, S4, and R6
The first source of confusion – various object models
The second source of confusion – generic functions
The S3 object model
Classes, constructors, and composition
Public methods and polymorphism
Encapsulation and mutability
Inheritance
The S4 object model
Classes, constructors, and composition
Public methods and polymorphism
Encapsulation and mutability
Inheritance
The R6 object model
Classes, constructors, and composition
Public methods and polymorphism
Encapsulation and mutability
Inheritance
Active bindings
Finalizers
The architecture behind our cryptocurrencies system
Starting simple with timestamps using S3 classes
Implementing cryptocurrency assets using S4 classes
Implementing our storage layer with R6 classes
Communicating available behavior with a database interface
Implementing a database-like storage system with CSV files
Easily allowing new database integration with a factory
Encapsulating multiple databases with a storage layer
Retrieving live data for markets and wallets with R6 classes
Creating a very simple requester to isolate API calls
Developing our exchanges infrastructure
Developing our wallets infrastructure
Implementing our wallet requesters
Finally introducing users with S3 classes
Helping ourselves with a centralized settings file
Saving our initial user data into the system
Activating our system with two simple functions
Some advice when working with object-oriented systems
Summary
Implementing an Efficient Simple Moving Average
Required packages
Starting by using good algorithms
Just how much impact can algorithm selection have?
How fast is fast enough?
Calculating simple moving averages inefficiently
Simulating the time-series
Our first (very inefficient) attempt at an SMA
Understanding why R can be slow
Object immutability
Interpreted dynamic typings
Memory-bound processes
Single-threaded processes
Measuring by profiling and benchmarking
Profiling fundamentals with Rprof()
Benchmarking manually with system.time()
Benchmarking automatically with microbenchmark()
Easily achieving high benefit-cost improvements
Using the simple data structure for the job
Vectorizing as much as possible
Removing unnecessary logic
Moving checks out of iterative processes
If you can, avoid iterating at all
Using R's way of iterating efficiently
Avoiding sending data structures with overheads
Using parallelization to divide and conquer
How deep does the parallelization rabbit hole go?
Practical parallelization with R
Using C++ and Fortran to accelerate calculations
Using an old-school approach with Fortran
Using a modern approach with C++
Looking back at what we have achieved
Other topics of interest to enhance performance
Preallocating memory to avoid duplication
Making R code a bit faster with byte code compilation
Just-in-time (JIT) compilation of R code
Using memoization or cache layers
Improving our data and memory management
Using specialized packages for performance
Flexibility and power with cloud computing
Specialized R distributions
Summary
Adding Interactivity with Dashboards
Required packages
Introducing the Shiny application architecture and reactivity
What is functional reactive programming and why is it useful?
How is functional reactivity handled within Shiny?
The building blocks for reactivity in Shiny
The input, output, and rendering functions
Designing our high-level application structure
Setting up a two-column distribution
Introducing sections with panels
Inserting a dynamic data table
Introducing interactivity with user input
Setting up static user inputs
Setting up dynamic options in a drop-down
Setting up dynamic input panels
Adding a summary table with shared data
Adding a simple moving average graph
Adding interactivity with a secondary zoom-in graph
Styling our application with themes
Other topics of interest
Adding static images
Adding HTML to your web application
Adding custom CSS styling
Sharing your newly created application
Summary
Required Packages
External requirements – software outside of R
Dependencies for the RMySQL R package
Ubuntu 17.10
macOS High Sierra
Setting up user/password in both Linux and macOS
Dependencies for the rgl and rgdal R packages
Ubuntu 17.10
macOS High Sierra
Dependencies for the Rcpp package and the .Fortran() function
Ubuntu 17.10
macOS High Sierra
Internal requirements – R packages
Loading R packages
In a world where data is becoming increasingly important, data analysts, scientists, and business people need tools to analyze and process large volumes of data efficiently. This book is my attempt to pass on what I've learned so far, so that you can quickly become an effective and efficient R programmer. Reading it will help you understand how to use R to solve complex problems, avoid some of the mistakes I've made, and teach you useful techniques that can be helpful in a variety of contexts. In the process, I hope to show you that, despite its uncommon aspects, R is an elegant and powerful language, and is well suited for data analysis and statistics, as well as complex systems.
After reading this book, you will be familiar with R's fundamentals, as well as some of its advanced features. You will understand data structures, and you will know how to efficiently deal with them. You will also understand how to design complex systems that perform efficiently, and how to make these systems usable by other people through web applications. At a lower level, you will understand how to work with object-oriented programming, functional programming, and reactive programming, and what code may be better written in each of these paradigms. You will learn how to use various cutting-edge tools that R provides to develop software, how to identify performance bottlenecks, and how to fix them, possibly using other programming languages such as Fortran and C++. Finally, you will be comfortable reading and understanding the majority of R code, as well as providing feedback on others' code.
Chapter 1, Introduction to R, covers the R basics you need to understand the rest of the examples. It is not meant to be a thorough introduction to R. Rather, it's meant to give you the very basic concepts and techniques you need to quickly get started with the three examples contained in the book, and which I introduce next.
This book uses three examples to showcase R's wide range of functionality. The first example shows how to analyze votes with descriptive statistics and linear models, and it is presented in Chapter 2, Understanding Votes with Descriptive Statistics and Chapter 3, Predicting Votes with Linear Models.
Chapter 2, Understanding Votes with Descriptive Statistics, shows how to programmatically create hundreds of graphs to identify relations within data visually. It shows how to create histograms, scatter plots, and correlation matrices, and how to perform Principal Component Analysis (PCA).
Chapter 3, Predicting Votes with Linear Models, shows how to programmatically find the best predictive linear model for a set of data according to different success metrics. It also shows how to check model assumptions, and how to use cross validation to increase confidence in your results.
The second example shows how to simulate data, visualize it, analyze its text components, and create automatic presentations with it.
Chapter 4, Simulating Sales Data and Working with Databases, shows how to design a data schema and simulate the various types of data. It also shows how to integrate real text data with simulated data, and how to use a SQL database to access it more efficiently.
Chapter 5, Communicating Sales with Visualizations, shows how to produce graphs ranging from basic to advanced and highly customized. It also shows how to create dynamic 3D graphs and interactive maps.
Chapter 6, Understanding Reviews with Text Analysis, shows how to perform text analysis step by step using Natural Language Processing (NLP) techniques, as well as sentiment analysis.
Chapter 7, Developing Automatic Presentations, shows how to put together the results of previous chapters to create presentations that can be automatically updated with the latest data using tools such as knitr and R Markdown.
Finally, the third example shows how to design and develop complex object-oriented systems that retrieve real-time data from cryptocurrency markets, as well as how to optimize implementations and how to build web applications around such systems.
Chapter 8, Object-Oriented System to Track Cryptocurrencies, introduces basic object-oriented techniques that produce complex systems when combined. Furthermore, it shows how to work with three of R’s most used object models, which are S3, S4, and R6, as well as how to make them work together.
Chapter 9, Implementing an Efficient Simple Moving Average, shows how to iteratively improve an implementation for a Simple Moving Average (SMA), starting with what is considered to be bad code, all the way to advanced optimization techniques using parallelization, and delegation to the Fortran and C++ languages.
Chapter 10, Adding Interactivity with Dashboards, shows how to wrap what was built during the previous two chapters to produce a modern web application using reactive programming through the Shiny package.
Appendix, Required Packages, shows how to install the internal and external software necessary to replicate the examples in the book. Specifically, it will walk through the installation processes for Linux and macOS, but Windows follows similar principles and should not cause any problems.
This book was written in a Linux environment (specifically Ubuntu 17.10), and was also tested on macOS High Sierra. Even though it was not tested on a Windows computer, all of the R code presented in this book should work fine with one. The only substantial difference is that when I show you how to perform a task using a Terminal, it will be the bash terminal, which is available in Linux and macOS by default. In the case of Windows, you will need to use the cmd.exe terminal, for which you can find a lot of information online. Keep in mind that if you're using a Windows computer, you should be prepared to do a bit more research on your end to replicate the same functionality, but you should not have much trouble at all.
In the appendix, I show you how to install the software you need to replicate the examples shown in this book. I show you how to do so for Linux and macOS, specifically Ubuntu 17.10 and High Sierra. If you're using Windows, the same principles apply but the specifics may be a bit different. However, I'm sure it will not be too hard in any case.
There are two types of requirements you need to be able to execute all the code in this book: external and internal. Software outside of R is what I call external requirements. Software inside of R, meaning R packages, is what I refer to as internal requirements. I walk you through the installation of both of them in the appendix.
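As a rough illustration of the internal requirements, the following sketch checks which R packages are still missing from an installation. The helper name missing_packages is hypothetical and does not come from the book; it only relies on base R's requireNamespace() function.

```r
# Hypothetical helper (not from the book): return the names of the
# requested packages that cannot be loaded in this R installation.
missing_packages <- function(packages) {
    installed <- vapply(
        packages,
        function(package) requireNamespace(package, quietly = TRUE),
        logical(1)
    )
    packages[!installed]
}

# "stats" and "utils" ship with R itself, so neither should be reported.
missing_packages(c("stats", "utils"))

# Internal requirements could then be installed on demand, for example:
# to_install <- missing_packages(c("Rcpp", "shiny"))
# if (length(to_install) > 0) install.packages(to_install)
```

External requirements, such as system libraries, cannot be detected this way and still need to be installed through the operating system.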
This book is for those who wish to develop software in R. You don't need to be an expert or professional programmer to follow this book, but you do need to be interested in learning how R works. My hope is that this book is useful for people ranging from beginners to advanced by providing hands-on examples that may help you understand R in ways you previously did not.
I assume basic programming, mathematical, and statistical knowledge, because there are various parts in the book where concepts from these disciplines will be used, and they will not be explained in detail. If you have programmed something yourself in any programming language, know basic linear algebra and statistics, and know what linear regression is, you have everything you need to understand this book.
This book was written for people in a variety of contexts and with diverse profiles. For example, if you are an analyst employed by an organization that requires you to do frequent data processing to produce reports on a regular basis, and you need to develop programs to automate such tasks, this book is for you. If you are an academic researcher who wants to use current techniques, combine them, and develop tools to test them automatically, this book is for you. If you're a professional programmer looking for ways to take advantage of advanced R features, this book is for you. Finally, if you're preparing for a future in which data will be of paramount importance (it already is), this book is for you.
Feedback from our readers is always welcome. Let us know what you think about this book: what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of. To send us general feedback, simply email [email protected], and mention the book's title in the subject of your message. If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.
Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.
You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files emailed directly to you. You can download the code files by following these steps:
Log in or register to our website using your email address and password.
Hover the mouse pointer on the SUPPORT tab at the top.
Click on Code Downloads & Errata.
Enter the name of the book in the Search box.
Select the book for which you're looking to download the code files.
Choose from the drop-down menu where you purchased this book from.
Click on Code Download.
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
WinRAR / 7-Zip for Windows
Zipeg / iZip / UnRarX for Mac
7-Zip / PeaZip for Linux
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/R-Programming-By-Example. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from https://www.packtpub.com/sites/default/files/downloads/RProgrammingByExample_ColorImages.pdf.
Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books (maybe a mistake in the text or the code), we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.
To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.
Piracy of copyrighted material on the internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the internet, please provide us with the location address or website name immediately so that we can pursue a remedy.
Please contact us at [email protected] with a link to the suspected pirated material. We appreciate your help in protecting our authors and our ability to bring you valuable content.
If you have a problem with any aspect of this book, you can contact us at [email protected], and we will do our best to address the problem.
In a world where data is becoming increasingly important, business people and scientists need tools to analyze and process large volumes of data efficiently. R is one of the tools that has become increasingly popular in recent years for data processing, statistical analysis, and data science, and while R has its roots in academia, it is now used by organizations across a wide range of industries and geographical areas.
Some of the important topics covered in this chapter are as follows:
History of R and why it was designed the way it was
What the interpreter and the console are and how to use them
How to work with basic data types and data structures of R
How to divide work by using functions in different ways
How to introduce complex logic with control structures
When it comes to choosing software for statistical computing, it's tough to argue against R. Who could dislike a high quality, cross-platform, open source, statistical software product? It has an interactive console for exploratory work. It can run as a scripting language to replicate processes. It has a lot of statistical models built in, so you don't have to reinvent the wheel, but when the base toolset is not enough, you have access to a rich ecosystem of external packages. And, it's free! No wonder R has become a favorite in the age of data.
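To give a feel for the built-in statistical models mentioned above, here is a minimal sketch that fits a linear regression using only what ships with base R. The choice of the bundled mtcars dataset and of the variables mpg and wt is an assumption made for illustration, not an example taken from the book.

```r
# mtcars is a small dataset bundled with R (in the "datasets" package).
# Fit a linear model of fuel efficiency (mpg) as a function of weight (wt).
model <- lm(mpg ~ wt, data = mtcars)

# Inspect the estimated intercept and slope.
coefficients(model)

# Proportion of the variance in mpg explained by the model.
summary(model)$r.squared
```

The same lines can be typed one at a time in the interactive console or saved in a script file and run non-interactively.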
R was inspired by the S statistical language developed by John Chambers at AT&T. The name S is an allusion to another one-letter-name programming language also developed at AT&T, the famous C language. R was created by Ross Ihaka and Robert Gentleman in the Department of Statistics at the University of Auckland in 1991.
The general S philosophy sets the stage for the design of the R language itself, which many programmers coming from other programming languages find somewhat odd and confusing. In particular, it's important to realize that S was developed to make data analysis as easy as possible.
A key idea in that philosophy is the transition from analyst to developer. The designers of S wanted to build a language that could easily serve both types of users: one that would be suitable for interactive data analysis through a command line, but which could also be used to program complex systems, like traditional programming languages.
It's no coincidence that this book is structured that way. We will start doing data analysis first, and we will gradually move toward developing a full and complex system for information retrieval with a web application on top.
R is comparable, and often superior, to commercial products when it comes to programming capabilities, complex systems development, graphic production, and community ecosystems. Researchers in statistics and machine learning, as well as many other data-related disciplines, will often publish R packages to accompany their publications. This translates into immediate public access to the very latest statistical techniques and implementations. Whatever model or graphic you're trying to develop, chances are that someone has already tried it, and even if they did not fully succeed, you can at least learn from their efforts.
As we have seen, in addition to providing statistical tools, R is a general-purpose programming language. You can use R to extend its own functionality, automate processes that make use of complex systems, and many other things. It incorporates features from other object-oriented programming languages and has strong foundations for functional programming, which is well suited for solving many of the challenges of data analysis. R allows the user to write powerful, concise, and descriptive code.
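The functional side of R described above can be sketched in a few lines. The names apply_to_each and square are hypothetical and chosen only for this illustration; vapply() and Reduce() are base R functions.

```r
# Functions are ordinary values, so they can be passed as arguments.
# apply_to_each is a hypothetical helper written for this sketch.
apply_to_each <- function(values, f) {
    vapply(values, f, numeric(1))
}

square <- function(x) x^2

squares <- apply_to_each(c(1, 2, 3), square)
squares                                   # 1 4 9

# Operators are functions too, so they can be called or passed like any other.
`+`(2, 3)                                 # 5
total <- Reduce(`+`, c(1, 2, 3, 4))
total                                     # 10
```

Passing behavior around as values in this way is one reason short R programs can stay concise and descriptive.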
In many ways, a language is successful inasmuch as it creates a platform with which many people can create new things, and R has proven to be very successful in this regard. One key limitation of the S language was that it was only available in a commercial package, but R is free software. Free as in freedom, and free as in free beer.
The copyright for the primary source code for R is held by the R Foundation and is published under the GNU General Public License (GPL). According to the Free Software Foundation (http://www.fsf.org/), with free software (free as in freedom) you are granted the following four freedoms:
Freedom 0: Run the program for any purpose
Freedom 1: Study how the program works and adapt it to your needs
Freedom 2: Redistribute copies so you can help your neighbor
Freedom 3: Improve the program and release your improvements to the public
These freedoms have allowed R to develop strong, prolific communities that include world-class statisticians and programmers as well as many volunteers, who help improve and extend the language. They also allow R to be developed and maintained for all popular operating systems, and to be easily used by individuals and organizations who wish to do so, possibly sharing their findings in a way that lets others replicate their results. Such is the power of free software.
No programming language or system is perfect. R certainly has a number of drawbacks, the most common being that it can be painfully slow (when not used correctly). Keep in mind that R is essentially based on 40-year-old technology, going back to the original S system developed at Bell Labs. Therefore, several of its imperfections come from the fact that it was not built in anticipation for the data age we live in now. When R was born, disk and RAM were very expensive and the internet was just getting started. Notions of large-scale data analysis and high-performance computing were rare.
Fast-forward to the present, hardware cost is just a fraction of what it used to be, computing power is available online for pennies, and everyone is interested in collecting and analyzing data at large scale. This surge in data analysis has brought to the forefront two of R's fundamental limitations, the fact that it's single-threaded and memory-bound. These two characteristics drastically slow it down. Furthermore, R is an interpreted dynamically typed language, which can make it even slower. And finally, R has object immutability and various ways to implement object-oriented programming, both of which can make it hard for people, specially those coming from other languages, to write high-quality code if they don't know how to deal with them. You should know that all of the characteristics mentioned in this paragraph are addressed in Chapter 9, Implementing an Efficient Simple Moving Average.
A double-edged sword in R, is that most of its users do not think of themselves as programmers, and are more concerned with results than with process (which is not necessarily a bad thing). This means that much of the R code you can find online is written without regard for elegance, speed, or readability, since most R users do not revise their code to address these shortcomings. This permeates into code that is patchy and not rigorously tested, which in turn produces many edge cases that you must take into account when using low-quality packages. You will do well to keep this in mind.
My intention for this section is not to provide a comprehensive comparison between R and other software, but to simply point out a few of R's most noticeable features. If you can, I encourage you to test other software yourself so that you know first-hand what may be the best tool for the job at hand.
The most noticeable feature of R compared to other statistical software such as SAS, Stata, SPSS, and even Python, is the very large number of packages that it has available. At the time of writing this, there are almost 12,000 packages published in The Comprehensive R Archive Network (CRAN) (https://cran.r-project.org/), and this does not include packages published in other places, such as Git repositories. This enables R to have a very large community and a huge number of tools for data analysis in areas such as finance, mathematics, machine learning, high-performance computing, and many others.
With the exception of Python, R has far more programming capabilities than SAS, Stata, and SPSS, and in some respects even more than Python (for example, in R, you may use different object models). However, efficient and effective R usage requires writing code, which implies a steep learning curve for some people, while Stata and SPSS have graphical user interfaces that guide the user through many of the tasks with point-and-click wizards. In my opinion, this hand-holding, although nice for beginners, quickly becomes an important restriction for people looking to become intermediate to advanced users, and that's where the advantage of programming really shines.
R has one of the best graphics systems among all existing software. The most popular package for producing graphs in R, which we will use extensively in this book, is the ggplot2 package, but there are many other great graphing packages as well. The ggplot2 package allows the modification of virtually every aspect of a graph through its graphics grammar, and is far superior to anything I've seen in SPSS, Stata, SAS, or even Python.
R is a great tool, but it's not the right tool for everything. If you're looking to perform data analysis but don't want to invest the time in learning to program, then software like SAS, Stata, or SPSS may be a better option for you. If you're looking to develop analytical software that is very easily integrated into larger systems and which needs to plug into various interfaces, then Python may be a better tool for the job. However, if you're looking to do a lot of complex data analysis and graphing, and you are going to mostly spend your time focused on these areas, then R is a great choice.
As I mentioned earlier, R is an interpreted language. When you enter an expression into the R console or execute an R script in your operating system's terminal, a program called the interpreter parses and executes the code. Other examples of interpreted languages are Lisp, Python, and JavaScript. Unlike C, C++, and Java, R doesn't require you to explicitly compile your programs before you execute them.
All R programs are composed of a series of expressions. The interpreter begins by parsing each expression, substitutes objects for symbols where appropriate, evaluates the expression, and finally returns the resulting objects. We will define each of these concepts in the following sections, but you should understand that this is the basic process that all R programs go through.
The R console is the most important tool for using R and can be thought of as a wrapper around the interpreter. The console is a tool that allows you to type expressions directly into R and see how it responds. The interpreter will read the expressions and respond with a result or an error message, if there was one. When you execute expressions through the console, the interpreter will pass objects to the print() function automatically, which is why you can see the result printed below your expressions (we'll cover more on functions later).
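The parse, evaluate, and print steps described above can be reproduced by hand with the base R functions parse(), eval(), and print(). The following is only an illustrative sketch of what the console does for you, not a description of how the interpreter is implemented internally, and the variable names parsed and result are hypothetical:

```r
# Parse a piece of source text into an unevaluated expression
parsed <- parse(text = "1 + 2")[[1]]
class(parsed)
#> [1] "call"

# Evaluate the expression to obtain the resulting object
result <- eval(parsed)

# The console calls print() for you; here we do it explicitly
print(result)
#> [1] 3
```

Separating the steps this way makes it easier to see that the text you type and the object you get back are two different things.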
If you've used a command line before (for example, bash in Linux or macOS, or cmd.exe in Windows) or a language with an interactive interpreter such as Lisp, Python, or JavaScript, the console should look familiar since it simply is a command-line interface. If not, don't worry. Command-line interfaces are simple-to-use tools. They are programs that receive code and return objects whose printed representations you see below the code you execute.
When you launch R, you will see a window with the R console. Inside the console you will see a message like the one shown below. This message displays some basic information, including the version of R you're running, license information, reminders about how to get help, and a Command Prompt.
You should note that, by default, R will display a greater-than sign (>) at the beginning of the last line of the console, signaling you that it's ready to receive commands. Since R is prompting you to type something, this is called a Command Prompt. When you see the greater-than symbol, R is able to receive more expressions as input. When you don't, it is probably because R is busy processing something you sent, and you should wait for it to finish before sending something else.
For the most part, in this book we will avoid showing such command prompts at all, since you may be typing the code into a source code file or directly into the console, but if we do introduce it, make sure that you don't explicitly type it. For example, if you want to replicate the following snippet, you should only type 1 + 2 in your console, and press the Enter key. When you do, you will see a [1] 3 which is the output you received back from R. Go ahead and execute various arithmetic expressions to get a feel for the console:
> 1 + 2
[1] 3
Finally, you should know that the console provides tools for looking through previous commands. You will probably find that the up and down arrow keys are the most useful. You can scroll through previous commands by pressing them. The up arrow lets you look at earlier commands, and the down arrow lets you look at later commands. If you would like to repeat a previous command with a minor change, or if you need to correct a mistake, you can easily do so using these keys.
In this section we discuss the tools that will help us when working with R.
For efficient code development, you may want to try a more powerful editor or an Integrated Development Environment (IDE). The most popular IDE for R is RStudio (https://www.rstudio.com/). It offers an impressive feature set that makes interacting with R much easier. If you're new to R, and programming in general, this is probably the way to go. As you can see in the image below, it wraps the console (right side) within a larger application which offers a lot of functionality, and in this case, it is displaying the help system (left side). Furthermore, RStudio offers tabs to navigate files, browse installed packages, and visualize plots, among other features, as well as a large number of configuration options under the top menu dropdowns.
Throughout this book, we will not use any functionality provided by RStudio. All I will show you is pure R functionality. I decided to proceed this way to make sure that the book is useful for any R programmer, including those who do not use RStudio. For RStudio users, this means that there may be easier ways to accomplish some of the tasks I will show, and instead of programming a few lines, you could simply click some buttons. If that's something you prefer, I encourage you to take a look through the excellent RStudio Essential webinars, which can be found on RStudio's website at https://www.rstudio.com/resources/webinars/?wvideo=lxel3j2kos, as well as Stanford's Introduction to R, RStudio (https://web.stanford.edu/class/stats101/intro/intro-lab01.html).
Experienced programmers may prefer to work with other tools they already know and love and have used for many years. For example, in my case, I prefer to use Emacs (https://www.gnu.org/software/emacs/) for any programming I do. Emacs is a very powerful text editor that you can programmatically extend to work the way you want it to by using a programming language known as Elisp, which is a Lisp dialect. In case you use Emacs too, the ess package is all you really need.
If you're going to use Emacs, I encourage you to take a look through the ess package's documentation (https://ess.r-project.org/Manual/ess.html) and Johnson's presentation titled Emacs Has No Learning Curve, University of Kansas, 2015 (http://pj.freefaculty.org/guides/Rcourse/emacs-ess/emacs-ess.pdf). If you use Vim, Sublime Text, Atom, or other similar tools, I'm confident you can find useful packages as well.
The base R installation provides the console environment we mentioned in the previous section. This console is really all you need to work with R, but it will quickly become cumbersome to type everything directly into it and it may not be your best option. To efficiently work with R, you need to be able to experiment and iterate as fast as you can. Doing so will accelerate your learning curve and productivity.
Whichever tool you use, the key functionality you need is to be able to easily send code snippets into the console without having to type them yourself, or copying them from your editor and pasting them into the console. In RStudio, you can accomplish this by clicking on the Run or Source button in the top-right corner of the code editor panel. In Emacs, you may use the ess-eval-region command.
One of the most productive ways to work with R, especially when learning it, is to use the write-execute loop, which makes use of the send to console functionality mentioned in the previous section. This will allow you to do two very important things: develop your code through small and quick iterations, which allow you to see step-by-step progress until you converge to the behavior you seek, and save the code you converged to as your final result, which can be easily reproduced using the source code file you used for your iterations. R source code files use the .R extension.
Assuming you have a source code file ready to send expressions to the console, the basic steps through the write-execute loop are as follows:
1. Define what behavior you're looking to implement with code.
2. Write the minimal amount of code necessary to achieve one piece of the behavior you seek in your implementation.
3. Use the send to console functionality to verify that the result in the console is what you expected, and if it's not, to identify possible causes.
4. If it's not what you expected, go back to the second step with the purpose of fixing the code until it has the intended piece of behavior.
5. If it's what you expected, go back to the second step with the purpose of extending the code with another piece of the behavior, until convergence.
This write-execute loop will become second nature to you as you start using it, and when it does, you'll be a more productive R programmer. It will allow you to diagnose issues faster, to quickly experiment with a few ways of accomplishing the same behavior to find which one seems best for your context, and once you have working code, it will also allow you to clean your implementation to keep the same behavior but have better or more readable code.
I encourage you to use this write-execute loop to work through the examples presented in this book. At times, we will show step-by-step progress so that you understand the code better, but it's practically impossible to show all of the write-execute loop iterations I went through to develop it, and much of the knowledge you can acquire comes from iterating this way.
Once your code has the functionality you were looking to implement, executing it through an interactive session using the console may not be the best way to do so. In such cases, another option you have is to tell your computer to directly execute the code for you, in a non-interactive session. This means that you won't be able to type commands into the console, but you'll get the benefit of being able to configure your computer to automatically execute code for you, or to integrate it into larger systems where R is only one of many components. This is known as batch mode.
To execute code in batch mode, you have two options: the old R CMD BATCH command, which we won't look into, and the newer Rscript command, which we will. Rscript is a command that you can execute within your computer's terminal. It receives the name of a source code file and executes its contents.
Suppose you have the following code in a file named greeting.R. It gets the arguments passed through the command line to Rscript through the args object created with the commandArgs() function, assigns the corresponding values to the greeting and name variables, and finally prints a vector that contains those values.
args <- commandArgs(TRUE)
greeting <- args[1]
name <- args[2]
print(c(greeting, name))
Once ready, you may use the Rscript command to execute it from your Terminal (not from within your R console) as is shown ahead. The result shows the vector with the greeting and name variable values you passed it.
$ Rscript greeting.R Hi John
[1] "Hi" "John"
Note that if you simply execute the file without any arguments, the args vector will be empty, so indexing it returns NA values, which allows you to customize your code to deal with such situations:
$ Rscript greeting.R
[1] NA NA
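Because missing arguments show up as NA, a script can check for them and fall back to defaults. The following is a minimal sketch, assuming a hypothetical helper named build_greeting() and default values ("Hello" and "world") chosen only for illustration. It receives the argument vector as a parameter so that you can also try it from the console:

```r
# Hypothetical helper: builds a greeting from a character vector of
# command-line arguments, falling back to defaults for missing values
build_greeting <- function(args) {
    greeting <- args[1]
    name <- args[2]
    if (is.na(greeting)) greeting <- "Hello"
    if (is.na(name)) name <- "world"
    paste(greeting, name)
}

# In a script executed with Rscript, pass the real arguments
args <- commandArgs(trailingOnly = TRUE)
print(build_greeting(args))

# From the console, you can simulate different invocations
build_greeting(c("Hi", "John"))
#> [1] "Hi John"
build_greeting(character(0))
#> [1] "Hello world"
```

Keeping the logic in a function that takes the arguments explicitly, instead of reading them deep inside the script, makes this kind of code easier to test with the write-execute loop.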
This was a very simple example, but the same mechanism can be used to execute much more complex systems, like the one we will build in the final chapters of this book to constantly retrieve real-time price data from remote servers.
Finally, if you want to provide a mechanism that is closer to the one in Python, you may want to look into the optparse package to create command-line help pages as well as to parse arguments.
To make the most out of this book, you should recreate on your own the examples shown throughout, and make sure that you understand what each of them is doing in detail. If at some point you feel confused, it's not too difficult to do a couple of searches online to clarify things for yourself. However, I highly recommend that you look into the following books as well, which go into more detail on some of the concepts and ideas presented in this book, and are considered very good references for R programmers:
R in a Nutshell, by Adler, O'Reilly, 2010
The Art of R Programming, by Matloff, No Starch Press, 2011
Advanced R, by Wickham, CRC Press, 2015
R Programming for Data Science, by Peng, LeanPub, 2016
Sometimes all you need to do to clarify something is use R's help system. To get help on a function, you may use the question mark notation, like ?function_name, but in case you want to search for help on a topic, you may use the help.search() function with the topic as a quoted string, like help.search("regression"). This can be helpful if you know what topic you're interested in but can't remember the actual name of the function you want to use. Another way of invoking such functionality is using the double question mark notation, like ??regression.
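As a quick reference, the following sketch collects these ways of asking for help, using the mean() function and the regression topic only as examples. The calls that open help pages are wrapped in a check for an interactive session, which is my own precaution so that the snippet can also be executed in batch mode. The apropos() function, which is not mentioned above, is a base R tool that lists objects whose names match a pattern:

```r
# These calls open help pages, so they only run in an interactive session
if (interactive()) {
    ?mean                        # help page for a specific function
    help("mean")                 # equivalent to the previous line
    help.search("regression")    # search the help system by topic
    ??regression                 # shorthand for the previous line
}

# List objects whose names start with "read.csv"; the result includes
# the read.csv() and read.csv2() functions
apropos("^read\\.csv")
```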
Keep in mind that topics in this book are interconnected and not linearly ordered, which means that at times it will seem that we are jumping around. When that happens, it's because a topic can be seen through different points of view. That's why, to make the most out of this book, you should experiment as much as you can in the console and build code progressively using the write-execute loop mentioned earlier. If you simply replicate the code exactly as is shown, you may miss some of the learning that you could have gotten had you built the systems step by step.
Finally, you should know that this book is meant to show how to use R through somewhat real examples, and as such, does not provide too much technical depth or discussion on some of the topics presented. Furthermore, since my objective is to get you quickly working with the real examples, in this first chapter, I explain R fundamentals very briefly, just to introduce the minimum amount of knowledge you need to follow through the real examples presented in the following chapters. Therefore, you should not think that the explanations presented in this chapter are enough for you to understand R's basic constructs. If you're looking for a more in-depth introduction to R fundamentals, you may want to take a look at the references we mentioned previously.
Like most programming languages, R lets you assign values to variables and refer to these objects by name. The names you use to refer to variables are called symbols in R. This allows you to keep some information available in case it's needed at a later point in time. These variables may contain any type of object available in R, even combinations of them when using lists, as we will see in a later section in this chapter. Furthermore, these objects are immutable, but that's a topic for Chapter 9, Implementing an Efficient Simple Moving Average.
In R, the assignment operator is <-, which is a less-than symbol (<) followed by a dash (-). If you have worked with algorithm pseudo code before, you may find it familiar. You may also use the single equals symbol (=) for assignments, similar to many other languages, but I prefer to stick to the <- operator.
An expression like x <- 1 means that the value 1 is assigned to the x symbol, which can be thought of as a variable. You can also assign the other way around, meaning that with an expression like 1 -> x we would have the same effect as we did earlier. However, the assignment from left to right is very rarely used, and is more of a convenience feature in case you forget the assignment operator at the beginning of a line in the console.
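The three forms of assignment mentioned so far can be compared side by side in the console. The last part of this short sketch illustrates one practical difference that I'm adding as an aside: inside a function call, the equals symbol names an argument instead of assigning to a variable:

```r
x <- 1        # leftward assignment, the form used in this book
y = 2         # equals symbol, also valid at the top level
3 -> z        # rightward assignment, rarely used
c(x, y, z)
#> [1] 1 2 3

# Inside a call, = names the argument x of mean(); it does not assign
mean(x = c(4, 5, 6))
#> [1] 5

# The variable x created earlier is unchanged
x
#> [1] 1
```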
Note that the value substitution is done at the time when the value is assigned to z, not at the time when z is evaluated. If you enter the following code into the console, you can see that the second time z is printed, it still has the value that y had when it was used to assign to it, not the y value assigned afterward:
x <- 1
y <- 2
z <- c(x, y)
z
#> [1] 1 2
y <- 3
z
#> [1] 1 2
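A related behavior, which follows from the immutability mentioned earlier, is that modifying an object through one name does not affect other names that were assigned the same value. The following is a small sketch with hypothetical variable names:

```r
original <- c(1, 2, 3)
copy <- original      # both names now hold the same values

# Modifying copy produces a modified object; original is untouched
copy[1] <- 100

original
#> [1] 1 2 3
copy
#> [1] 100   2   3
```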
It's easy to use variable names like x, y, and z, but using them has a high cost for real programs. When you use names like that, you probably have a very good idea of what values they will contain and how they will be used. In other words, their intention is clear to you. However, when you give your code to someone else or come back to it after a long period of time, those intentions may not be clear, and that's when cryptic names can be harmful. In real programs, your names should be self-descriptive and instantly communicate intention.
For a deeper discussion about this and many other topics regarding high-quality code, take a look at Martin's excellent book Clean Code: A Handbook of Agile Software Craftsmanship, Prentice Hall, 2008.
Standard object names in R should only contain alphanumeric characters (numbers and ASCII letters), underscores (_), and, depending on context, even periods (.). However, R will allow you to use very cryptic strings if you want. For example, in the following code, we show how a variable named !A @B #C $D %E ^F is used to contain a vector with three numbers. As you can see, you are even allowed to use spaces. You can use this non-standard name provided that you wrap the string with backticks (`):
`!A @B #C $D %E ^F` <- c(1, 2, 3)
`!A @B #C $D %E ^F`
#> [1] 1 2 3
It goes without saying that you should avoid those names, but you should be aware they exist because they may come in handy when using some of R's more advanced features. These types of variable names are not allowed in most languages, but R is flexible in that way. Furthermore, the example goes to show a common theme around R programming: it is so flexible that if you're not careful, you will end up shooting yourself in the foot. It's not too rare for someone to be very confused about some code because they assumed R would behave a certain way (for example, raise an error under certain conditions) but don't explicitly test for such behavior, and later find that it behaves differently.
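One place where you may run into such names in practice is in data frames whose column names contain spaces. The following sketch, which uses a hypothetical prices data frame, shows how backticks let you work with such a column, and how the base R function make.names() converts arbitrary strings into standard names:

```r
# A data frame column whose name contains a space; check.names = FALSE
# tells data.frame() to keep the name exactly as given
prices <- data.frame(`unit price` = c(1.5, 2.5), check.names = FALSE)
names(prices)
#> [1] "unit price"

# Backticks are needed to refer to the column with the $ operator
prices$`unit price`
#> [1] 1.5 2.5

# make.names() replaces invalid characters with periods and adds an X
# prefix when a name would otherwise start with a digit
make.names(c("unit price", "2nd value"))
#> [1] "unit.price" "X2nd.value"
```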
This section summarizes the most important data types and data structures in R. In this brief overview, we won't discuss them in depth. We will only show a couple of examples that will allow you to understand the code shown throughout this book. If you want to dig deeper into them, you may look into their documentation or some of the references pointed out in this chapter's introduction.
The basic data types in R are numbers, text, and Boolean values (TRUE or FALSE), which R calls numerics, characters, and logicals, respectively. Strictly speaking, there are also types for integers, complex numbers, and raw data (bytes), but we won't use them explicitly in this book. The five basic data structures in R are vectors, factors, matrices, data frames, and lists, which we will summarize in the following sections.
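If you want to check what type or structure an object has, the base R function class() is a convenient starting point. The following sketch applies it to one hypothetical value of each kind mentioned above. For the matrix, I use is.matrix() instead, because the exact output of class() for matrices differs between R versions:

```r
# Basic data types
class(2.5)
#> [1] "numeric"
class("text")
#> [1] "character"
class(TRUE)
#> [1] "logical"
class(2L)             # the L suffix creates an integer
#> [1] "integer"
class(1 + 2i)
#> [1] "complex"
class(as.raw(255))
#> [1] "raw"

# Basic data structures
class(c(1, 2, 3))     # a vector reports the type of its elements
#> [1] "numeric"
class(factor(c("a", "b")))
#> [1] "factor"
is.matrix(matrix(1:4, nrow = 2))
#> [1] TRUE
class(data.frame(x = 1))
#> [1] "data.frame"
class(list(1, "a"))
#> [1] "list"
```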
Numbers in R behave pretty much as you would mathematically expect them to. For example, the operation 2 / 3
