R Programming By Example

Omar Trejo Navarro

Description

This step-by-step guide demonstrates how to build simple-to-advanced applications through examples in R using modern tools.

About This Book

  • Get a firm hold on the fundamentals of R through practical hands-on examples
  • Get started with good R programming fundamentals for data science
  • Exploit R's libraries to build interesting applications

Who This Book Is For

This book is for aspiring data science professionals or statisticians who would like to learn about the R programming language in a practical manner. Basic programming knowledge is assumed.

What You Will Learn

  • Discover techniques to leverage R's features, and work with packages
  • Perform a descriptive analysis and work with statistical models using R
  • Work efficiently with objects without using loops
  • Create diverse visualizations to gain better understanding of the data
  • Understand ways to produce good visualizations and create reports for the results
  • Read and write data from relational databases and REST APIs, both packaged and unpackaged
  • Improve performance by writing better code, delegating that code to a more efficient programming language, or making it parallel

In Detail

R is a high-level statistical language that is widely used among statisticians and data miners to develop analytical applications. Often, data analysts with great analytical skills lack solid programming knowledge and are unfamiliar with the correct ways to use R. Based on R version 3.4, this book will help you develop strong fundamentals when working with R by taking you through a series of full representative examples, giving you a holistic view of R.

We begin with the basic installation and configuration of the R environment. As you progress through the exercises, you'll become thoroughly acquainted with R's features and its packages. With this book, you will learn about the basic concepts of R programming, work efficiently with graphs, create publication-ready and interactive 3D graphs, and gain a better understanding of the data at hand. The detailed step-by-step instructions will enable you to get a clean set of data, produce good visualizations, and create reports for the results. It also teaches you various methods to perform code profiling and performance enhancement with good programming practices, delegation, and parallelization.

By the end of this book, you will know how to efficiently work with data, create quality visualizations and reports, and develop code that is modular, expressive, and maintainable.

Style and Approach

This is an easy-to-understand guide filled with real-world examples, giving you a holistic view of R and practical, hands-on experience.

Page count: 645

Year of publication: 2017




R Programming By Example

Practical, hands-on projects to help you get started with R

Omar Trejo Navarro

BIRMINGHAM - MUMBAI

R Programming By Example

Copyright © 2017 Packt Publishing

 

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

 

First published: December 2017

Production reference: 1201217

 

 

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.

ISBN 978-1-78829-254-2

 

www.packtpub.com

Credits

Author: Omar Trejo Navarro
Reviewer: Peter C. Figliozzi
Commissioning Editor: Merint Mathew
Acquisition Editor: Karan Sadawana
Content Development Editor: Rohit Kumar Singh
Technical Editor: Ruvika Rao
Copy Editor: Pranjali Chury
Project Coordinator: Vaidehi Sawant
Proofreader: Safis Editing
Indexer: Tejal Daruwale Soni
Graphics: Jason Monteiro
Production Coordinator: Shraddha Falebhai

About the Author

Omar Trejo Navarro is a data consultant. He co-founded Datata (datata.mx), is actively working on CVEST (cvest.tech), and maintains a personal website (otrenav.com). He is an applied mathematics and economics double major from ITAM (itam.mx) in Mexico City, where he continues to work as a research assistant. He does software development with a focus on data platforms, data science, and web applications. He has worked with clients from all over the world, and is a keen supporter of open source, open data, and open science in general. He can be reached through his personal website (otrenav.com).

 

 

This book is the product of combined efforts of many people. First of all, I'd like to thank my loved ones for their continued support and patience for my lack of availability. Next, I'd like to thank Peter C. Figliozzi for his valuable comments and feedback.
Also, I'd like to thank Rohit Kumar Singh and Ruvika Rao for their continued support, collaboration, and help during the production of this book.
Finally, I'd like to thank R's amazing community and the innumerable people who contribute to open knowledge through the internet. And, of course, I'd like to thank you, the reader, for picking up this book! I hope it's valuable to you.

About the Reviewer

Peter C. Figliozzi, PhD, is a professional data scientist and software developer. He works on problems in many areas, including anomaly detection, automated trading, and fraud prevention. Peter uses R through RStudio for ad hoc analysis, modeling, and visualization.


Table of Contents

Preface

What this book covers

What you need for this book

Who this book is for

Conventions

Reader feedback

Customer support

Downloading the example code

Downloading the color images of this book

Errata

Piracy

Questions

Introduction to R

What R is and what it isn't

The inspiration for R – the S language

R is a high quality statistical computing system

R is a flexible programming language

R is free, as in freedom and as in free beer

What R is not good for

Comparing R with other software

The interpreter and the console

Tools to work efficiently with R

Pick an IDE or a powerful editor

The send to console functionality

The efficient write-execute loop

Executing R code in non-interactive sessions

How to use this book

Tracking state with symbols and variables

Working with data types and data structures

Numerics

Special values

Characters

Logicals

Vectors

Factors

Matrices

Lists

Data frames

Divide and conquer with functions

Optional arguments

Functions as arguments

Operators are functions

Coercion

Complex logic with control structures

If… else conditionals

For loops

While loops

The examples in this book

Summary

Understanding Votes with Descriptive Statistics

This chapter's required packages

The Brexit votes example

Cleaning and setting up the data

Summarizing the data into a data frame

Getting intuition with graphs and correlations

Visualizing variable distributions

Using matrix scatter plots for a quick overview

Getting a better look with detailed scatter plots

Understanding interactions with correlations

Creating a new dataset with what we've learned

Building new variables with principal components

Putting it all together into high-quality code

Planning before programming

Understanding the fundamentals of high-quality code

Programming by visualizing the big picture

Summary

Predicting Votes with Linear Models

Required packages

Setting up the data

Training and testing datasets

Predicting votes with linear models

Checking model assumptions

Checking linearity with scatter plots

Checking normality with histograms and quantile-quantile plots

Checking homoscedasticity with residual plots

Checking no collinearity with correlations

Measuring accuracy with score functions

Programmatically finding the best model

Generating model combinations

Predicting votes from wards with unknown data

Summary

Simulating Sales Data and Working with Databases

Required packages

Designing our data tables

The basic variables

Simplifying assumptions

Potential pitfalls

The too-much-empty-space problem

The too-much-repeated-data problem

Simulating the sales data

Simulating numeric data according to distribution assumptions

Simulating categorical values using factors

Simulating dates within a range

Simulating numbers under shared restrictions

Simulating strings for complex identifiers

Putting everything together

Simulating the client data

Simulating the client messages data

Working with relational databases

Summary

Communicating Sales with Visualizations

Required packages

Extending our data with profit metrics

Building blocks for reusable high-quality graphs

Starting with simple applications for bar graphs

Adding a third dimension with colors

Graphing top performers with bar graphs

Graphing disaggregated data with boxplots

Scatter plots with joint and marginal distributions

Pricing and profitability by protein source and continent

Client birth dates, gender, and ratings

Developing our own graph type – radar graphs

Exploring with interactive 3D scatter plots

Looking at dynamic data with time-series

Looking at geographical data with static maps

Navigating geographical data with interactive maps

Maps you can navigate and zoom-in to

High-tech-looking interactive globe

Summary

Understanding Reviews with Text Analysis

This chapter's required packages

What is text analysis and how does it work?

Preparing, training, and testing data

Building the corpus with tokenization and data cleaning

Document feature matrices

Training models with cross validation

Training our first predictive model

Improving speed with parallelization

Computing predictive accuracy and confusion matrices

Improving our results with TF-IDF

Adding flexibility with N-grams

Reducing dimensionality with SVD

Extending our analysis with cosine similarity

Digging deeper with sentiment analysis

Testing our predictive model with unseen data

Retrieving text data from Twitter

Summary

Developing Automatic Presentations

Required packages

Why invest in automation?

Literate programming as a content creation methodology

Reproducibility as a benefit of literate programming

The basic tools for an automation pipeline

A gentle introduction to Markdown

Text

Headers

Header Level 1

Header Level 2

Header Level 3

Header Level 4

Lists

Tables

Links

Images

Quotes

Code

Mathematics

Extending Markdown with R Markdown

Code chunks

Tables

Graphs

Chunk options

Global chunk options

Caching

Producing the final output with knitr

Developing graphs and analysis as we normally would

Building our presentation with R Markdown

Summary

Object-Oriented System to Track Cryptocurrencies

This chapter's required packages

The cryptocurrencies example

A brief introduction to object-oriented programming

The purpose of object-oriented programming

Important concepts behind object-oriented languages

Encapsulation

Polymorphism

Hierarchies

Classes and constructors

Public and private methods

Interfaces, factories, and patterns in general

Introducing three object models in R – S3, S4, and R6

The first source of confusion – various object models

The second source of confusion – generic functions

The S3 object model

Classes, constructors, and composition

Public methods and polymorphism

Encapsulation and mutability

Inheritance

The S4 object model

Classes, constructors, and composition

Public methods and polymorphism

Encapsulation and mutability

Inheritance

The R6 object model

Classes, constructors, and composition

Public methods and polymorphism

Encapsulation and mutability

Inheritance

Active bindings

Finalizers

The architecture behind our cryptocurrencies system

Starting simple with timestamps using S3 classes

Implementing cryptocurrency assets using S4 classes

Implementing our storage layer with R6 classes

Communicating available behavior with a database interface

Implementing a database-like storage system with CSV files

Easily allowing new database integration with a factory

Encapsulating multiple databases with a storage layer

Retrieving live data for markets and wallets with R6 classes

Creating a very simple requester to isolate API calls

Developing our exchanges infrastructure

Developing our wallets infrastructure

Implementing our wallet requesters

Finally introducing users with S3 classes

Helping ourselves with a centralized settings file

Saving our initial user data into the system

Activating our system with two simple functions

Some advice when working with object-oriented systems

Summary

Implementing an Efficient Simple Moving Average

Required packages

Starting by using good algorithms

Just how much impact can algorithm selection have?

How fast is fast enough?

Calculating simple moving averages inefficiently

Simulating the time-series 

Our first (very inefficient) attempt at an SMA

Understanding why R can be slow

Object immutability

Interpreted dynamic typings

Memory-bound processes

Single-threaded processes

Measuring by profiling and benchmarking

Profiling fundamentals with Rprof()

Benchmarking manually with system.time()

Benchmarking automatically with microbenchmark()

Easily achieving high benefit-cost improvements

Using the simple data structure for the job

Vectorizing as much as possible

Removing unnecessary logic

Moving checks out of iterative processes

If you can, avoid iterating at all

Using R's way of iterating efficiently

Avoiding sending data structures with overheads

Using parallelization to divide and conquer

How deep does the parallelization rabbit hole go?

Practical parallelization with R

Using C++ and Fortran to accelerate calculations

Using an old-school approach with Fortran

Using a modern approach with C++

Looking back at what we have achieved

Other topics of interest to enhance performance

Preallocating memory to avoid duplication

Making R code a bit faster with byte code compilation

Just-in-time (JIT) compilation of R code

Using memoization or cache layers

Improving our data and memory management

Using specialized packages for performance

Flexibility and power with cloud computing

Specialized R distributions

Summary

Adding Interactivity with Dashboards

Required packages

Introducing the Shiny application architecture and reactivity

What is functional reactive programming and why is it useful?

How is functional reactivity handled within Shiny?

The building blocks for reactivity in Shiny

The input, output, and rendering functions

Designing our high-level application structure

Setting up a two-column distribution

Introducing sections with panels

Inserting a dynamic data table

Introducing interactivity with user input

Setting up static user inputs

Setting up dynamic options in a drop-down

Setting up dynamic input panels

Adding a summary table with shared data

Adding a simple moving average graph

Adding interactivity with a secondary zoom-in graph

Styling our application with themes

Other topics of interest

Adding static images

Adding HTML to your web application

Adding custom CSS styling

Sharing your newly created application

Summary

Required Packages

External requirements – software outside of R

Dependencies for the RMySQL R package

Ubuntu 17.10

macOS High Sierra

Setting up user/password in both Linux and macOS

Dependencies for the rgl and rgdal R packages

Ubuntu 17.10

macOS High Sierra

Dependencies for the Rcpp package and the .Fortran() function

Ubuntu 17.10

macOS High Sierra

Internal requirements – R packages

Loading R packages

Preface

In a world where data is becoming increasingly important, data analysts, scientists, and business people need tools to analyze and process large volumes of data efficiently. This book is my attempt to pass on what I've learned so far, so that you can quickly become an effective and efficient R programmer. Reading it will help you understand how to use R to solve complex problems and avoid some of the mistakes I've made, and it will teach you useful techniques that can be helpful in a variety of contexts. In the process, I hope to show you that, despite its uncommon aspects, R is an elegant and powerful language, and is well suited for data analysis and statistics, as well as complex systems.

After reading this book, you will be familiar with R's fundamentals, as well as some of its advanced features. You will understand data structures, and you will know how to deal with them efficiently. You will also understand how to design complex systems that perform efficiently, and how to make these systems usable by other people through web applications. At a lower level, you will understand how to work with object-oriented programming, functional programming, and reactive programming, and what code may be better written in each of these paradigms. You will learn how to use various cutting-edge tools that R provides to develop software, how to identify performance bottlenecks, and how to fix them, possibly using other programming languages such as Fortran and C++. Finally, you will be comfortable reading and understanding the majority of R code, as well as providing feedback on others' code.

What this book covers

Chapter 1, Introduction to R, covers the R basics you need to understand the rest of the examples. It is not meant to be a thorough introduction to R. Rather, it's meant to give you the very basic concepts and techniques you need to quickly get started with the three examples contained in the book, which I introduce next.

This book uses three examples to showcase R's wide range of functionality. The first example shows how to analyze votes with descriptive statistics and linear models, and it is presented in Chapter 2, Understanding Votes with Descriptive Statistics and Chapter 3, Predicting Votes with Linear Models.

Chapter 2, Understanding Votes with Descriptive Statistics, shows how to programmatically create hundreds of graphs to identify relations within data visually. It shows how to create histograms, scatter plots, and correlation matrices, and how to perform Principal Component Analysis (PCA).

Chapter 3, Predicting Votes with Linear Models, shows how to programmatically find the best predictive linear model for a dataset according to different success metrics. It also shows how to check model assumptions, and how to use cross validation to increase confidence in your results.

The second example shows how to simulate data, visualize it, analyze its text components, and create automatic presentations with it.

Chapter 4, Simulating Sales Data and Working with Databases, shows how to design a data schema and simulate various types of data. It also shows how to integrate real text data with simulated data, and how to use a SQL database to access it more efficiently.

Chapter 5, Communicating Sales with Visualizations, shows how to produce graphs that range from basic to advanced and highly customized. It also shows how to create dynamic 3D graphs and interactive maps.

Chapter 6, Understanding Reviews with Text Analysis, shows how to perform text analysis step by step using Natural Language Processing (NLP) techniques, as well as sentiment analysis.

Chapter 7, Developing Automatic Presentations, shows how to put together the results of previous chapters to create presentations that can be automatically updated with the latest data using tools such as knitr and R Markdown.

Finally, the third example shows how to design and develop complex object-oriented systems that retrieve real-time data from cryptocurrency markets, as well as how to optimize implementations and how to build web applications around such systems.

Chapter 8, Object-Oriented System to Track Cryptocurrencies, introduces basic object-oriented techniques that produce complex systems when combined. Furthermore, it shows how to work with three of R’s most used object models, which are S3, S4, and R6, as well as how to make them work together.

Chapter 9, Implementing an Efficient Simple Moving Average, shows how to iteratively improve an implementation for a Simple Moving Average (SMA), starting with what is considered to be bad code, all the way to advanced optimization techniques using parallelization, and delegation to the Fortran and C++ languages.

Chapter 10, Adding Interactivity with Dashboards, shows how to wrap what was built during the previous two chapters to produce a modern web application using reactive programming through the Shiny package.

Appendix, Required Packages, shows how to install the internal and external software necessary to replicate the examples in the book. Specifically, it will walk through the installation processes for Linux and macOS, but Windows follows similar principles and should not cause any problems.

What you need for this book

This book was written in a Linux environment (specifically Ubuntu 17.10), and was also tested on macOS High Sierra. Even though it was not tested on a Windows computer, all of the R code presented in this book should work fine on one. The only substantial difference is that when I show you how to perform a task using a terminal, it will be the bash terminal, which is available in Linux and macOS by default. In the case of Windows, you will need to use the cmd.exe terminal, for which you can find a lot of information online. Keep in mind that if you're using a Windows computer, you should be prepared to do a bit more research on your end to replicate the same functionality, but you should not have much trouble.

In the appendix, I show you how to install the software you need to replicate the examples shown in this book. I show you how to do so for Linux and macOS, specifically Ubuntu 17.10 and High Sierra. If you're using Windows, the same principles apply but the specifics may be a bit different. However, I'm sure it will not be too hard in any case.

There are two types of requirements you need to be able to execute all the code in this book: external and internal. Software outside of R is what I call external requirements. Software inside of R, meaning R packages, is what I refer to as internal requirements. I walk you through the installation of both of them in the appendix.

Who this book is for

This book is for those who wish to develop software in R. You don't need to be an expert or professional programmer to follow this book, but you do need to be interested in learning how R works. My hope is that this book is useful for people ranging from beginners to advanced users by providing hands-on examples that may help you understand R in ways you previously did not.

I assume basic programming, mathematical, and statistical knowledge, because there are various parts in the book where concepts from these disciplines will be used, and they will not be explained in detail. If you have programmed something yourself in any programming language, know basic linear algebra and statistics, and know what linear regression is, you have everything you need to understand this book.

This book was written for people in a variety of contexts and with diverse profiles. For example, if you are an analyst employed by an organization that requires you to do frequent data processing to produce reports on a regular basis, and you need to develop programs to automate such tasks, this book is for you. If you are an academic researcher who wants to use current techniques, combine them, and develop tools to test them automatically, this book is for you. If you're a professional programmer looking for ways to take advantage of advanced R features, this book is for you. Finally, if you're preparing for a future in which data will be of paramount importance (it already is), this book is for you.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book: what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of. To send us general feedback, simply email [email protected], and mention the book's title in the subject of your message. If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files emailed directly to you. You can download the code files by following these steps:

1. Log in or register to our website using your email address and password.
2. Hover the mouse pointer on the SUPPORT tab at the top.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box.
5. Select the book for which you're looking to download the code files.
6. Choose from the drop-down menu where you purchased this book from.
7. Click on Code Download.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

  • WinRAR / 7-Zip for Windows
  • Zipeg / iZip / UnRarX for Mac
  • 7-Zip / PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/R-Programming-By-Example. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Downloading the color images of this book

We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from https://www.packtpub.com/sites/default/files/downloads/RProgrammingByExample_ColorImages.pdf.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books (maybe a mistake in the text or the code), we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.

To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyrighted material on the internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at [email protected] with a link to the suspected pirated material. We appreciate your help in protecting our authors and our ability to bring you valuable content.

Questions

If you have a problem with any aspect of this book, you can contact us at [email protected], and we will do our best to address the problem.

Introduction to R

In a world where data is becoming increasingly important, business people and scientists need tools to analyze and process large volumes of data efficiently. R is one of the tools that has become increasingly popular in recent years for data processing, statistical analysis, and data science, and while R has its roots in academia, it is now used by organizations across a wide range of industries and geographical areas.

Some of the important topics covered in this chapter are as follows:

  • History of R and why it was designed the way it was
  • What the interpreter and the console are and how to use them
  • How to work with basic data types and data structures of R
  • How to divide work by using functions in different ways
  • How to introduce complex logic with control structures

What R is and what it isn't

When it comes to choosing software for statistical computing, it's tough to argue against R. Who could dislike a high-quality, cross-platform, open source, statistical software product? It has an interactive console for exploratory work. It can run as a scripting language to replicate processes. It has a lot of statistical models built in, so you don't have to reinvent the wheel, but when the base toolset is not enough, you have access to a rich ecosystem of external packages. And it's free! No wonder R has become a favorite in the age of data.
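To make the points about exploratory work and built-in models a bit more concrete, here is a minimal sketch of what a short console session might look like. It assumes only base R and the mtcars dataset that ships with it; the choice of variables is purely illustrative and is not one of this book's examples.

```r
# mtcars is a small dataset bundled with base R (datasets package).
# Quick descriptive summary of fuel efficiency (miles per gallon).
mpg_summary <- summary(mtcars$mpg)
print(mpg_summary)

# A built-in statistical model: linear regression of mpg on car weight.
fit <- lm(mpg ~ wt, data = mtcars)
print(coef(fit))
```

The same lines could be saved to a file (say, a hypothetical analysis.R) and run non-interactively with Rscript analysis.R, which is one way of replicating a process as a script.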

The inspiration for R – the S language

R was inspired by the S statistical language developed by John Chambers at AT&T. The name S is an allusion to another one-letter-name programming language also developed at AT&T, the famous C language. R was created by Ross Ihaka and Robert Gentleman in the Department of Statistics at the University of Auckland in 1991.

The general S philosophy sets the stage for the design of the R language itself, which many programmers coming from other programming languages find somewhat odd and confusing. In particular, it's important to realize that S was developed to make data analysis as easy as possible. 

"We wanted users to be able to begin in an interactive environment, where they did not consciously think of programming. Then as their needs became clearer and their sophistication increased, they should be able to slide gradually into programming, when the language and system aspects would become more important."
– John Chambers

The key part here is the transition from analyst to developer. They wanted to build a language that could easily serve both types of users: one that would be suitable for interactive data analysis through a command line, but which could also be used to program complex systems, like traditional programming languages.

It's no coincidence that this book is structured that way. We will start doing data analysis first, and we will gradually move toward developing a full and complex system for information retrieval with a web application on top.

R is a high quality statistical computing system

R is comparable, and often superior, to commercial products when it comes to programming capabilities, complex systems development, graphic production, and community ecosystems. Researchers in statistics and machine learning, as well as many other data-related disciplines, will often publish R packages to accompany their publications. This translates into immediate public access to the very latest statistical techniques and implementations. Whatever model or graphic you're trying to develop, chances are that someone has already tried it, and if not, you can at least learn from their efforts.

R is a flexible programming language

As we have seen, in addition to providing statistical tools, R is a general-purpose programming language. You can use R to extend its own functionality, automate processes that make use of complex systems, and many other things. It incorporates features from other object-oriented programming languages and has strong foundations for functional programming, which is well suited for solving many of the challenges of data analysis. R allows the user to write powerful, concise, and descriptive code.

R is free, as in freedom and as in free beer

In many ways, a language is successful inasmuch as it creates a platform with which many people can create new things, and R has proven to be very successful in this regard. One key limitation of the S language was that it was only available in a commercial package, but R is free software. Free as in freedom, and free as in free beer.

The copyright for the primary source code for R is held by the R Foundation, and the code is published under the GNU General Public License (GPL). According to the Free Software Foundation (http://www.fsf.org/), with free software (free as in freedom) you are granted the following four freedoms:

Freedom 0: Run the program for any purpose

Freedom 1: Study how the program works and adapt it to your needs

Freedom 2: Redistribute copies so you can help your neighbor

Freedom 3: Improve the program and release your improvements to the public

These freedoms have allowed R to develop strong prolific communities that include world-class statisticians and programmers as well as many volunteers, who help improve and extend the language. They also allow for R to be developed and maintained for all popular operating systems, and to be easily used by individuals and organizations who wish to do so, possibly sharing their findings in a way that others can replicate their results. Such is the power of free software.

What R is not good for

No programming language or system is perfect. R certainly has a number of drawbacks, the most common being that it can be painfully slow (when not used correctly). Keep in mind that R is essentially based on 40-year-old technology, going back to the original S system developed at Bell Labs. Therefore, several of its imperfections come from the fact that it was not built in anticipation of the data age we live in now. When R was born, disk and RAM were very expensive and the internet was just getting started. Notions of large-scale data analysis and high-performance computing were rare.

Fast-forward to the present: hardware cost is just a fraction of what it used to be, computing power is available online for pennies, and everyone is interested in collecting and analyzing data at large scale. This surge in data analysis has brought to the forefront two of R's fundamental limitations: it's single-threaded and memory-bound. These two characteristics drastically slow it down. Furthermore, R is an interpreted, dynamically typed language, which can make it even slower. And finally, R has object immutability and various ways to implement object-oriented programming, both of which can make it hard for people, especially those coming from other languages, to write high-quality code if they don't know how to deal with them. You should know that all of the characteristics mentioned in this paragraph are addressed in Chapter 9, Implementing an Efficient Simple Moving Average.

A double-edged sword in R is that most of its users do not think of themselves as programmers, and are more concerned with results than with process (which is not necessarily a bad thing). This means that much of the R code you can find online is written without regard for elegance, speed, or readability, since most R users do not revise their code to address these shortcomings. The result is code that is patchy and not rigorously tested, which in turn produces many edge cases that you must take into account when using low-quality packages. You will do well to keep this in mind.

Comparing R with other software

My intention for this section is not to provide a comprehensive comparison between R and other software, but to simply point out a few of R's most noticeable features. If you can, I encourage you to test other software yourself so that you know first-hand what may be the best tool for the job at hand.

The most noticeable feature of R compared to other statistical software such as SAS, Stata, SPSS, and even Python, is the very large number of packages that it has available. At the time of writing this, there are almost 12,000 packages published in The Comprehensive R Archive Network (CRAN) (https://cran.r-project.org/), and this does not include packages published in other places, such as Git repositories. This enables R to have a very large community and a huge number of tools for data analysis in areas such as finance, mathematics, machine learning, high-performance computing, and many others.

R has far more programming capabilities than SAS, Stata, and SPSS, and in some respects even more than Python (for example, in R, you may use different object models). However, efficient and effective R usage requires writing code, which implies a steep learning curve for some people, while Stata and SPSS have graphical user interfaces that guide the user through many of the tasks with point-and-click wizards. In my opinion, this hand-holding, although nice for beginners, quickly becomes an important restriction for people looking to become intermediate to advanced users, and that's where the advantage of programming really shines.

R has one of the best graphics systems among all existing software. The most popular package for producing graphs in R, which we will use extensively in this book, is the ggplot2 package, but there are many other great graphing packages as well. This package allows the modification of virtually every aspect of a graph through its graphics grammar, and is far superior to anything I've seen in SPSS, Stata, SAS, or even Python.

R is a great tool, but it's not the right tool for everything. If you're looking to perform data analysis but don't want to invest the time in learning to program, then software like SAS, Stata, or SPSS may be a better option for you. If you're looking to develop analytical software that is very easily integrated into larger systems and which needs to plug into various interfaces, then Python may be a better tool for the job. However, if you're looking to do a lot of complex data analysis and graphing, and you are going to mostly spend your time focused on these areas, then R is a great choice.

The interpreter and the console

As I mentioned earlier, R is an interpreted language. When you enter an expression into the R console or execute an R script in your operating system's terminal, a program called the interpreter parses and executes the code. Other examples of interpreted languages are Lisp, Python, and JavaScript. Unlike C, C++, and Java, R doesn't require you to explicitly compile your programs before you execute them.

All R programs are composed of a series of expressions. The interpreter begins by parsing each expression, substitutes objects for symbols where appropriate, evaluates the result, and finally returns the resulting objects. We will define each of these concepts in the following sections, but you should understand that this is the basic process that all R programs go through.
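To make these steps concrete, here is a minimal sketch that performs the parsing and evaluation by hand using the base R functions parse() and eval(). The variable names (expression_text, parsed, result) are chosen only for illustration, and in normal use the interpreter does all of this for you behind the scenes.

```r
# Text of an expression, as you would type it into the console
expression_text <- "x + 2"

# Parse the text into an unevaluated expression object
parsed <- parse(text = expression_text)[[1]]
class(parsed)  # a "call" object; nothing has been computed yet

# Define the object that the symbol x refers to
x <- 1

# Evaluate the expression; x is looked up and replaced by its value
result <- eval(parsed)
print(result)
```

Note that x is defined after the expression is parsed, which works because symbols are only looked up when the expression is evaluated.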

The R console is the most important tool for using R and can be thought of as a wrapper around the interpreter. The console is a tool that allows you to type expressions directly into R and see how it responds. The interpreter will read the expressions and respond with a result or an error message, if there was one. When you execute expressions through the console, the interpreter will pass objects to the print() function automatically, which is why you can see the result printed below your expressions (we'll cover more on functions later).

If you've used a command line before (for example, bash in Linux or macOS or cmd.exe in Windows) or a language with an interactive interpreter such as Lisp, Python, or JavaScript, the console should look familiar since it simply is a command-line interface. If not, don't worry. Command-line interfaces are simple-to-use tools. They are programs that receive code and return objects whose printed representations you see below the code you execute.

When you launch R, you will see a window with the R console. Inside the console you will see a message like the one shown below. This message displays some basic information, including the version of R you're running, license information, reminders about how to get help, and a Command Prompt.

Note that the R version in this case is 3.4.2. The code developed during this book will assume this version. If you have a different version and end up with some problems, the version difference is one possible cause you may want to look into.

You should note that, by default, R will display a greater-than sign (>) at the beginning of the last line of the console, signaling you that it's ready to receive commands. Since R is prompting you to type something, this is called a Command Prompt. When you see the greater-than symbol, R is able to receive more expressions as input. When you don't, it is probably because R is busy processing something you sent, and you should wait for it to finish before sending something else.

For the most part, in this book we will avoid showing such command prompts at all, since you may be typing the code into a source code file or directly into the console, but if we do introduce it, make sure that you don't explicitly type it. For example, if you want to replicate the following snippet, you should only type 1 + 2 in your console, and press the Enter key. When you do, you will see a [1] 3 which is the output you received back from R. Go ahead and execute various arithmetic expressions to get a feel for the console:

> 1 + 2
[1] 3

Note the [1] that accompanies each returned value. It's there because the result is actually a vector (an ordered collection). The [1] means that the index of the first item displayed in that row is 1 (in this case, our resulting vector has a single value within).
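The index marker becomes more useful with longer vectors. In the following sketch (the values variable is just an illustrative name), the printed output will usually wrap across several rows, and each row starts with the index of its first element. How many values fit on each row depends on the width of your console, so the markers you see may differ from someone else's.

```r
# A vector with 30 elements, usually too long for a single console row
values <- 101:130
print(values)

# A single number is simply a vector of length one
length(1 + 2)
```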

Finally, you should know that the console provides tools for looking through previous commands. You will probably find that the up and down arrow keys are the most useful. You can scroll through previous commands by pressing them. The up arrow lets you look at earlier commands, and the down arrow lets you look at later commands. If you would like to repeat a previous command with a minor change, or if you need to correct a mistake, you can easily do so using these keys.

Tools to work efficiently with R

In this section we discuss the tools that will help us when working with R.

Pick an IDE or a powerful editor

For efficient code development, you may want to try a more powerful editor or an Integrated Development Environment (IDE). The most popular IDE for R is RStudio (https://www.rstudio.com/). It offers an impressive feature set that makes interacting with R much easier. If you're new to R, and programming in general, this is probably the way to go. As you can see in the image below it wraps the console (right side) within a larger application which offers a lot of functionality, and in this case, it is displaying the help system (left side). Furthermore, RStudio offers tabs to navigate files, browse installed packages, visualize plots, among other features, as well as a large amount of configuration options under the top menu dropdowns.

Throughout this book, we will not use any functionality provided by RStudio. All I will show you is pure R functionality. I decided to proceed this way to make sure that the book is useful for any R programmer, including those who do not use RStudio. For RStudio users, this means that there may be easier ways to accomplish some of the tasks I will show, and instead of programming a few lines, you could simply click some buttons. If that's something you prefer, I encourage you to take a look through the excellent RStudio Essential webinars, which can be found on RStudio's website at https://www.rstudio.com/resources/webinars/?wvideo=lxel3j2kos, as well as Stanford's Introduction to R, RStudio (https://web.stanford.edu/class/stats101/intro/intro-lab01.html).

You should be careful to avoid the common mistake of referring to R as RStudio. Since many people are introduced to R through RStudio, they think that RStudio is actually R, which it is not. RStudio is a wrapper around R to extend its functionality, and is technically known as an IDE.

Experienced programmers may prefer to work with other tools they already know and love and have used for many years. For example, in my case, I prefer to use Emacs (https://www.gnu.org/software/emacs/) for any programming I do. Emacs is a very powerful text editor that you can programmatically extend to work the way you want it to by using a programming language known as Elisp, which is a Lisp dialect. In case you use Emacs too, the ess package is all you really need.

If you're going to use Emacs, I encourage you to take a look through the ess package's documentation (https://ess.r-project.org/Manual/ess.html) and Johnson's presentation titled Emacs Has No Learning Curve, University of Kansas, 2015 (http://pj.freefaculty.org/guides/Rcourse/emacs-ess/emacs-ess.pdf). If you use Vim, Sublime Text, Atom, or other similar tools, I'm confident you can find useful packages as well.

The send to console functionality

The base R installation provides the console environment we mentioned in the previous section. This console is really all you need to work with R, but it will quickly become cumbersome to type everything directly into it and it may not be your best option. To efficiently work with R, you need to be able to experiment and iterate as fast as you can. Doing so will accelerate your learning curve and productivity.

Whichever tool you use, the key functionality you need is to be able to easily send code snippets into the console without having to type them yourself, or copying them from your editor and pasting them into the console. In RStudio, you can accomplish this by clicking on the Run or Source button in the top-right corner of the code editor panel. In Emacs, you may use the ess-eval-region command.

The efficient write-execute loop

One of the most productive ways to work with R, especially when learning it, is to use the write-execute loop, which makes use of the send to console functionality mentioned in the previous section. This will allow you to do two very important things: develop your code through small and quick iterations, which allow you to see step-by-step progress until you converge to the behavior you seek, and save the code you converged to as your final result, which can be easily reproduced using the source code file you used for your iterations. R source code files use the .R extension.

Assuming you have a source code file ready to send expressions to the console, the basic steps through the write-execute loop are as follows:

1. Define what behavior you're looking to implement with code.

2. Write the minimal amount of code necessary to achieve one piece of the behavior you seek in your implementation.

3. Use the send to console functionality to verify that the result in the console is what you expected, and if it's not, to identify possible causes.

4. If it's not what you expected, go back to the second step with the purpose of fixing the code until it has the intended piece of behavior.

5. If it's what you expected, go back to the second step with the purpose of extending the code with another piece of the behavior, until convergence.

This write-execute loop will become second nature to you as you start using it, and when it does, you'll be a more productive R programmer. It will allow you to diagnose issues faster, to quickly experiment with a few ways of accomplishing the same behavior to find which one seems best for your context, and once you have working code, it will also allow you to clean your implementation to keep the same behavior but have better or more readable code.

For experienced programmers, this should be a familiar process, and it's very similar to Test-Driven Development (TDD), but instead of using unit tests to automatically test the code, you verify the output in the console in each iteration, and you don't have a set of tests to rerun in each iteration. Even though TDD will not be used in this book, you can definitely use it in R.
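As a rough illustration of the difference, the following sketch replaces the manual check in the console with automated checks written using base R's stopifnot() function. The celsius_to_fahrenheit() function is a hypothetical example that does not appear elsewhere in this book, and a real TDD setup would more likely rely on a dedicated testing package such as testthat.

```r
# Hypothetical function being developed through the write-execute loop
celsius_to_fahrenheit <- function(celsius) {
    celsius * 9 / 5 + 32
}

# Manual verification: send the expression to the console and read the output
print(celsius_to_fahrenheit(100))

# Automated verification: stopifnot() raises an error if a condition is FALSE,
# so these checks can be re-executed after every change to the function
stopifnot(celsius_to_fahrenheit(0) == 32)
stopifnot(celsius_to_fahrenheit(100) == 212)
```

If all the conditions hold, stopifnot() returns silently, so no output means the checks passed.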

I encourage you to use this write-execute loop to work through the examples presented in this book. At times, we will show step-by-step progress so that you understand the code better, but it's practically impossible to show all of the write-execute loop iterations I went through to develop it, and much of the knowledge you can acquire comes from iterating this way.

Executing R code in non-interactive sessions

Once your code has the functionality you were looking to implement, executing it through an interactive session using the console may not be the best way to do so. In such cases, another option you have is to tell your computer to directly execute the code for you, in a non-interactive session. This means that you won't be able to type commands into the console, but you'll get the benefit of being able to configure your computer to automatically execute code for you, or to integrate it into larger systems where R is only one of many components. This is known as batch mode.

To execute code in batch mode, you have two options: the old R CMD BATCH command, which we won't look into, and the newer Rscript command, which we will. Rscript is a command that you can execute within your computer's terminal. It receives the name of a source code file and executes its contents.

In the following example, we will make use of various concepts that we will explain in later sections, so if you don't feel ready to understand it, feel free to skip it now and come back to it later.

Suppose you have the following code in a file named greeting.R. It gets the arguments passed through the command line to Rscript through the args object created with the commandArgs() function, assigns the corresponding values to the greeting and name variables, and finally prints a vector that contains those values.

args <- commandArgs(TRUE)
greeting <- args[1]
name <- args[2]
print(c(greeting, name))

Once ready, you may use the Rscript command to execute it from your Terminal (not from within your R console) as shown below. The result shows the vector with the greeting and name variable values you passed it.

When you see a Command Prompt that begins with the $ symbol instead of the > symbol, it means that you should execute that line in your computer's Terminal, not in the R console.

$ Rscript greeting.R Hi John
[1] "Hi" "John"

Note that if you simply execute the file without any arguments, the args vector will be empty, so args[1] and args[2] evaluate to NA and both variables end up with NA values, which allows you to customize your code to deal with such situations:

$ Rscript greeting.R
[1] NA NA
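One possible way of dealing with missing arguments is sketched here. It checks for NA values with is.na() and falls back to default values. The defaults "Hello" and "World" are assumptions made for this illustration, and the empty args vector stands in for what commandArgs(TRUE) returns when no arguments are passed. The if statements used here are explained later in this chapter.

```r
# Assumption: this stands in for commandArgs(TRUE) with no arguments passed
args <- character(0)

greeting <- args[1]
name <- args[2]

# Fall back to (hypothetical) default values when an argument is missing
if (is.na(greeting)) greeting <- "Hello"
if (is.na(name)) name <- "World"

print(c(greeting, name))
```

In an actual script, you would keep the args <- commandArgs(TRUE) line from greeting.R instead of the stand-in.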

This was a very simple example, but the same mechanism can be used to execute much more complex systems, like the one we will build in the final chapters of this book to constantly retrieve real-time price data from remote servers.

Finally, if you want to provide a mechanism that is closer to the one in Python, you may want to look into the optparse package to create command-line help pages as well as to parse arguments.

How to use this book

To make the most out of this book, you should recreate on your own the examples shown throughout, and make sure that you understand what each of them is doing in detail. If at some point you feel confused, it's not too difficult to do a couple of searches online to clarify things for yourself. However, I highly recommend that you look into the following books as well, which go into more detail on some of the concepts and ideas presented in this book, and are considered very good references for R programmers:

R in a Nutshell, by Adler, O'Reilly, 2010

The Art of R Programming, by Matloff, No Starch Press, 2011

Advanced R, by Wickham, CRC Press, 2015

R Programming for Data Science, by Peng, LeanPub, 2016

Sometimes all you need to do to clarify something is use R's help system. To get help on a function, you may use the question mark notation, like ?function_name, but in case you want to search for help on a topic, you may use the help.search() function with the topic as a quoted string, like help.search("regression"). This can be helpful if you know what topic you're interested in but can't remember the actual name of the function you want to use. Another way of invoking such functionality is using the double question mark notation, like ??regression.

Keep in mind that topics in this book are interconnected and not linearly ordered, which means that at times it will seem that we are jumping around. When that happens, it's because a topic can be seen through different points of view. That's why, to make the most out of this book, you should experiment as much as you can in the console and build code progressively using the write-execute loop mentioned earlier. If you simply replicate the code exactly as is shown, you may miss some of the learning that you could have gotten had you built the systems step by step.

Finally, you should know that this book is meant to show how to use R through somewhat real examples, and as such, does not provide too much technical depth or discussion on some of the topics presented. Furthermore, since my objective is to get you quickly working with the real examples, in this first chapter, I explain R fundamentals very briefly, just to introduce the minimum amount of knowledge you need to follow through the real examples presented in the following chapters. Therefore, you should not think that the explanations presented in this chapter are enough for you to understand R's basic constructs. If you're looking for a more in-depth introduction to R fundamentals, you may want to take a look at the references we mentioned previously.

Tracking state with symbols and variables

Like most programming languages, R lets you assign values to variables and refer to these objects by name. The names you use to refer to variables are called symbols in R. This allows you to keep some information available in case it's needed at a later point in time. These variables may contain any type of object available in R, even combinations of them when using lists, as we will see in a later section in this chapter. Furthermore, these objects are immutable, but that's a topic for Chapter 9, Implementing an Efficient Simple Moving Average.

In R, the assignment operator is <-, which is a less-than symbol (<) followed by a dash (-). If you have worked with algorithm pseudocode before, you may find it familiar. You may also use the single equals symbol (=) for assignments, similar to many other languages, but I prefer to stick to the <- operator.

An expression like x <- 1 means that the value 1 is assigned to the x symbol, which can be thought of as a variable. You can also assign the other way around, meaning that with an expression like 1 -> x we would have the same effect as we did earlier. However, the assignment from left to right is very rarely used, and is more of a convenience feature in case you forget the assignment operator at the beginning of a line in the console.
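The three forms of assignment can be compared side by side in a short sketch. The variable names first, second, and third are used only for this illustration.

```r
first <- 1    # the usual assignment operator
second = 2    # the single equals symbol also assigns
3 -> third    # assignment from left to right, rarely used

# All three symbols now refer to values
print(c(first, second, third))
```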

Note that the value substitution is done at the time when the value is assigned to z, not at the time when z is evaluated. If you enter the following code into the console, you can see that the second time z is printed, it still has the value that y had when it was used to assign to it, not the y value assigned afterward:

x <- 1
y <- 2
z <- c(x, y)
z
#> [1] 1 2

y <- 3
z
#> [1] 1 2

It's easy to use variable names like x, y, and z, but using them has a high cost for real programs. When you use names like that, you probably have a very good idea of what values they will contain and how they will be used. In other words, their intention is clear to you. However, when you give your code to someone else or come back to it after a long period of time, those intentions may not be clear, and that's when cryptic names can be harmful. In real programs, your names should be self-descriptive and instantly communicate intention.

For a deeper discussion about this and many other topics regarding high-quality code, take a look at Martin's excellent book Clean Code: A Handbook of Agile Software Craftsmanship, Prentice Hall, 2008.

Standard object names in R should only contain alphanumeric characters (numbers and ASCII letters), underscores (_), and, depending on context, even periods (.). However, R will allow you to use very cryptic strings if you want. For example, in the following code, we show how the variable name !A @B #C $D %E ^F is used to contain a vector with three numbers. As you can see, you are even allowed to use spaces. You can use this non-standard name provided that you wrap the string with backticks (`):

`!A @B #C $D %E ^F` <- c(1, 2, 3)
`!A @B #C $D %E ^F`
#> [1] 1 2 3

It goes without saying that you should avoid those names, but you should be aware they exist because they may come in handy when using some of R's more advanced features. These types of variable names are not allowed in most languages, but R is flexible in that way. Furthermore, the example goes to show a common theme around R programming: it is so flexible that if you're not careful, you will end up shooting yourself in the foot. It's not too rare for someone to be very confused about some code because they assumed R would behave a certain way (for example, raise an error under certain conditions) but don't explicitly test for such behavior, and later find that it behaves differently.
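Two situations where backticks come in handy are sketched here: calling an operator as if it were a regular function, and referring to data frame columns whose names contain spaces. The sales data and its column names are made up for this illustration, and data frames themselves are introduced later in this chapter.

```r
# Operators are functions with non-standard names, so backticks let you
# call them like any other function
`+`(1, 2)

# Column names containing spaces (hypothetical example data)
sales <- data.frame(
    `unit price` = c(2.5, 4),
    `units sold` = c(10, 3),
    check.names = FALSE  # keep the column names exactly as written
)

# Backticks are needed to refer to such columns with the $ operator
revenue <- sales$`unit price` * sales$`units sold`
print(revenue)
```

Without check.names = FALSE, data.frame() would convert the names into standard ones by replacing the spaces with periods.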

Working with data types and data structures

This section summarizes the most important data types and data structures in R. In this brief overview, we won't discuss them in depth. We will only show a couple of examples that will allow you to understand the code shown throughout this book. If you want to dig deeper into them, you may look into their documentation or some of the references pointed out in this chapter's introduction.

The basic data types in R are numbers, text, and Boolean values (TRUE or FALSE), which R calls numerics, characters, and logicals, respectively. Strictly speaking, there are also types for integers, complex numbers, and raw data (bytes), but we won't use them explicitly in this book. The five basic data structures in R are vectors, factors, matrices, data frames, and lists, which we will summarize in the following sections.
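If you want to check which type a value has, you can use the base R class() function, as in the following short sketch. The specific values are arbitrary examples.

```r
# The three basic data types used throughout this book
class(3.14)    # "numeric"
class("text")  # "character"
class(TRUE)    # "logical"

# The remaining types, which we won't use explicitly
class(3L)          # "integer" (the L suffix marks an integer literal)
class(1i)          # "complex"
class(as.raw(255)) # "raw"
```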

Numerics

Numbers in R behave pretty much as you would mathematically expect them to. For example, the operation 2 / 3