RStudio for R Statistical Computing Cookbook E-Book

Andrea Cirillo

Publisher: Packt Publishing
Category: Science and New Technologies
Language: English

Description

Over 50 practical and useful recipes to help you perform data analysis with R by unleashing every native RStudio feature

About This Book

54 useful and practical tasks to improve working systems
Includes optimizing performance and reliability or uptime, reporting, system management tools, interfacing to standard data ports, and so on
Offers 10-15 real-life, practical improvements for each user type

Who This Book Is For

This book is targeted at R statisticians, data scientists, and R programmers. Readers with R experience who are looking to take the plunge into statistical computing will find this Cookbook particularly indispensable.

What You Will Learn

Familiarize yourself with the latest advanced R console features
Create advanced and interactive graphics
Manage your R project and project files effectively
Perform reproducible statistical analyses in your R projects
Use RStudio to design predictive models for a specific domain-based application
Use RStudio to effectively communicate your analysis results and even publish them to a blog
Put yourself on the frontiers of data science and data monetization in R with all the tools that are needed to effectively communicate your results and even transform your work into a data product

In Detail

The requirement of handling complex datasets, performing unprecedented statistical analysis, and providing real-time visualizations to businesses has concerned statisticians and analysts across the globe. RStudio is a useful and powerful tool for statistical analysis that harnesses the power of R for computational statistics, visualization, and data science, in an integrated development environment.

This book is a collection of recipes that will help you learn and understand RStudio features so that you can effectively perform statistical analysis and reporting, code editing, and R development. The first few chapters will teach you how to set up your own data analysis project in RStudio, acquire data from different data sources, and manipulate and clean data for analysis and visualization purposes. You'll get hands-on with various data visualization methods using ggplot2, and you will create interactive and multidimensional visualizations with D3.js. Additional recipes will help you optimize your code; implement various statistical models to manage large datasets; perform text analysis and predictive analysis; and master time series analysis, machine learning, forecasting; and so on. In the final few chapters, you'll learn how to create reports from your analytical application with the full range of static and dynamic reporting tools that are available in RStudio so that you can effectively communicate results and even transform them into interactive web applications.

Style and approach

RStudio is an open source Integrated Development Environment (IDE) for the R platform. The R programming language is used for statistical computing and graphics, which RStudio facilitates and enhances through its integrated environment.

This Cookbook will help you learn to write better R code using the advanced features of the R programming language and RStudio. Readers will learn advanced R techniques to compute on the language and control object evaluation within R functions. Some of the contents are:

Accessing an API with R
Substituting missing values by interpolation
Performing data filtering activities
R Statistical implementation for Geospatial data
Developing shiny add-ins to expand RStudio functionalities
Using GitHub with RStudio
Modelling a recommendation engine with R
Using R Markdown for static and dynamic reporting
Curating a blog through RStudio
Advanced statistical modelling with R and RStudio

Details

Page count: 254

Publication year: 2016

Sample

RStudio for R Statistical Computing Cookbook

Credits

About the Author

About the Reviewer

www.PacktPub.com

eBooks, discount offers, and more

Why Subscribe?

Preface

What this book covers

What you need for this book

Who this book is for

Sections

Getting ready

How to do it…

How it works…

There's more…

See also

Conventions

Reader feedback

Customer support

Downloading the example code

Downloading the color images of this book

Errata

Piracy

Questions

1. Acquiring Data for Your Project

Introduction

Acquiring data from the Web – web scraping tasks

Getting ready

How to do it...

There's more...

Accessing an API with R

Getting ready

How to do it…

How it works...

There's more...

Getting data from Twitter with the twitteR package

Getting ready

How to do it…

There's more...

Getting data from Facebook with the Rfacebook package

Getting ready

How to do it...

Getting data from Google Analytics

Getting ready

How to do it...

There's more...

Loading your data into R with rio packages

Getting ready

How to do it...

How it works...

There's more...

Converting file formats using the rio package

Getting ready

How to do it...

There's more...

2. Preparing for Analysis – Data Cleansing and Manipulation

Introduction

Getting a sense of your data structure with R

Getting ready

How to do it...

How it works...

Preparing your data for analysis with the tidyr package

Getting ready

How to do it...

How it works...

There's more...

Detecting and removing missing values

Getting ready

How to do it...

How it works...

There's more...

Substituting missing values using the mice package

Getting ready

How to do it...

How it works...

There's more...

Detecting and removing outliers

How to do it...

How it works...

Performing data filtering activities

Getting ready

How to do it…

How it works...

3. Basic Visualization Techniques

Introduction

Looking at your data using the plot() function

Getting ready

How to do it...

How it works...

Using pairs.panel() to look at (visualize) correlations between variables

Getting ready

How to do it...

How it works…

There's more…

Adding text to a ggplot2 plot at a custom location

Getting ready

How to do it...

How it works…

There's more…

Changing axes appearance to ggplot2 plot (continuous axes)

Getting ready

How to do it...

Producing a matrix of graphs with ggplot2

Getting ready

How to do it...

How it works…

Drawing a route on a map with ggmap

Getting ready

How to do it...

How it works…

See also

Making use of the igraph package to draw a network

Getting ready

How to do it...

How it works…

Showing communities in a network with the linkcomm package

Getting ready

How to do it…

How it works…

4. Advanced and Interactive Visualization

Introduction

Producing a Sankey diagram with the networkD3 package

Getting ready

How to do it...

How it works...

Creating a dynamic force network with the visNetwork package

Getting ready

How to do it...

How it works...

There's more...

Building a rotating 3D graph and exporting it as a GIF

Getting ready

How to do it...

Using the DiagrammeR package to produce a process flow diagram in RStudio

Getting ready

How to do it...

5. Power Programming with R

Introduction

Writing modular code in RStudio

Getting ready

How to do it...

How it works...

Implementing parallel computation in R

Getting ready

How to do it...

How it works...

There's more...

Creating custom objects and methods in R using the S3 system

How to do it...

How it works...

Evaluating your code performance using the profvis package

Getting ready

How to do it...

Comparing an alternative function's performance using the microbenchmarking package

Getting ready

How to do it...

Using GitHub with RStudio

Getting ready

How to do it...

There's more...

6. Domain-specific Applications

Introduction

Dealing with regular expressions

How to do it...

Analyzing PDF reports in a folder with the tm package

Getting ready

How to do it...

How it works...

Creating word clouds with the wordcloud package

Getting ready

How to do it...

How it works...

Performing a Twitter sentiment analysis

Getting ready

How to do it...

How it works...

Detecting fraud in e-commerce orders with Benford's law

Getting ready

How to do it...

How it works...

Measuring customer retention using cohort analysis in R

Getting ready

How to do it...

How it works...

Making a recommendation engine

Getting ready

How to do it...

Performing time series decomposition using the stl() function

Getting ready

How to do it...

Exploring time series forecasting with forecast()

Getting ready

How to do it...

Tracking stock movements using the quantmod package

Getting ready

How to do it...

Optimizing portfolio composition and maximising returns with the Portfolio Analytics package

Getting ready

How to do it...

Forecasting the stock market

Getting ready

How to do it...

7. Developing Static Reports

Introduction

Using one markup language for all types of documents – rmarkdown

Getting ready

How to do it...

How it works...

There’s more...

Writing and styling PDF documents with RStudio

Getting ready

How to do it...

There’s more...

Writing wonderful tufte handouts with the tufte package and rmarkdown

Getting ready

How to do it...

There’s more...

Sharing your code and plots with slides

How to do it...

Curating a blog through RStudio

Getting ready

How to do it...

8. Dynamic Reporting and Web Application Development

Introduction

Generating dynamic parametrized reports with R Markdown

Getting ready

How to do it...

How it works…

There's more…

Developing a single-file Shiny app

Getting ready

How to do it…

How it works…

See also

Changing a Shiny app UI based on user input

Getting ready

How to do it...

See also

Creating an interactive report with Shiny

How to do it…

How it works...

See also

Constructing RStudio add-ins

Getting ready

How to do it...

There's more…

Sharing your work on RPubs

Getting ready

How to do it...

There's more…

Deploying your app on Amazon AWS with ramazon

Getting ready

How to do it...

Index

RStudio for R Statistical Computing Cookbook

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: April 2016

Production reference: 1250416

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham B3 2PB, UK.

ISBN 978-1-78439-103-4

www.packtpub.com

Credits

Author

Andrea Cirillo

Reviewer

Mark van der Loo

Commissioning Editor

Kartikey Pandey

Acquisition Editor

Vinay Argekar

Content Development Editor

Deepti Thore

Technical Editor

Madhunikita Sunil Chindarkar

Copy Editor

Karuna Narayan

Project Coordinator

Shweta H Birwatkar

Proofreader

Safis Editing

Indexer

Rekha Nair

Graphics

Disha Haria

Production Coordinator

Aparna Bhagat

Cover Work

Aparna Bhagat

About the Author

Andrea Cirillo is currently working as an internal auditor at Intesa Sanpaolo banking group. He gained a lot of financial and external audit experience at Deloitte Touche Tohmatsu and internal audit experience at FNM, a listed Italian company.

His current main responsibilities involve evaluation of credit risk management models and their enhancement mainly within the field of the Basel III capital agreement.

He is married to Francesca and is the father of Tommaso, Gianna, and Zaccaria.

Andrea has written and contributed to a few useful R packages and regularly shares insightful advice and tutorials about R programming.

His research and work mainly focus on the use of R in the fields of risk management and fraud detection, mainly through modeling custom algorithms and developing interactive applications.

This book is the result of a lot of patience on the part of my wife and sons, who left me the time to write it, time that I should have spent with them.

It is also owed to Deepti Thore, my content development editor at Packt Publishing, who was so lenient with me when I missed my writing deadlines, which happened many times.

And to my colleagues, who endured my talks about the book every three hours and my requests for their opinions on almost every recipe.

To all of you, I would like to say a sincere thank you.

About the Reviewer

Mark van der Loo is a statistical researcher who specializes in data cleaning methodology and likes to program in R and C. He is the author and coauthor of several R packages published on CRAN, including stringdist, validate, deductive, lintools, and several others. In 2012, he authored Learning RStudio for R Statistical Computing, Packt Publishing, with Edwin de Jonge.

www.PacktPub.com

eBooks, discount offers, and more

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at <[email protected]> for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

https://www2.packtpub.com/books/subscription/packtlib

Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can search, access, and read Packt's entire library of books.

Why Subscribe?

Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser

Preface

Why should you read RStudio for R Statistical Computing Cookbook?

Well, even if there are plenty of books and blog posts about R and RStudio out there, this cookbook can be an unbeatable friend through your journey from being an average R and RStudio user to becoming an advanced and effective R programmer.

I have collected more than 50 recipes here, covering the full spectrum of data analysis activities, from data acquisition and treatment to results reporting.

All of them come from my direct experience as an auditor and data analyst and from knowledge sharing with the really dynamic and always growing R community.

I took great care selecting and highlighting those packages and practices that have proven to be the best for a given particular task, sometimes choosing between different packages designed for the same purpose.

You can therefore be sure that what you will learn here is the cutting edge of the R language and will place you on the right track of your learning path to R's mastery.

What this book covers

Chapter 1, Acquiring Data for Your Project, shows you how to import data into the R environment, taking you through web scraping and the process of connecting to an API.

Chapter 2, Preparing for Analysis – Data Cleansing and Manipulation, teaches you how to get your data ready for analysis, leveraging the latest data-handling packages and advanced statistical techniques for missing values and outlier treatments.

Chapter 3, Basic Visualization Techniques, lets you get the first sense of your data, highlighting its structure and discovering patterns within it.

Chapter 4, Advanced and Interactive Visualization, shows you how to produce advanced visualizations ranging from 3D graphs to animated plots.

Chapter 5, Power Programming with R, discusses how to write efficient R code, making use of the R object-oriented systems and advanced tools for code performance evaluation.

Chapter 6, Domain-specific Applications, shows you how to apply the R language to a wide range of problems related to different domains, from financial portfolio optimization to e-commerce fraud detection.

Chapter 7, Developing Static Reports, helps you discover the reporting tools available within the RStudio IDE and how to make the most of them to produce static reports for sharing results of your work.

Chapter 8, Dynamic Reporting and Web Application Development, collects recipes that make use of the latest features introduced in RStudio, from Shiny web applications with dynamic UIs to RStudio add-ins.

What you need for this book

The basic requirements for this book are the latest versions of R and RStudio, which you can download from the following URLs:

For Windows: https://cran.r-project.org/bin/windows/base/
For Mac OS X: https://cran.r-project.org/bin/macosx/
https://www.rstudio.com/products/rstudio/download/

More software will be needed for a few specific recipes, which will be highlighted in the Getting Ready section of the respective recipe.

Just a closing note: all the software employed in this book is available for free for personal use, and its greatest advantage is that it is open source and powered by the R community.

Who this book is for

This book was developed and written keeping in mind an average R and RStudio user who would like to make the move from good to great in their programming skills with the language.

If you think you are quite good at R and RStudio but you are still missing something in order to be great, this book is exactly what you need to read.

Sections

In this book, you will find several headings that appear frequently (Getting ready, How to do it, How it works, There's more, and See also).

To give clear instructions on how to complete a recipe, we use these sections as follows:

Getting ready

This section tells you what to expect in the recipe, and describes how to set up any software or any preliminary settings required for the recipe.

How to do it…

This section contains the steps required to follow the recipe.

How it works…

This section usually consists of a detailed explanation of what happened in the previous section.

There's more…

This section consists of additional information about the recipe in order to make the reader more knowledgeable about the recipe.

Conventions

In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "The plot() function is one of the most powerful functions in base R."

A block of code is set as follows:

> str(lesmiserables)
'data.frame': 254 obs. of 2 variables:
 $ V1: Factor w/ 73 levels "Anzelma","Babet",..: 61 49 55 55 21 33 12 23 20 62 ...
 $ V2: Factor w/ 49 levels "Babet","Bahorel",..: 42 42 42 36 42 42 42 42 42 42 ...

Any command-line input or output is written as follows:

install.packages("linkcomm")
library(linkcomm)

New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: "In order to embed your Sankey diagram, you can leverage the RStudio Save as Web Page control from the Export menu."

Note

Warnings or important notes appear in a box like this.

Tip

Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.

To send us general feedback, simply e-mail <[email protected]>, and mention the book's title in the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

You can download the code files by following these steps:

Log in or register to our website using your e-mail address and password.
Hover the mouse pointer on the SUPPORT tab at the top.
Click on Code Downloads & Errata.
Enter the name of the book in the Search box.
Select the book for which you're looking to download the code files.
Choose from the drop-down menu where you purchased this book from.
Click on Code Download.

You can also download the code files by clicking on the Code Files button on the book's webpage at the Packt Publishing website. This page can be accessed by entering the book's name in the Search box. Please note that you need to be logged in to your Packt account.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

WinRAR / 7-Zip for Windows
Zipeg / iZip / UnRarX for Mac
7-Zip / PeaZip for Linux

Downloading the color images of this book

We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from https://www.packtpub.com/sites/default/files/downloads/RStudioforRStatisticalComputingCookbook_ColorImages.pdf.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.

To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at <[email protected]> with a link to the suspected pirated material.

We appreciate your help in protecting our authors and our ability to bring you valuable content.

Questions

If you have a problem with any aspect of this book, you can contact us at <[email protected]>, and we will do our best to address the problem.

Chapter 1. Acquiring Data for Your Project

In this chapter, we will cover the following recipes:

Acquiring data from the Web – web scraping tasks
Accessing an API with R
Getting data from Twitter with the twitteR package
Getting data from Facebook with the Rfacebook package
Getting data from Google Analytics
Loading your data into R with rio packages
Converting file formats using the rio package

Introduction

The American statistician W. Edwards Deming once said:

"Without data you are just another man with an opinion."

I think this great quote is enough to highlight the importance of the data acquisition phase of every data analysis project. This phase is exactly where we are going to start. This chapter will give you tools for scraping the Web, accessing data via web APIs, and, thanks to the rio package, quickly importing nearly every kind of file you are likely to work with.

All the recipes in this book are based on the great and popular packages developed and maintained by the members of the R community.

After reading this section, you will be able to get all your data into R to start your data analysis project, no matter where it comes from.

Before starting the data acquisition process, you should gain a clear understanding of your data needs. In other words, what data do you need in order to get solutions to your problems?

A rule of thumb to solve this problem is to look at the process that you are investigating—from input to output—and outline all the data that will go in and out during its development.

In this data, you will surely have that chunk of data that is needed to solve your problem.

In particular, for each type of data you are going to acquire, you should define the following:

The source: This is where data is stored
The required authorizations: This refers to any form of authorization/authentication that is needed in order to get the data you need
The data format: This is the format in which data is made available
The data license: This is to check whether there is any license covering data utilization/distribution or whether there is any need for ethics/privacy considerations

After covering these points for each set of data, you will have a clear vision of future data acquisition activities. This will let you plan ahead the activities needed to clearly define resources, steps, and expected results.
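One possible way to keep track of these four points is a small inventory table, filled in before any acquisition code is written. The following is only a sketch in base R: the dataset names and the values in each column are hypothetical placeholders chosen for illustration, not part of the recipes that follow.

```r
# Hypothetical inventory of the data needed for a project.
# Every value below is an illustrative placeholder.
data_inventory <- data.frame(
  dataset       = c("r_release_history", "site_visits"),
  source        = c("Wikipedia page", "Google Analytics API"),
  authorization = c("none", "OAuth token"),
  format        = c("HTML table", "JSON"),
  license       = c("CC BY-SA", "internal use only"),
  stringsAsFactors = FALSE
)

# List the datasets that require credentials before acquisition can start
needs_auth <- data_inventory$dataset[data_inventory$authorization != "none"]
print(needs_auth)
```

Keeping the inventory as a data frame means it can be filtered or exported like any other dataset when planning the acquisition steps.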

Acquiring data from the Web – web scraping tasks

Given the advances in the Internet of Things (IoT) and the progress of cloud computing, we can safely say that in the future a huge part of our data will be available through the Internet, which does not mean it will be public.

It is, therefore, crucial to know how to take that data from the Web and load it into your analytical environment.

You can find data on the Web either in the form of data statically stored on websites (for example, tables on Wikipedia or similar websites) or in the form of data stored on the cloud, which is accessible via APIs.

In this recipe, we will go through all the steps you need to get data statically exposed on websites in the form of tabular and nontabular data.

This specific example will show you how to get data from a specific Wikipedia page, the one about the R programming language: https://en.wikipedia.org/wiki/R_(programming_language).

Getting ready

Data statically exposed on web pages is actually pieces of web page code. Getting them from the Web to our R environment requires us to read that code and find where exactly the data is.

Dealing with complex web pages can become a really challenging task, but luckily, SelectorGadget was developed to help you with this job. SelectorGadget is a bookmarklet, developed by Andrew Cantino and Kyle Maxwell, that lets you easily figure out the CSS selector of your data on the web page you are looking at. Basically, the CSS selector can be seen as the address of your data on the web page, and you will need it within the R code that you are going to write to scrape your data from the Web (refer to the next paragraph).

Note

The CSS selector is the token that is used within the CSS code to identify elements of the HTML code based on their name.

CSS selectors are used within the CSS code to identify which elements are to be styled using a given piece of CSS code. For instance, the following rule sets a margin of 0 and a padding of 0 on all elements (CSS selector *):

* { margin: 0; padding: 0; }
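The same selector syntax is what you will pass to R when scraping. As a minimal sketch, assuming the rvest package is installed, the following code parses a small, made-up HTML fragment (it is not taken from any real page) and picks elements by tag name, by class, and by id:

```r
library(rvest)

# Hypothetical page fragment, used only to illustrate selector syntax
page <- read_html('
  <html><body>
    <h1 id="title">R releases</h1>
    <p class="note">First note</p>
    <p class="note">Second note</p>
    <p>Unclassified paragraph</p>
  </body></html>')

all_paragraphs <- html_nodes(page, "p")       # tag selector: every <p>
notes          <- html_nodes(page, ".note")   # class selector: class="note"
title          <- html_nodes(page, "#title")  # id selector: id="title"

length(all_paragraphs)
html_text(notes)
html_text(title)
```

A tag selector is the broadest of the three, while class and id selectors narrow the match, which is why SelectorGadget usually proposes a combination of them.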

SelectorGadget is currently employable only via the Chrome browser, so you will need to install the browser before carrying on with this recipe. You can download and install the latest version of Chrome from https://www.google.com/chrome/.

SelectorGadget is available as a Chrome extension; navigate to the following URL while already on the page showing the data you need:

:javascript:(function(){ var%20s=document.createElement('div'); s.innerHTML='Loading…' ;s.style.color='black'; s.style.padding='20px'; s.style.position='fixed'; s.style.zIndex='9999'; s.style.fontSize='3.0em'; s.style.border='2px%20solid%20black'; s.style.right='40px'; s.style.top='40px'; s.setAttribute('class','selector_gadget_loading'); s.style.background='white'; document.body.appendChild(s); s=document.createElement('script'); s.setAttribute('type','text/javascript'); s.setAttribute('src','https://dv0akt2986vzh.cloudfront.net/unstable/lib/selectorgadget.js');document.body.appendChild(s); })();

This long URL shows that SelectorGadget is provided as JavaScript; you can make this out from the :javascript: token at the very beginning.

We can further analyze the URL by decomposing it into three main parts, which are as follows:

Creation of a new div element on the page with the document.createElement('div') statement
Setting of aesthetic attributes, made up of all the s.style… tokens
Retrieval of the .js file content from https://dv0akt2986vzh.cloudfront.net/unstable/lib/selectorgadget.js

The .js file is where SelectorGadget's core functionalities are actually defined and from where they are loaded to make them available to users.

That being said, I'm not suggesting that you try to use this link to employ SelectorGadget for your web scraping purposes, but I would rather suggest that you look for the Chrome extension or at the official SelectorGadget page, http://selectorgadget.com. Once you find the link on the official page, save it as a bookmark so that it is easily available when you need it.

The other tool we are going to use in this recipe is the rvest package, which offers great web scraping functionalities within the R environment.

To make it available, you first have to install it and load it into the global environment by running the following:

install.packages("rvest")
library(rvest)

How to do it...

Run SelectorGadget. To do so, after navigating to the web page you are interested in, activate SelectorGadget by running the Chrome extension or clicking on the bookmark that we previously saved.

In both cases, after activating the gadget, a Loading… message will appear, and then, you will find a bar on the bottom-right corner of your web browser, as shown in the following screenshot:

You are now ready to select the data you are interested in.

Select the data you are interested in. After clicking on the data you are going to scrape, you will note that besides the data you've selected, there are some other parts on the page that will turn yellow:

This is because SelectorGadget is trying to guess what you are looking at by highlighting all the elements included in the CSS selector that it considers to be most useful for you.

If it is guessing wrong, you just have to click on the wrongly highlighted parts and those will turn red:

When you are done with this fine-tuning process, SelectorGadget will have correctly identified a proper selector, and you can move on to the next step.

Find your data location on the page. To do this, all you have to do is copy the CSS selector that you will find in the bar at the bottom-right corner:

This piece of text will be all you need in order to scrape the web page from R.

The next step is to read data from the Web with the rvest package. The rvest package by Hadley Wickham is one of the most comprehensive packages for web scraping activities in R. Take a look at the There's more... section for further information on package objectives and functionalities.

For now, it is enough to know that the rvest package lets you download HTML code and read the data stored within the code easily.

Now, we need to import the HTML code from the web page. First of all, we need to define an object storing all the HTML code of the web page we are looking at:

page_source <- read_html('https://en.wikipedia.org/wiki/R_(programming_language)')

This code leverages the read_html() function, which retrieves the source code that resides at the given URL directly from the Web.
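To experiment without a network connection, read_html() can also parse a literal HTML string. The following is only a sketch: the sample_html document and the sample_page object are made up for illustration and are not part of the recipe.

```r
library(rvest)

# A tiny, made-up HTML document used in place of a live URL.
sample_html <- '<html>
  <head><title>Sample page</title></head>
  <body>
    <table class="wikitable">
      <tr><th>Release</th><th>Date</th></tr>
      <tr><th>1.0</th><td>2000-02-29</td></tr>
    </table>
  </body>
</html>'

# read_html() parses the string just as it would parse a downloaded page.
sample_page <- read_html(sample_html)

# The result is a parsed document object, not plain text.
class(sample_page)
```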

Next, we will select the defined blocks. Once you have got your HTML code, it is time to extract the part of the code you are interested in. This is done using the html_nodes() function, which takes as an argument the CSS selector retrieved using SelectorGadget. This will result in a line of code similar to the following:

version_block <- html_nodes(page_source,".wikitable th , .wikitable td")

As you can imagine, this code extracts all the content of the selected nodes, including HTML tags.
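The effect of the selector is easier to see on a small page. The sketch below uses a made-up HTML fragment (sample_page and sample_block are hypothetical names) with one table of class wikitable and one unrelated table, and applies the same selector:

```r
library(rvest)

# Made-up fragment: one .wikitable table and one table that should be ignored.
sample_page <- read_html('<table class="wikitable">
    <tr><th>Release</th><th>Date</th></tr>
    <tr><th>1.0</th><td>2000-02-29</td></tr>
  </table>
  <table class="other"><tr><td>ignored</td></tr></table>')

# Same selector as in the recipe: header and data cells inside .wikitable only.
sample_block <- html_nodes(sample_page, ".wikitable th , .wikitable td")

# Three <th> cells and one <td> cell match; the cell in the other table does not.
length(sample_block)
```

In recent rvest releases, the same function is also available under the name html_elements(); html_nodes() is used here to match the recipe.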

Note

The HTML language

HyperText Markup Language (HTML) is a markup language that is used to define the format of web pages.

The basic idea behind HTML is to structure the web page into a format with a head and body, each of which contains a variable number of tags, which can be considered as subcomponents of the structure.

The head is used to store information and components that will not be seen by the user but will affect the web page's behavior, for instance, a Google Analytics script used for tracking page visits. The body contains all the contents that will be shown to the reader.

Since the HTML code is composed of a nested structure, it is common to compare this structure to a tree, and here, different components are also referred to as nodes.
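The tree structure can be inspected directly from R. As a sketch on a made-up page (sample_page, body_node, and the other object names are hypothetical), the html_children() and html_name() functions from rvest list the nodes nested under a given node:

```r
library(rvest)

# Made-up page with the usual head/body structure.
sample_page <- read_html('<html>
  <head><title>Sample page</title></head>
  <body><h1>Heading</h1><p>Some text</p></body>
</html>')

# Nodes directly under the root <html> node.
root_children <- html_name(html_children(sample_page))
root_children

# Nodes directly under <body>.
body_node <- html_node(sample_page, "body")
body_children <- html_name(html_children(body_node))
body_children
```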

Printing out the version_block object, you will obtain a result similar to the following:

print(version_block)
{xml_nodeset (45)}
 [1] <th>Release</th>
 [2] <th>Date</th>
 [3] <th>Description</th>
 [4] <th>0.16</th>
 [5] <td/>
 [6] <td>This is the last <a href="/wiki/Alpha_test" title="Alpha test" class="mw-redirect">alp ...
 [7] <th>0.49</th>
 [8] <td style="white-space:nowrap;">1997-04-23</td>
 [9] <td>This is the oldest available <a href="/wiki/Source_code" title="Source code">source</a ...
[10] <th>0.60</th>
[11] <td>1997-12-05</td>
[12] <td>R becomes an official part of the <a href="/wiki/GNU_Project" title="GNU Project">GNU ...
[13] <th>1.0</th>
[14] <td>2000-02-29</td>
[15] <td>Considered by its developers stable enough for production use.<sup id="cite_ref-35" cl ...
[16] <th>1.4</th>
[17] <td>2001-12-19</td>
[18] <td>S4 methods are introduced and the first version for <a href="/wiki/Mac_OS_X" title="Ma ...
[19] <th>2.0</th>
[20] <td>2004-10-04</td>

This result is not exactly what you are looking for if you are going to work with this data. However, you don't have to worry about that since we are going to give your text a better shape in the very next step.

In order to obtain a readable and actionable format, we need one more step: extracting text from HTML tags.

This can be done using the html_text()
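As a minimal sketch of this step, run on a made-up table rather than the Wikipedia page (sample_page, sample_block, and sample_text are hypothetical names), html_text() strips the tags from each selected node and returns a plain character vector:

```r
library(rvest)

# Made-up table standing in for the scraped page.
sample_page <- read_html('<table class="wikitable">
  <tr><th>Release</th><th>Date</th></tr>
  <tr><th>1.0</th><td>2000-02-29</td></tr>
</table>')

# Select the cells, then keep only the text inside the tags.
sample_block <- html_nodes(sample_page, ".wikitable th , .wikitable td")
sample_text <- html_text(sample_block)

sample_text
# [1] "Release"    "Date"       "1.0"        "2000-02-29"
```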

