Automated Data Collection with R

Simon Munzert
Description

A hands-on guide to web scraping and text mining for both beginners and experienced users of R

  • Introduces fundamental concepts of the architecture of the web and of databases, covering HTTP, HTML, XML, JSON, and SQL.
  • Provides basic techniques to query web documents and datasets (XPath and regular expressions).
  • An extensive set of exercises guides the reader through each technique.
  • Explores both supervised and unsupervised techniques as well as advanced techniques such as data scraping and text management.
  • Features case studies throughout, along with examples for each technique presented.
  • R code and solutions to exercises featured in the book are provided on a supporting website.


Page count: 762

Year of publication: 2014




Automated Data Collection with R

A Practical Guide to Web Scraping and Text Mining

Simon Munzert

Department of Politics and Public Administration, University of Konstanz, Germany

Christian Rubba

Department of Political Science, University of Zurich and National Center of Competence in Research, Switzerland

Peter Meißner

Department of Politics and Public Administration, University of Konstanz, Germany

Dominic Nyhuis

Department of Political Science, University of Mannheim, Germany

This edition first published 2015 © 2015 John Wiley & Sons, Ltd

Registered office: John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, United Kingdom

For details of our global editorial offices, for customer services and for information about how to apply for permission to reuse the copyright material in this book please see our website at www.wiley.com.

The right of the author to be identified as the author of this work has been asserted in accordance with the Copyright, Designs and Patents Act 1988.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by the UK Copyright, Designs and Patents Act 1988, without the prior permission of the publisher.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.

Designations used by companies to distinguish their products are often claimed as trademarks. All brand names and product names used in this book are trade names, service marks, trademarks or registered trademarks of their respective owners. The publisher is not associated with any product or vendor mentioned in this book.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. It is sold on the understanding that the publisher is not engaged in rendering professional services and neither the publisher nor the author shall be liable for damages arising herefrom. If professional advice or other expert assistance is required, the services of a competent professional should be sought.

Library of Congress Cataloging-in-Publication Data

Munzert, Simon.
Automated data collection with R : a practical guide to web scraping and text mining / Simon Munzert, Christian Rubba, Peter Meißner, Dominic Nyhuis.
pages cm
Summary: “This book provides a unified framework of web scraping and information extraction from text data with R for the social sciences”– Provided by publisher.
Includes bibliographical references and index.
ISBN 978-1-118-83481-7 (hardback)
1. Data mining. 2. Automatic data collection systems. 3. Social sciences–Research–Data processing. 4. R (Computer program language) I. Title.
QA76.9.D343M865 2014
006.3′12–dc23

2014032266

A catalogue record for this book is available from the British Library.

ISBN: 9781118834817

To my parents, for their unending support. Also, to Stefanie.

—Simon

To my parents, for their love and encouragement.

—Christian

To Kristin, Buddy, and Paul for love, regular walks, and a final deadline.

—Peter

To my family.

—Dominic

CONTENTS

Preface

What you won't learn from reading this book

Why R?

Recommended reading to get started with R

Typographic conventions

The book's website

Disclaimer

Acknowledgments

Note

Chapter 1: Introduction

1.1 Case study: World Heritage Sites in Danger

1.2 Some remarks on web data quality

1.3 Technologies for disseminating, extracting, and storing web data

1.4 Structure of the book

Notes

Part One: A Primer on Web and Data Technologies

Chapter 2: HTML

2.1 Browser presentation and source code

2.2 Syntax rules

2.3 Tags and attributes

2.4 Parsing

Summary

Further reading

Problems

Notes

Chapter 3: XML and JSON

3.1 A short example XML document

3.2 XML syntax rules

3.3 When is an XML document well formed or valid?

3.4 XML extensions and technologies

3.5 XML and R in practice

3.6 A short example JSON document

3.7 JSON syntax rules

3.8 JSON and R in practice

Summary

Further reading

Problems

Notes

Chapter 4: XPath

4.1 XPath—a query language for web documents

4.2 Identifying node sets with XPath

4.3 Extracting node elements

Summary

Further reading

Problems

Notes

Chapter 5: HTTP

5.1 HTTP fundamentals

5.2 Advanced features of HTTP

5.3 Protocols beyond HTTP

5.4 HTTP in action

Summary

Further reading

Problems

Notes

Chapter 6: AJAX

6.1 JavaScript

6.2 XHR

6.3 Exploring AJAX with Web Developer Tools

Summary

Further reading

Problems

Chapter 7: SQL and relational databases

7.1 Overview and terminology

7.2 Relational databases

7.3 SQL: a language to communicate with databases

7.4 Databases in action

Summary

Further reading

Problems

Pokemon problems

ParlGov problems

Notes

Chapter 8: Regular expressions and essential string functions

8.1 Regular expressions

8.2 String processing

8.3 A word on character encodings

Summary

Further reading

Problems

Notes

Part Two: A Practical Toolbox for Web Scraping and Text Mining

Chapter 9: Scraping the Web

9.1 Retrieval scenarios

9.2 Extraction strategies

9.3 Web scraping: Good practice

9.4 Valuable sources of inspiration

Summary

Further reading

Problems

Notes

Chapter 10: Statistical text processing

10.1 The running example: Classifying press releases of the British government

10.2 Processing textual data

10.3 Supervised learning techniques

10.4 Unsupervised learning techniques

Summary

Further reading

Notes

Chapter 11: Managing data projects

11.1 Interacting with the file system

11.2 Processing multiple documents/links

11.3 Organizing scraping procedures

11.4 Executing R scripts on a regular basis

Notes

Part Three: A Bag of Case Studies

Chapter 12: Collaboration networks in the US Senate

12.1 Information on the bills

12.2 Information on the senators

12.3 Analyzing the network structure

12.4 Conclusion

Notes

Chapter 13: Parsing information from semistructured documents

13.1 Downloading data from the FTP server

13.2 Parsing semistructured text data

13.3 Visualizing station and temperature data

Notes

Chapter 14: Predicting the 2014 Academy Awards using Twitter

14.1 Twitter APIs: Overview

14.2 Twitter-based forecast of the 2014 Academy Awards

14.3 Conclusion

Notes

Chapter 15: Mapping the geographic distribution of names

15.1 Developing a data collection strategy

15.2 Website inspection

15.3 Data retrieval and information extraction

15.4 Mapping names

15.5 Automating the process

Summary

Notes

Chapter 16: Gathering data on mobile phones

16.1 Page exploration

16.2 Scraping procedure

16.3 Graphical analysis

16.4 Data storage

Note

Chapter 17: Analyzing sentiments of product reviews

17.1 Introduction

17.2 Collecting the data

17.3 Analyzing the data

17.4 Conclusion

Notes

References

General index

Package index

Function index

End User License Agreement

List of Tables

Chapter 2

Table 2.1

Table 2.2

Table 2.3

Chapter 3

Table 3.1

Table 3.2

Table 3.3

Table 3.4

Chapter 4

Table 4.1

Table 4.2

Table 4.3

Table 4.4

Chapter 5

Table 5.1

Table 5.2

Table 5.3

Table 5.4

Chapter 7

Table 7.1

Table 7.2

Table 7.3

Table 7.4

Table 7.5

Table 7.6

Table 7.7

Table 7.8

Table 7.9

Table 7.10

Table 7.11

Chapter 8

Table 8.1

Table 8.2

Table 8.3

Table 8.4

Table 8.5

Chapter 9

Table 9.1

Chapter 10

Table 10.1

Chapter 11

Table 11.1

Table 11.2

Chapter 12

Table 12.1

Table 12.2

Table 12.3

Table 12.4

Table 12.5

Chapter 13

Table 13.1

Chapter 14

Table 14.1

Table 14.2

Chapter 15

Table 15.1

List of Illustrations

Chapter 1

Figure 1.1 Location of UNESCO World Heritage Sites in danger (as of March 2014). Cultural sites are marked with triangles, natural sites with dots

Figure 1.2 Distribution of years when World Heritage Sites were put on the list of endangered sites

Figure 1.3 Distribution of time spans between year of inscription and year of endangerment of World Heritage Sites in danger

Figure 1.4 Technologies for disseminating, extracting, and storing web data

Chapter 2

Figure 2.1 Browser view of a simple HTML document

Figure 2.2 Source view of a simple HTML document

Figure 2.3 Inspect elements view of a simple HTML document

Figure 2.4 Source code of OurFirstHTML.html

Figure 2.5 A tree perspective on OurFirstHTML.html (see Figure 2.4)

Chapter 3

Figure 3.1 An XML code example: James Bond movies

Figure 3.2 Tree perspective on an XML document

Figure 3.3 How RSS works

Figure 3.4 SVG code example: R logo

Figure 3.5 The R logo as SVG image from code in Figure 3.4

Figure 3.6 XML example document: stock data

Figure 3.7 DTD of stock data XML file (see Figure 3.6)

Figure 3.8 R code for event-driven parsing

Figure 3.9 JSON code example: Indiana Jones movies

Chapter 4

Figure 4.1 A tree perspective on parsed_doc

Figure 4.2 Visualizing node relations. Descriptions are presented in relation to the white node

Chapter 5

Figure 5.1 User–server communication via HTTP

Figure 5.2 HTTP request schema

Figure 5.3 HTTP response schema

Figure 5.4 The principle of web proxies

Figure 5.5 The principle of HTTPS

Chapter 6

Figure 6.1 JavaScript-enriched fortunes1.html (a) Initial state (b) After a click on “Robert Gentleman”

Figure 6.2 The user–server communication process using the XMLHttpRequest. Adapted from Stepp et al. (2012)

Figure 6.3 View on fortunes2.html from the Elements panel

Figure 6.4 View on fortunes2.html from the Network panel

Figure 6.5 Information on quotes.html from the Network panel (a) Preview (b) Headers

Chapter 7

Figure 7.1 How users, R, SQL, DBMS, and databases are related to each other

Figure 7.2 Database scheme

Figure 7.3 SQL example database scheme

Chapter 9

Figure 9.1 Screenshot of HTTP authentication mask at http://www.r-datacollection.com/materials/solutions

Figure 9.2 The Federal Contributions database

Figure 9.3 Initializing the Selenium Java Server

Figure 9.4 The mechanics of web APIs

Figure 9.5 An R wrapper function for Yahoo's Weather Feed

Figure 9.6 Scraping with regular expressions

Figure 9.7 Scraping with XPath

Figure 9.8 Data collection with APIs

Figure 9.9 R code for parsing robots.txt files

Figure 9.10 An etiquette manual for web scraping

Figure 9.11 Helper functions for handling HTTP If-Modified-Since header field

Chapter 10

Figure 10.1 Output of hierarchical clustering of UK Government press releases

Figure 10.2 Output of Correlated Topic Model of UK Government press releases

Chapter 11

Figure 11.1 Time-series of Apple stock values, 2003–2013

Figure 11.2 Trigger selection on Windows platform

Figure 11.3 Action selection on Windows platform

Chapter 12

Figure 12.1 R procedure to collect list of bill sponsors

Figure 12.2 Cosponsorship network of senators

Chapter 13

Figure 13.1 Excerpt from a text file on temperature data from Californian weather stations, accessible at ftp://ftp.wcc.nrcs.usda.gov/data/climate/table/temperature/history/california/

Figure 13.2 R-based parsing function for temperature text files

Figure 13.3 Weather station locations on an OpenStreetMap map

Figure 13.4 Overall monthly temperature means for selected weather stations. Lines present average monthly temperatures in degrees Celsius for all years in the dataset. Small gray dots are daily temperatures for all years within the dataset.

Chapter 14

Figure 14.1 Tweets per hour on the 2014 Academy Awards

Chapter 15

Figure 15.1 Excerpt from the robots.txt file on www.dastelefonbuch.de

Figure 15.2 Geographic distribution of “Feuersteins”

Figure 15.3 Generalized R code to scrape entries from www.dastelefonbuch.de

Figure 15.4 Generalized R code to parse entries from www.dastelefonbuch.de

Figure 15.5 Generalized R code to map entries from www.dastelefonbuch.de

Figure 15.6 Results of three calls of the namesPlot() function

Chapter 16

Figure 16.1 Amazon's search form

Figure 16.2 Prices, customer rating, and best-seller positioning of mobile phones. Black dots mark placement of individual products and white dots with horizontal and vertical lines mark the five best-selling items per plot

Chapter 17

Figure 17.1 Violin plots of estimated sentiment versus product rating in Amazon reviews

Figure 17.2 Estimated sentiment in Amazon review titles versus product rating. The data are jittered on both axes.

Figure 17.3 Maximum entropy classification results of Amazon reviews

Figure 17.4 Support vector machine classification results of Amazon reviews


Preface

Figure 1 The research process not using R—stylized example

Figure 2 The research process using R—stylized example


Preface

The rapid growth of the World Wide Web over the past two decades tremendously changed the way we share, collect, and publish data. Firms, public institutions, and private users provide every imaginable type of information and new channels of communication generate vast amounts of data on human behavior. What was once a fundamental problem for the social sciences—the scarcity and inaccessibility of observations—is quickly turning into an abundance of data. This turn of events does not come without problems. For example, traditional techniques for collecting and analyzing data may no longer suffice to overcome the tangled masses of data. One consequence of the need to make sense of such data has been the inception of “data scientists,” who sift through data and are greatly sought after by researchers and businesses alike.

Along with the triumphant entry of the World Wide Web, we have witnessed a second trend, the increasing popularity and power of open-source software like R. For quantitative social scientists, R is among the most important statistical software. It is growing rapidly due to an active community that constantly publishes new packages. Yet, R is more than a free statistics suite. It also incorporates interfaces to many other programming languages and software solutions, thus greatly simplifying work with data from various sources.

On a personal note, we can say the following about our work with social scientific data:

our financial resources are scarce;

we have little time or desire to collect data by hand;

we are interested in working with up-to-date, high-quality, and data-rich sources; and

we want to document our research from the beginning (data collection) to the end (publication), so that it can be reproduced.

In the past, we frequently found ourselves inconvenienced by the need to manually assemble data from various sources, hoping that the inevitable coding and copy-and-paste errors were unsystematic. Eventually we grew weary of collecting research data in a non-reproducible manner that is prone to errors, cumbersome, and subject to heightened risks of death by boredom. Consequently, we have increasingly incorporated the data collection and publication processes into the familiar software environment that already helps with our statistical analyses—R. The program offers a great infrastructure to extend the daily workflow to the steps before and after the actual data analysis.

Although R is not about to collect survey data on its own or conduct experiments any time soon, we consider the techniques presented in this book to be more than “the poor man's substitute” for costly surveys, experiments, and student-assistant coders. We believe that they are a powerful supplement to the portfolio of modern data analysts. We value the collection of data from online resources not only as a more cost-efficient alternative to traditional data acquisition methods, but increasingly think of it as the only approach for assembling datasets from new and developing sources. Moreover, we cherish program-based solutions because they guarantee reliability, reproducibility, and time-efficiency, and yield higher quality datasets. Beyond productivity, you might find that you enjoy writing code and drafting algorithmic solutions to otherwise tedious manual labor. In short, we are convinced that if you are willing to make the investment and adopt the techniques proposed in this book, you will benefit from a lasting improvement in the ease and quality with which you conduct your data analyses.

If you have identified online data as an appropriate resource for your project, is web scraping or statistical text processing (and therefore an automated or semi-automated data collection procedure) really necessary? While we cannot hope to offer any definitive guidelines, here are some useful criteria. If you find yourself answering several of them affirmatively, an automated approach might be the right choice:

Do you plan to repeat the task from time to time, for example, in order to update your database?

Do you want others to be able to replicate your data collection process?

Do you deal with online sources of data frequently?

Is the task non-trivial in terms of scope and complexity?

If the task can also be accomplished manually—do you lack the resources to let others do the work?

Are you willing to automate processes by means of programming?

Ideally, the techniques presented in this book enable you to create powerful collections of existing but unstructured or unsorted data that no one has analyzed before, at very reasonable cost. In many cases, you will not get far without rethinking, refining, and combining the proposed techniques to fit the specifics of your subject. In any case, we hope you find the topics of this book inspiring and perhaps even eye-opening: The streets of the Web are paved with data that cannot wait to be collected.

What you won't learn from reading this book

When you browse the table of contents, you get a first impression of what you can expect to learn from reading this book. Since it is hard to identify topics you might have hoped for that are in fact not covered, we will name some aspects that you will not find in this volume.

What you will not get in this book is an introduction to the R environment. There are plenty of excellent introductions—both printed and online—and this book won't be just another addition to the pile. In case you have not previously worked with R, there is no reason to set this book aside in disappointment. In the next section we'll suggest some well-written R introductions.

You should also not expect the definitive guide to web scraping or text mining. First, we focus on a software environment that was not specifically tailored to these purposes. There might be applications where R is not the ideal solution for your task and other software environments are better suited. We will not bother you with alternatives such as PHP, Python, Ruby, or Perl. To find out whether this book is helpful for you, ask yourself whether you are already using or planning to use R for your daily work. If the answer to both questions is no, you should probably consider those alternatives. But if you already use R or intend to use it, you can spare yourself the effort of learning yet another language and stay within a familiar environment.

This book is not strictly speaking about data science either. There are excellent introductions to the topic like the recently published books by O'Neil and Schutt (2013), Torgo (2010), Zhao (2012), and Zumel and Mount (2014). What is occasionally missing in these introductions is how data for data science applications are actually acquired. In this sense, our book serves as a preparatory step for data analyses but also provides guidance on how to manage available information and keep it up to date.

Finally, what you most certainly will not get is the perfect solution to your specific problem. It is almost inherent in the data collection process that the fields where the data are harvested are never exactly alike, and sometimes rapidly change shape. Our goal is to enable you to adapt the pieces of code provided in the examples and case studies to create new pieces of code to help you succeed in collecting the data you need.

Why R?

There are many reasons why we think that R is a good solution for the problems that are covered in this book. To us, the most important points are:

  • R is freely and easily accessible. You can download, install, and use it wherever and whenever you want. There are huge benefits to not being a specialist in expensive proprietary programs, as you do not depend on the willingness of employers to pay licensing fees.

  • For a software environment with a primarily statistical focus, R has a large community that continues to flourish. R is used by social scientists, medical scientists, psychologists, biologists, geographers, and linguists, as well as in business. This range allows you to share code with many developers and profit from well-documented applications in diverse settings.

  • R is open source. This means that you can easily retrace how functions work and modify them with little effort. It also means that program modifications are not controlled by an exclusive team of programmers that takes care of the product. Even if you are not interested in contributing to the development of R, you will still reap the benefits of having access to a wide variety of optional extensions—packages. The number of packages is continuously growing and many existing packages are frequently updated. You can find nice overviews of popular themes in R usage at http://cran.r-project.org/web/views/.

  • R is reasonably fast in ordinary tasks. You will likely agree with this impression if you have used other statistical software like SPSS or Stata and have gotten into the habit of going on holiday when running more complex models—not to mention the pain caused by the “one session, one data frame” logic. There are even extensions to speed up R, for example by making C++ code available from within R, as the Rcpp package does.

  • R is powerful in creating data visualizations. Although this is not an obvious plus for data collection, you would not want to miss R's graphics facilities in your daily workflow. We will demonstrate how a visual inspection of collected data can and should be a first step in data validation, and how graphics provide an intuitive way of summarizing large amounts of data.

  • Work in R is mainly command line based. This might sound like a disadvantage to R rookies, but it is the only way to produce reproducible results, in contrast to point-and-click programs.

  • R is not picky about operating systems. It can generally be run under Windows, Mac OS, and Linux.

  • Finally, R covers the entire workflow from start to finish. If you read this book, you are likely not a dedicated programmer, but hold a substantive interest in a topic or specific data source that you want to work with. In that case, learning another language will not pay off, but rather prevent you from working on your research. An example of a common research process is displayed in Figure 1. It is characterized by permanent switching between programs. If you need to make corrections to the data collection process, you have to climb back down the entire ladder. The research process using R, as presented in this book, takes place within a single software environment (Figure 2). In the context of web scraping and text processing, this means that you do not have to learn another programming language for the task. What you will need to learn are some basics of the markup languages HTML and XML and the logic of regular expressions and XPath, but these operations are executed from within R.

Figure 1 The research process not using R—stylized example

Figure 2 The research process using R—stylized example
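To make the single-environment workflow concrete, here is a minimal sketch of retrieving, parsing, and querying a web document without leaving R, using the RCurl and XML packages that feature throughout the book. The URL and the XPath expression are placeholders of our own choosing, not examples from the text:

R> # download, parse, and query a page entirely from within R
R> library(RCurl)
R> library(XML)
R> url <- "http://www.r-datacollection.com/materials/html/fortunes.html"
R> html <- getURL(url)                       # retrieve the raw HTML via HTTP
R> parsed <- htmlParse(html, asText = TRUE)  # parse it into a queryable DOM tree
R> xpathSApply(parsed, "//h1", xmlValue)     # extract all first-level headings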

Recommended reading to get started with R

There are many well-written books on the market that provide great introductions to R. Among these, we find the following especially helpful:

Crawley, Michael J. 2012. The R Book, 2nd edition. Hoboken, NJ: John Wiley & Sons.

Adler, Joseph. 2009. R in a Nutshell. A Desktop Quick Reference. Sebastopol, CA: O'Reilly.

Teetor, Paul. 2011. R Cookbook. Sebastopol, CA: O'Reilly.

Besides these commercial sources, there is also a lot of free information on the Web. A truly amazing online tutorial for absolute beginners by the Code School is made available at http://tryr.codeschool.com/. Additionally, Quick-R (http://www.statmethods.net/) is a good reference site for many basic commands. Lastly, you can also find a lot of free resources and examples at http://www.ats.ucla.edu/stat/r/.

R is an ever-growing software environment, and to keep track of developments you might like to periodically visit some of the following websites: PlanetR (http://planetr.stderr.org/) provides the history of existing packages and occasionally some interesting applications. R-Bloggers (http://www.r-bloggers.com/) is a blog aggregator that collects entries from many R-related blogs in various fields. It offers a broad view of hundreds of R applications from economics to biology to geography, mostly accompanied by the code needed to replicate the posts. R-Bloggers even features some basic examples that deal with automated data collection.

When you run into problems, R's help files are sometimes not too helpful. It is often more enlightening to look for help in online forums like Stack Overflow (http://stackoverflow.com) or other sites from the Stack Exchange network. For complex problems, consider the R experts on GitHub (http://github.com). Also note that there are many Special Interest Group (SIG) mailing lists (http://www.r-project.org/mail.html) on a variety of topics, and even local R user groups all around the world (http://blog.revolutionanalytics.com/local-r-groups.html). Finally, a CRAN Task View has been set up that gives a nice overview of recent advances in web technologies and services in the R framework: http://cran.r-project.org/web/views/WebTechnologies.html

Typographic conventions

This is a practical book about coding, and we expect you to often have it sitting somewhere next to the keyboard. We want to facilitate orientation throughout the book with the following conventions: There are three indices—one for general topics, one for R packages, and one for R functions. Within the text, variables, R (and other) code, and functions are set in typewriter typeface, as in summary(). Actual R code is also set in typewriter style and indented. Code input is indicated with a prompt symbol (“R>”); R output is printed without the prompt, as in

R> hello <- "hello, world"
R> hello
[1] "hello, world"

The book's website

The website that accompanies this book can be found at http://www.r-datacollection.com

Among other things, the site provides code from examples and case studies. This means that you do not have to manually copy the code from the book, but can directly access and modify the corresponding R files. You will also find solutions to some of the exercises, as well as a list of errata. If you find any errors, please do not hesitate to let us know.

Disclaimer

This is not a book about spidering the Web. Spiders are programs that graze the Web for information, rapidly jumping from one page to another and often grabbing the entire page content. If you want to follow in the footsteps of Google's Googlebot, you probably hold the wrong book in your hands. The techniques we introduce in this book are meant to serve more specific and gentler purposes, that is, scraping specific information from specific websites. In the end, you are responsible for what you do with what you learn. It is frequently not a big leap from the code that is presented in this book to programs that might quickly annoy website administrators. So here is some fundamental advice on how to behave as a practitioner of web data collection:

Always keep in mind where your data comes from and, whenever possible, give credit to those who originally collected and published it.1

Do not violate copyrights if you plan to republish data you found on the Web. If the information was not collected by yourself, chances are that you need permission from the owners to reproduce it.

Do not do anything illegal! To get an idea of what you can and cannot do in your data collection, check out the Justia BlawgSearch (http://blawgsearch.justia.com/), a search site for legal blogs. Looking for entries tagged “web scraping” might help you keep up to date on legal developments and recent verdicts. The Electronic Frontier Foundation (http://www.eff.org/) was founded as early as 1990 to defend the digital rights of consumers and the public. We hope, however, that you will never have to rely on their help.

We offer some more detailed recommendations on how to behave when scraping content from the Web in Section 9.3.3.
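As a small foretaste of those recommendations, the following sketch (our own illustration, not code from the book) retrieves a site's robots.txt file for manual inspection before any scraping takes place; the URL is a placeholder:

R> # fetch and display a site's robots.txt before scraping it
R> robots_txt <- readLines(url("http://www.example.com/robots.txt"), warn = FALSE)
R> cat(robots_txt, sep = "\n")  # look for User-agent and Disallow rules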

Acknowledgments

Many people helped to make this project possible. We would like to take the opportunity to express our gratitude to them. First of all, we would like to say thanks to Peter Selb to whom we owe the idea of creating a course on alternative data collection. It is due to his impulse that we started to assemble our somewhat haphazard experiences in a comprehensive volume. We are also grateful to several people who have provided invaluable feedback on parts of the book. Most importantly we thank Christian Breunig, Holger Döring, Daniel Eckert, Johannes Kleibl, Philip Leifeld, and Nils Weidmann, whose advice has greatly improved the material. We also thank Kathryn Uhrig for proofreading the manuscript.

Early versions of the book were used in two courses on “Alternative data collection methods” and “Data collection in the World Wide Web” that took place in the summer terms of 2012 and 2013 at the University of Konstanz. We are grateful to the students for their comments—and for their patience with the topic, with R, and with outrageous regular expressions. We would also like to thank the participants of the workshops “Facilitating empirical research on political reforms: Automating data collection in R,” held in Mannheim in December 2012, and “Automating online data collection in R,” which took place in Zurich in April 2013. We thank Bruno Wüest in particular for his assistance in making the Zurich workshop possible, and Fabrizio Gilardi for his support.

It turns out that writing a volume on automating data collection is a surprisingly time-consuming endeavor. We all embarked on this project during our doctoral studies and devoted a lot of time to learning the intricacies of web scraping that could have been spent on the tasks we signed up for. We would like to thank our supervisors Peter Selb, Daniel Bochsler, Ulrich Sieberer, and Thomas Gschwend for their patience and support for our various detours. Christian Rubba is grateful for generous funding by the Swiss National Science Foundation (Grant Number 137805).

We would like to acknowledge that we are heavily indebted to the creators and maintainers of the numerous packages that are applied throughout this volume. Their continuous efforts have opened the door for new ways of scholarly research—and have provided access to vast sources of data to individual researchers. While we cannot possibly hope to mention all the package developers in these paragraphs, we would like to express our gratitude to Duncan Temple Lang and Hadley Wickham for their exceptional work. We would also like to acknowledge the work of Yihui Xie, whose knitr package was crucial in typesetting this book.

We are grateful for the help extended by our publisher, particularly by Heather Kay, Debbie Jupe, Jo Taylor, Richard Davies, Baljinder Kaur, and others who were responsible for proofreading and formatting and who provided support at various stages of the writing process.

Finally, we happily acknowledge the great support we received from our friends and families. We owe special and heartfelt thanks to: Karima Bousbah, Johanna Flock, Hans-Holger Friedrich, Dirk Heinecke, Stefanie Klingler, Kristin Lindemann, Verena Mack, and Alice Mohr.

Simon Munzert, Christian Rubba, Peter Meißner, Dominic Nyhuis

Note

1. To lead by example, we owe some of the suggestions to Hemenway and Calishain's (2003) Spidering Hacks (Hack #6).

1 Introduction

Are you ready for your first encounter with web scraping? Let us start with a small example that you can recreate directly on your machine, provided you have R installed. The case study gives a first impression of the book's central themes.
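To give a flavor of what follows, the case study gathers data on UNESCO World Heritage Sites in danger from Wikipedia (see Section 1.1). A hedged sketch of its first step, importing an HTML table into an R data frame, might look as follows; the URL and the table index are assumptions that may have changed since publication:

R> # import the Wikipedia table of endangered World Heritage Sites
R> library(RCurl)
R> library(XML)
R> heritage_url <- "https://en.wikipedia.org/wiki/List_of_World_Heritage_in_Danger"
R> html <- getURL(heritage_url)
R> tables <- readHTMLTable(html, stringsAsFactors = FALSE)  # parse all tables on the page
R> danger_table <- tables[[2]]  # position of the relevant table is a guess; inspect 'tables'
R> head(danger_table)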

Continue reading in the full edition!
