Go Web Scraping Quick Start Guide E-Book

Vincent Smith

Publisher: Packt Publishing
Category: Science and New Technologies
Language: English

Description

Learn how some Go-specific language features help to simplify building web scrapers along with common pitfalls and best practices regarding web scraping.

Key Features

Use Go libraries like Goquery and Colly to scrape the web

Common pitfalls and best practices to effectively scrape and crawl

Learn how to scrape using the Go concurrency model

Book Description

Web scraping is the process of extracting information from the web using various tools that perform scraping and crawling. Go is emerging as the language of choice for scraping using a variety of libraries. This book quickly explains how to scrape data from various websites using Go libraries such as Colly and Goquery.

The book starts with an introduction to the use cases for building a web scraper and the main features of the Go programming language, along with setting up a Go environment. It then moves on to HTTP requests and responses and how Go handles them. You will also learn the basics of web scraping etiquette.

You will be taught how to navigate through a website, using a breadth-first and then a depth-first search, as well as find and follow links. You will get to know about the ways to track history in order to avoid loops and to protect your web scraper using proxies.

Finally, the book covers the Go concurrency model and how to run scrapers in parallel, along with large-scale distributed web scraping.

What you will learn

Implement Cache-Control to avoid unnecessary network calls

Coordinate concurrent scrapers

Design a custom, larger-scale scraping system

Scrape basic HTML pages with Colly and JavaScript pages with chromedp

Discover how to search using the "strings" and "regexp" packages

Set up a Go development environment

Retrieve information from an HTML document

Protect your web scraper from being blocked by using proxies

Control web browsers to scrape JavaScript sites

Who this book is for

Data scientists and web developers with a basic knowledge of Golang who want to collect web data and analyze it for effective reporting and visualization.

Details

You can read this e-book in the Legimi apps or in any app that supports the following format:

EPUB

Page count: 142

Publication year: 2019

Sample

Go Web Scraping Quick Start Guide

Implement the power of Go to scrape and crawl data from the web

Vincent Smith

BIRMINGHAM - MUMBAI

Go Web Scraping Quick Start Guide

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Commissioning Editor: Pavan Ramchandani

Acquisition Editor: Aditi Gour

Content Development Editor: Smit Carvalho

Technical Editor: Surabhi Kulkarni

Copy Editor: Safis Editing

Project Coordinator: Pragati Shukla

Proofreader: Safis Editing

Indexer: Mariammal Chettiyar

Graphics: Alishon Mendonsa

Production Coordinator: Jyoti Chauhan

First published: January 2019

Production reference: 1290119

Published by Packt Publishing Ltd. Livery Place 35 Livery Street Birmingham B3 2PB, UK.

ISBN 978-1-78961-570-8

www.packtpub.com

mapt.io

Mapt is an online digital library that gives you full access to over 5,000 books and videos, as well as industry-leading tools to help you plan your personal development and advance your career. For more information, please visit our website.

Why subscribe?

Spend less time learning and more time coding with practical eBooks and Videos from over 4,000 industry professionals

Improve your learning with Skill Plans built especially for you

Get a free eBook or video every month

Mapt is fully searchable

Copy and paste, print, and bookmark content

Packt.com

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.packt.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.

At www.packt.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.

Contributors

About the author

Vincent Smith has been a software engineer for 10 years, having worked in various fields from health and IT to machine learning and large-scale web scrapers. He has worked for both Fortune 500 companies and start-ups, and has sharpened his skills with the best of both worlds. While obtaining a degree in electrical engineering, he learned the foundations of writing good code through his Java courses. These basics helped launch his software development career early on, when he used them to support his team. He fell in love with the process of teaching computers how to behave, which set him on the path he still walks today.

I would like to first and foremost thank my parents and my wife for supporting me in writing this book, and believing that I actually do have something to share. I would like to thank my co-workers, past and present, for being a shining example that impostor syndrome is all in your head and you should always share your knowledge. You were not born with your knowledge, so be the one that someone else can learn from.

About the reviewer

Ladjimi Chiheb Eddine is a professional Python/Django developer with extensive knowledge of Ethereum, Solidity, GoLang, PostgreSQL, and Bitcoin. He is an open source enthusiast who helps people on Stack Overflow and many Q&A forums by answering their questions.

Currently, he resides in Paris, where he works as a senior Python/Django developer.

I would like to thank my family and all those who love me for their support over the years. Without them, I would not have been able to find the strength to continue my work and improve my skills.

Packt is searching for authors like you

If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.

Title Page

Go Web Scraping Quick Start Guide

About Packt

Why subscribe?

Packt.com

Contributors

About the author

About the reviewer

Packt is searching for authors like you

Preface

Who this book is for

What this book covers

To get the most out of this book

Download the example code files

Conventions used

Get in touch

Reviews

Introducing Web Scraping and Go

What is web scraping?

Why do you need a web scraper?

Search engines

Price comparison

Building datasets

What is Go?

Why is Go a good fit for web scraping?

Go is fast

Go is safe

Go is simple

How to set up a Go development environment

Go language and tools

Git

Editor

Summary

The Request/Response Cycle

What do HTTP requests look like?

HTTP request methods

HTTP headers

Query parameters

Request body

What do HTTP responses look like?

Status line

Response headers

Response body

What are HTTP status codes?

100–199 range

200–299 range

300–399 range

400–499 range

500–599 range

What do HTTP requests/responses look like in Go?

A simple request example

Summary

Web Scraping Etiquette

What is a robots.txt file?

What is a User-Agent string?

Example

How to throttle your scraper

How to use caching

Cache-Control

Expires

Etag

Caching content in Go

Summary

Parsing HTML

What is the HTML format?

Syntax

Structure

Searching using the strings package

Example – Counting links

Example – Doctype check

Searching using the regexp package

Example – Finding links

Example – Finding prices

Searching using XPath queries

Example – Daily deals

Example – Collecting products

Searching using Cascading Style Sheets selectors

Example – Daily deals

Example – Collecting products

Summary

Web Scraping Navigation

Following links

Example – Daily deals

Submitting forms

Example – Submitting searches

Example – POST method

Avoiding loops

Breadth-first versus depth-first crawling

Depth-first

Breadth-first

Navigating with JavaScript

Example – Book reviews

Summary

Protecting Your Web Scraper

Virtual private servers

Proxies

Public and shared proxies

Dedicated proxies

Price

Location

Type

Anonymity

Proxies in Go

Virtual private networks

Boundaries

Whitelists

Blacklists

Summary

Scraping with Concurrency

What is concurrency

Concurrency pitfalls

Race conditions

Deadlocks

The Go concurrency model

Goroutines

Channels

sync package helpers

Conditions

Atomic counters

Summary

Scraping at 100x

Components of a web scraping system

Queue

Cache

Storage

Logs

Scraping HTML pages with colly

Scraping JavaScript pages with chrome-protocol

Example – Amazon Daily Deals

Distributed scraping with dataflowkit

The Fetch service

The Parse service

Summary

Other Books You May Enjoy

Leave a review - let other readers know what you think

Preface

The internet is a place full of interesting information and insights just waiting to be gleaned. Much like golden nuggets, these fragmented pieces of data can be collected, filtered, combined, and refined to produce extremely valuable products. Armed with the right knowledge, skills, and a little creativity, you can build a web scraper that can power multi-billion-dollar companies. To support this, you need to use the best tools for the job, starting with a programming language built for speed, simplicity, and safety.

The Go programming language combines the best ideas from its predecessors and cutting-edge ideology, leaving out the unnecessary fluff, to produce a razor-sharp set of tools and clean architecture. With the Go standard library and projects from open source contributors, you have everything you need to build a web scraper of any size.

Who this book is for

This book is for anyone with a little coding experience who is curious about how to build a web scraper that is fast and efficient.

What this book covers

Chapter 1, Introducing Web Scraping and Go, explains what web scraping is and how to install the Go programming language and tools.

Chapter 2, The Request/Response Cycle, outlines the structure of HTTP requests and responses, and explains how to use Go to make and process them.

Chapter 3, Web Scraping Etiquette, explains how to build a web scraper that uses best practices and recommendations for crawling the web efficiently, while respecting others.

Chapter 4, Parsing HTML, shows how to use various tools to parse information from HTML pages.

Chapter 5, Web Scraping Navigation, demonstrates the best ways to navigate websites efficiently.

Chapter 6, Protecting Your Web Scraper, explains how to use various tools to navigate through the internet safely and securely.

Chapter 7, Scraping with Concurrency, introduces the Go concurrency model and explains how to build a productive web scraper.

Chapter 8, Scraping at 100x, provides a blueprint for building a large-scale web scraper and provides some examples from the open source community.

To get the most out of this book

To get the most from this book, you should be familiar with your Terminal or Command Prompt, which you will need to run the examples, and have a good internet connection. Read each chapter, even if you think you already know the material, and keep an open mind about how a web scraper should act so that you can learn the current best practices and proper etiquette. This book covers the Go programming language in a broad sense, including installation, basic commands, the standard library, and package management, but only goes into the depth needed for web scraping, so some familiarity with Go will be helpful.

Download the example code files

You can download the example code files for this book from your account at www.packt.com. If you purchased this book elsewhere, you can visit www.packt.com/support and register to have the files emailed directly to you.

You can download the code files by following these steps:

Log in or register at www.packt.com

Select the SUPPORT tab

Click on Code Downloads & Errata

Enter the name of the book in the Search box and follow the onscreen instructions

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

WinRAR/7-Zip for Windows

Zipeg/iZip/UnRarX for Mac

7-Zip/PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Go-Web-Scraping-Quick-Start-Guide. In case there's an update to the code, it will be updated on the existing GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Conventions used

There are a number of text conventions used throughout this book.

CodeInText: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "This is using the net/http package's default HTTP client to request the index.html resource."

A block of code is set as follows:

POST /login HTTP/1.1
Host: myprotectedsite.com
Content-Type: application/x-www-form-urlencoded
Content-Length: 38

username=myuser&password=supersecretpw

Any command-line input or output is written as follows:

go run main.go

Bold: Indicates a new term, an important word, or words that you see onscreen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: "In this case, you will receive a status code of 500 Internal Server Error."

Warnings or important notes appear like this.

Tips and tricks appear like this.

Get in touch

Feedback from our readers is always welcome.

General feedback: If you have questions about any aspect of this book, mention the book title in the subject of your message and email us at [email protected].

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packt.com/submit-errata, select your book, click on the Errata Submission Form link, and enter the details.

Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Reviews

Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!

For more information about Packt, please visit packt.com.

Introducing Web Scraping and Go

Collecting, parsing, storing, and processing data are essential tasks that almost everybody will need to do in their software development career. Staying on top of emerging technologies that greatly improve the stability, speed, and efficiency of application development is another challenge. To provide insight into how to accomplish both of these goals, I have written this book. Here, you will find a guide for performing web scraping in Go. This book covers a broad perspective on web scraping, from the basics of the Hypertext Transfer Protocol (HTTP) and Hypertext Markup Language (HTML), to building highly concurrent distributed systems.

In this chapter, you will find explanations on the following topics:

What is web scraping?

Why do you need a web scraper?

What is Go?

Why is Go a good fit for web scraping?

How can you set up a Go development environment?

What is web scraping?

Web scraping, at its core, is collecting publicly available information from the internet for a specific purpose. It has taken form under many different names, such as the following:

Spiders

Crawlers

Bots

Although the name may carry a negative connotation, the practice of web scraping has been around since the beginning of the internet and has grown into various technologies and techniques. In fact, some companies have built their entire business on web scraping!
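To make the idea concrete, here is a minimal sketch of the two steps every scraper performs: requesting a page and pulling something out of it. It is an illustration and not code from this book. The names fetch, countLinks, and samplePage are hypothetical, the page content is made up, and a local test server from net/http/httptest stands in for a real website so that the example runs offline. It also assumes Go 1.16 or later for io.ReadAll.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

// samplePage is a made-up HTML document standing in for a real website.
const samplePage = `<html><body>
<a href="/deals">Deals</a>
<a href="/books">Books</a>
</body></html>`

// fetch downloads the body of the page at url using the default HTTP client.
func fetch(url string) (string, error) {
	resp, err := http.Get(url)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("unexpected status: %s", resp.Status)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return string(body), nil
}

// countLinks does a naive count of opening anchor tags in the page.
// A real scraper would use an HTML parser instead of substring matching.
func countLinks(html string) int {
	return strings.Count(html, "<a ")
}

func main() {
	// Hypothetical local server so the sketch does not touch the network.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, samplePage)
	}))
	defer srv.Close()

	page, err := fetch(srv.URL)
	if err != nil {
		fmt.Println("fetch failed:", err)
		return
	}
	fmt.Println("links found:", countLinks(page))
}
```

Saved as main.go and run with go run main.go, this prints links found: 2. Pointing fetch at a real URL instead of the test server would turn it into a very small scraper, at which point the etiquette covered later in the book starts to matter.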

Why do you need a web scraper?

There are many different use cases where you might need to build a web scraper. All cases center around the fact that information on the internet is often disparate, but can be very valuable when collected into one single package. Often, in these cases, the person collecting the information does not have a working or business relationship with the producers of the data, meaning they cannot request the information to be packaged and delivered to them. Because of the lack of this relationship, the one who needs the data has to rely on their own means to gather the information.