Python Web Scraping Cookbook

Michael Heydt
Description

Untangle your web scraping complexities and access web data with ease using Python scripts

Key Features

  • Hands-on recipes for advancing your web scraping skills to expert level
  • One-stop solution guide to address complex and challenging web scraping tasks using Python
  • Understand web page structures and collect data from a website with ease

Book Description

Python Web Scraping Cookbook is a solution-focused book that will teach you techniques to develop high-performance scrapers and deal with cookies, hidden form fields, AJAX-based sites, and proxies. You'll explore a number of real-world scenarios where every part of the development or product life cycle will be fully covered. You will not only develop the skills to design reliable, high-performing data flows, but also deploy your codebase to Amazon Web Services (AWS). If you are involved in software engineering, product development, or data mining, or in building data-driven products, you will find this book useful, as each recipe has a clear purpose and objective.

Right from extracting data from websites to writing a sophisticated web crawler, the book's independent recipes will be extremely helpful while on the job. This book covers Python libraries such as Requests and Beautiful Soup. You will learn about crawling, web spidering, working with AJAX websites, and paginated items. You will also learn how to tackle problems such as 403 errors, working with proxies, scraping images, and using lxml.

By the end of this book, you will be able to scrape websites more efficiently and deploy and operate your scraper in the cloud.

What you will learn

  • Use a variety of tools, including Scrapy and Selenium, to scrape websites and their data
  • Master expression languages, such as XPath and CSS, and regular expressions to extract web data
  • Deal with scraping traps such as hidden form fields, throttling, pagination, and different status codes
  • Build robust scraping pipelines with SQS and RabbitMQ
  • Scrape assets such as image media and learn what to do when your scraper fails to run
  • Explore ETL techniques for building a customized crawler and parser, and for converting structured and unstructured data from websites
  • Deploy and run your scraper as a service in AWS Elastic Container Service

Who this book is for

This book is ideal for Python programmers, web administrators, security professionals, and anyone who wants to perform web analytics. Familiarity with Python and a basic understanding of web scraping will help you make the best of this book.





Python Web Scraping Cookbook

 

 

Over 90 proven recipes to get you scraping with Python, microservices, Docker, and AWS

 

 

 

 

 

Michael Heydt

 

 

 

 

BIRMINGHAM - MUMBAI

Python Web Scraping Cookbook

Copyright © 2018 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author(s), nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Commissioning Editor: Veena Pagare
Acquisition Editor: Tushar Gupta
Content Development Editor: Tejas Limkar
Technical Editor: Danish Shaikh
Copy Editor: Safis Editing
Project Coordinator: Manthan Patel
Proofreader: Safis Editing
Indexer: Rekha Nair
Graphics: Tania Dutta
Production Coordinator: Shraddha Falebhai

First published: February 2018

Production reference: 1070218

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.

ISBN 978-1-78728-521-7

www.packtpub.com

Contributors

About the author

Michael Heydt is an independent consultant specializing in social, mobile, analytics, and cloud technologies, with an emphasis on cloud native 12-factor applications. Michael has been a software developer and trainer for over 30 years and is the author of books such as D3.js By Example, Learning Pandas, Mastering Pandas for Finance, and Instant Lucene.NET. You can find more information about him on LinkedIn at michaelheydt.

I would like to greatly thank my family for putting up with me disappearing for months on end and sacrificing my sparse free time to indulge in creation of content and books like this one.  They are my true inspiration and enablers.

About the reviewers

Mei Lu is the founder and CEO of Jobfully, providing career coaching for software developers and engineering leaders. She is also a Career/Executive Coach for Carnegie Mellon University Alumni Association, specializing in the software / high-tech industry. Previously, Mei was a software engineer and an engineering manager at Qpass, M.I.T., and MicroStrategy. She received her MS in Computer Science from the University of Pennsylvania and her MS in Engineering from Carnegie Mellon University.

 

Lazar Telebak is a freelance web developer specializing in web scraping, crawling, and indexing web pages using Python libraries/frameworks. He has worked mostly on projects involving automation, website scraping, crawling, and exporting data in various formats (CSV, JSON, XML, and TXT) and databases (MongoDB, SQLAlchemy, and Postgres). Lazar also has experience with frontend technologies and languages such as HTML, CSS, JavaScript, and jQuery.

 

 

Packt is searching for authors like you

If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.

mapt.io

Mapt is an online digital library that gives you full access to over 5,000 books and videos, as well as industry leading tools to help you plan your personal development and advance your career. For more information, please visit our website.

Why subscribe?

Spend less time learning and more time coding with practical eBooks and Videos from over 4,000 industry professionals

Improve your learning with Skill Plans built especially for you

Get a free eBook or video every month

Mapt is fully searchable

Copy and paste, print, and bookmark content

PacktPub.com

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.

Table of Contents

Title Page

Copyright and Credits

Python Web Scraping Cookbook

Contributors

About the author

About the reviewers

Packt is searching for authors like you

Packt Upsell

Why subscribe?

PacktPub.com

Preface

Who this book is for

What this book covers

To get the most out of this book

Download the example code files

Conventions used

Get in touch

Reviews

Getting Started with Scraping

Introduction

Setting up a Python development environment 

Getting ready

How to do it...

Scraping Python.org with Requests and Beautiful Soup

Getting ready...

How to do it...

How it works...

Scraping Python.org in urllib3 and Beautiful Soup

Getting ready...

How to do it...

How it works

There's more...

Scraping Python.org with Scrapy

Getting ready...

How to do it...

How it works

Scraping Python.org with Selenium and PhantomJS

Getting ready

How to do it...

How it works

There's more...

Data Acquisition and Extraction

Introduction

How to parse websites and navigate the DOM using BeautifulSoup

Getting ready

How to do it...

How it works

There's more...

Searching the DOM with Beautiful Soup's find methods

Getting ready

How to do it...

Querying the DOM with XPath and lxml

Getting ready

How to do it...

How it works

There's more...

Querying data with XPath and CSS selectors

Getting ready

How to do it...

How it works

There's more...

Using Scrapy selectors

Getting ready

How to do it...

How it works

There's more...

Loading data in unicode / UTF-8

Getting ready

How to do it...

How it works

There's more...

Processing Data

Introduction

Working with CSV and JSON data

Getting ready

How to do it

How it works

There's more...

Storing data using AWS S3

Getting ready

How to do it

How it works

There's more...

Storing data using MySQL

Getting ready

How to do it

How it works

There's more...

Storing data using PostgreSQL

Getting ready

How to do it

How it works

There's more...

Storing data in Elasticsearch

Getting ready

How to do it

How it works

There's more...

How to build robust ETL pipelines with AWS SQS

Getting ready

How to do it - posting messages to an AWS queue

How it works

How to do it - reading and processing messages

How it works

There's more...

Working with Images, Audio, and other Assets

Introduction

Downloading media content from the web

Getting ready

How to do it

How it works

There's more...

 Parsing a URL with urllib to get the filename

Getting ready

How to do it

How it works

There's more...

Determining the type of content for a URL 

Getting ready

How to do it

How it works

There's more...

Determining the file extension from a content type

Getting ready

How to do it

How it works

There's more...

Downloading and saving images to the local file system

How to do it

How it works

There's more...

Downloading and saving images to S3

Getting ready

How to do it

How it works

There's more...

 Generating thumbnails for images

Getting ready

How to do it

How it works

Taking a screenshot of a website

Getting ready

How to do it

How it works

Taking a screenshot of a website with an external service

Getting ready

How to do it

How it works

There's more...

Performing OCR on an image with pytesseract

Getting ready

How to do it

How it works

There's more...

Creating a Video Thumbnail

Getting ready

How to do it

How it works

There's more...

Ripping an MP4 video to an MP3

Getting ready

How to do it

There's more...

Scraping - Code of Conduct

Introduction

Scraping legality and scraping politely

Getting ready

How to do it

Respecting robots.txt

Getting ready

How to do it

How it works

There's more...

Crawling using the sitemap

Getting ready

How to do it

How it works

There's more...

Crawling with delays

Getting ready

How to do it

How it works

There's more...

Using identifiable user agents 

How to do it

How it works

There's more...

Setting the number of concurrent requests per domain

How it works

Using auto throttling

How to do it

How it works

There's more...

Using an HTTP cache for development

How to do it

How it works

There's more...

Scraping Challenges and Solutions

Introduction

Retrying failed page downloads

How to do it

How it works

Supporting page redirects

How to do it

How it works

Waiting for content to be available in Selenium

How to do it

How it works

Limiting crawling to a single domain

How to do it

How it works

Processing infinitely scrolling pages

Getting ready

How to do it

How it works

There's more...

Controlling the depth of a crawl

How to do it

How it works

Controlling the length of a crawl

How to do it

How it works

Handling paginated websites

Getting ready

How to do it

How it works

There's more...

Handling forms and forms-based authorization

Getting ready

How to do it

How it works

There's more...

Handling basic authorization

How to do it

How it works

There's more...

Preventing bans by scraping via proxies

Getting ready

How to do it

How it works

Randomizing user agents

How to do it

Caching responses

How to do it

There's more...

Text Wrangling and Analysis

Introduction

Installing NLTK

How to do it

Performing sentence splitting

How to do it

There's more...

Performing tokenization

How to do it

Performing stemming

How to do it

Performing lemmatization

How to do it

Determining and removing stop words

How to do it

There's more...

Calculating the frequency distributions of words

How to do it

There's more...

Identifying and removing rare words

How to do it

Identifying and removing short words

How to do it

Removing punctuation marks

How to do it

There's more...

Piecing together n-grams

How to do it

There's more...

Scraping a job listing from StackOverflow 

Getting ready

How to do it

There's more...

Reading and cleaning the description in the job listing

Getting ready

How to do it...

Searching, Mining and Visualizing Data

Introduction

Geocoding an IP address

Getting ready

How to do it

How to collect IP addresses of Wikipedia edits

Getting ready

How to do it

How it works

There's more...

Visualizing contributor location frequency on Wikipedia

How to do it

Creating a word cloud from a StackOverflow job listing

Getting ready

How to do it

Crawling links on Wikipedia

Getting ready

How to do it

How it works

There's more...

Visualizing page relationships on Wikipedia

Getting ready

How to do it

How it works

There's more...

Calculating degrees of separation

How to do it

How it works

There's more...

Creating a Simple Data API

Introduction

Creating a REST API with Flask-RESTful

Getting ready

How to do it

How it works

There's more...

Integrating the REST API with scraping code

Getting ready

How to do it

Adding an API to find the skills for a job listing

Getting ready

How to do it

Storing data in Elasticsearch as the result of a scraping request

Getting ready

How to do it

How it works

There's more...

Checking Elasticsearch for a listing before scraping

How to do it

There's more...

Creating Scraper Microservices with Docker

Introduction

Installing Docker

Getting ready

How to do it

Installing a RabbitMQ container from Docker Hub

Getting ready

How to do it

Running a Docker container (RabbitMQ)

Getting ready

How to do it

There's more...

Creating and running an Elasticsearch container

How to do it

Stopping/restarting a container and removing the image

How to do it

There's more...

Creating a generic microservice with Nameko

Getting ready

How to do it

How it works

There's more...

Creating a scraping microservice

How to do it

There's more...

Creating a scraper container

Getting ready

How to do it

How it works

Creating an API container

Getting ready

How to do it

There's more...

Composing and running the scraper locally with docker-compose

Getting ready

How to do it

There's more...

Making the Scraper as a Service Real

Introduction

Creating and configuring an Elastic Cloud trial account

How to do it

Accessing the Elastic Cloud cluster with curl

How to do it

Connecting to the Elastic Cloud cluster with Python

Getting ready

How to do it

There's more...

Performing an Elasticsearch query with the Python API 

Getting ready

How to do it

There's more...

Using Elasticsearch to query for jobs with specific skills

Getting ready

How to do it

Modifying the API to search for jobs by skill

How to do it

How it works

There's more...

Storing configuration in the environment 

How to do it

Creating an AWS IAM user and a key pair for ECS

Getting ready

How to do it

Configuring Docker to authenticate with ECR

Getting ready

How to do it

Pushing containers into ECR

Getting ready

How to do it

Creating an ECS cluster

How to do it

Creating a task to run our containers

Getting ready

How to do it

How it works

Starting and accessing the containers in AWS

Getting ready

How to do it

There's more...

Other Books You May Enjoy

Leave a review - let other readers know what you think

Preface

The internet contains a wealth of data. This data is provided both through structured APIs and as content delivered directly through websites. While the data in APIs is highly structured, information found in web pages is often unstructured and requires collection, extraction, and processing to be of value. And collecting data is just the start of the journey, as that data must also be stored, mined, and then exposed to others in a value-added form.

With this book, you will learn many of the core tasks needed in collecting various forms of information from websites. We will cover how to collect it, how to perform several common data operations (including storage in local and remote databases), how to perform common media-based tasks such as converting images and videos to thumbnails, how to clean unstructured data with NLTK, how to examine several data mining and visualization tools, and finally the core skills in building a microservices-based scraper and API that can, and will, be run on the cloud.

Through a recipe-based approach, we will learn independent techniques to solve specific tasks involved in not only scraping but also data manipulation and management, data mining, visualization, microservices, containers, and cloud operations. These recipes will build skills in a progressive and holistic manner, not only teaching how to perform the fundamentals of scraping but also taking you from the results of scraping to a service offered to others through the cloud. We will be building an actual web-scraper-as-a-service using common tools in the Python, container, and cloud ecosystems.

Who this book is for

This book is for those who want to learn to extract data from websites using the process of scraping and also how to work with various data management tools and cloud services. The coding will require basic skills in the Python programming language.

The book is also for those who wish to learn about a larger ecosystem of tools for retrieving, storing, and searching data, as well as using modern tools and Pythonic libraries to create data APIs and cloud services. You will also use Docker and Amazon Web Services to package and deploy a scraper in the cloud.

What this book covers

Chapter 1, Getting Started with Scraping, introduces several concepts and tools for web scraping. We will examine how to install and do basic tasks with tools such as requests, urllib, BeautifulSoup, Scrapy, PhantomJS and Selenium.

Chapter 2, Data Acquisition and Extraction, is based on an understanding of the structure of HTML and how to find and extract embedded data. We will cover many of the concepts in the DOM and how to find and extract data using BeautifulSoup, XPath, LXML, and CSS selectors. We also briefly examine working with Unicode / UTF8.

Chapter 3, Processing Data, teaches you to load and manipulate data in many formats, and then how to store that data in various data stores (S3, MySQL, PostgreSQL, and Elasticsearch). Data in web pages is represented in various formats, the most common being HTML, JSON, CSV, and XML. We will also examine the use of message queue systems, primarily AWS SQS, to help build robust data processing pipelines.

Chapter 4, Working with Images, Audio and other Assets, examines the means of retrieving multimedia items, storing them locally, and also performing several tasks such as OCR, generating thumbnails, making web page screenshots, audio extraction from videos, and finding all video URLs in a YouTube playlist.

Chapter 5, Scraping – Code of Conduct, covers several concepts involved in the legality of scraping, and practices for performing polite scraping. We will examine tools for processing robots.txt and sitemaps to respect the web host's desire for acceptable behavior. We will also examine the control of several facets of crawling, such as using delays, containing the depth and length of crawls, using user agents, and implementing caching to prevent repeated requests.

Chapter 6, Scraping Challenges and Solutions, covers many of the challenges involved in writing a robust scraper and how to handle many scenarios: pagination, redirects, login forms, keeping the crawler within the same domain, retrying requests upon failure, and handling captchas.

Chapter 7, Text Wrangling and Analysis, examines various tools such as using NLTK for natural language processing and how to remove common noise words and punctuation. We often need to process the textual content of a web page to find information on the page that is part of the text and neither structured/embedded data nor multimedia. This requires knowledge of using various concepts and tools to clean and understand text. 

Chapter 8, Searching, Mining, and Visualizing Data, covers several means of searching for data on the web, storing and organizing data, and deriving results from the identified relationships. We will see how to determine the geographic locations of contributors to Wikipedia, find relationships between actors on IMDB, and find jobs on Stack Overflow that match specific technologies.

Chapter 9, Creating a Simple Data API, teaches us how to create a scraper as a service. We will create a REST API for a scraper using Flask. We will run the scraper as a service behind this API and be able to submit requests to scrape specific pages, in order to dynamically query data from a scrape as well as a local ElasticSearch instance.

Chapter 10, Creating Scraper Microservices with Docker, continues the growth of our scraper as a service by packaging the service and API in a Docker swarm and distributing requests across scrapers via a message queuing system (AWS SQS). We will also cover scaling of scraper instances up and down using Docker swarm tools.

Chapter 11, Making the Scraper as a Service Real, concludes by fleshing out the services created in the previous chapter to add a scraper that pulls together various concepts covered earlier. This scraper can assist in analyzing job posts on StackOverflow to find and compare employers using specified technologies. The service will collect posts and allow a query to find and compare those companies.

To get the most out of this book

The primary tool required for the recipes in this book is a Python 3 interpreter. The recipes have been written using the free version of the Anaconda Python distribution, specifically version 3.6.1. Other Python version 3 distributions should work well but have not been tested.

The code in the recipes will often require the use of various Python libraries. These are all available for installation using pip and accessible using pip install. Wherever required, these installations will be elaborated in the recipes.
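
For example, installing the Requests library (which the early recipes use) is a single command; each recipe lists the exact packages it needs:

pip install requests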

Several recipes require an Amazon AWS account. AWS accounts are available for the first year for free-tier access. The recipes will not require anything more than free-tier services. A new account can be created at https://portal.aws.amazon.com/billing/signup.

Several recipes will utilize Elasticsearch. There is a free, open source version available on GitHub at https://github.com/elastic/elasticsearch, with installation instructions on that page. Elastic.co also offers a fully capable version (also with Kibana and Logstash) hosted on the cloud with a 14-day free trial available at http://info.elastic.co (which we will utilize). There is a version for docker-compose with all x-pack features available at https://github.com/elastic/stack-docker, all of which can be started with a simple docker-compose up command.

Finally, several of the recipes use MySQL and PostgreSQL as database examples and several common clients for those databases. For those recipes, these will need to be installed locally. MySQL Community Server is available at https://dev.mysql.com/downloads/mysql/, and PostgreSQL can be found at https://www.postgresql.org/.

We will also look at creating and using docker containers for several of the recipes. Docker CE is free and is available at https://www.docker.com/community-edition.

Download the example code files

You can download the example code files for this book from your account at www.packtpub.com. If you purchased this book elsewhere, you can visit www.packtpub.com/support and register to have the files emailed directly to you.

You can download the code files by following these steps:

1. Log in or register at www.packtpub.com.
2. Select the SUPPORT tab.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box and follow the onscreen instructions.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

  • WinRAR/7-Zip for Windows
  • Zipeg/iZip/UnRarX for Mac
  • 7-Zip/PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Python-Web-Scraping-Cookbook. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Get in touch

Feedback from our readers is always welcome.

General feedback: Email [email protected] and mention the book title in the subject of your message. If you have questions about any aspect of this book, please email us at [email protected].

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/submit-errata, select your book, click on the Errata Submission Form link, and enter the details.

Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Reviews

Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!

For more information about Packt, please visit packtpub.com.

Getting Started with Scraping

In this chapter, we will cover the following topics:

Setting up a Python development environment 

Scraping Python.org with Requests and Beautiful Soup

Scraping Python.org with urllib3 and Beautiful Soup

Scraping Python.org with Scrapy

Scraping Python.org with Selenium and PhantomJS

Introduction

The amount of data available on the web is consistently growing both in quantity and in form.  Businesses require this data to make decisions, particularly with the explosive growth of machine learning tools which require large amounts of data for training.  Much of this data is available via Application Programming Interfaces, but at the same time a lot of valuable data is still only available through the process of web scraping.

This chapter will focus on several fundamentals of setting up a scraping environment and performing basic requests for data with several of the tools of the trade.  Python is the programing language of choice for this book, as well as amongst many who build systems to perform scraping.  It is an easy to use programming language which has a very rich ecosystem of tools for many tasks.  If you program in other languages, you will find it easy to pick up and you may never go back!

Setting up a Python development environment 

If you have not used Python before, it is important to have a working development environment. The recipes in this book are all in Python and are a mix of interactive examples, but primarily implemented as scripts to be interpreted by the Python interpreter. This recipe will show you how to set up an isolated development environment with virtualenv and manage project dependencies with pip. We also get the code for the book and install it into the Python virtual environment.

Getting ready

We will exclusively be using Python 3.x, specifically 3.6.1 in my case. Mac and Linux normally have Python version 2 installed, while Windows systems do not, so it is likely that Python 3 will need to be installed in any case. You can find references for Python installers at www.python.org.

You can check Python's version with python --version
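
On the setup used for this book, that check would report something like the following (the exact string depends on your distribution):

$ python --version
Python 3.6.1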

pip comes installed with Python 3.x, so we will omit instructions on its installation.  Additionally, all command line examples in this book are run on a Mac.  For Linux users the commands should be identical.  On Windows, there are alternate commands (like dir instead of ls), but these alternatives will not be covered.

How to do it...

We will be installing a number of packages with pip. These packages are installed into a Python environment. There can often be version conflicts with other packages, so a good practice for following along with the recipes in the book is to create a new virtual Python environment, where the packages we will use are ensured to work properly.

Virtual Python environments are managed with the virtualenv tool.  This can be installed with the following command:

~ $ pip install virtualenv
Collecting virtualenv
  Using cached virtualenv-15.1.0-py2.py3-none-any.whl
Installing collected packages: virtualenv
Successfully installed virtualenv-15.1.0

Now we can use virtualenv. But before that, let's briefly look at pip. This command installs Python packages from PyPI, a package repository with tens of thousands of packages. We just saw the install subcommand, which ensures a package is installed. We can also see all currently installed packages with pip list:

~ $ pip list
alabaster (0.7.9)
amqp (1.4.9)
anaconda-client (1.6.0)
anaconda-navigator (1.5.3)
anaconda-project (0.4.1)
aniso8601 (1.3.0)

I've truncated the output to the first few lines, as there are quite a few; for me, there are 222 packages installed.

Packages can also be uninstalled using pip uninstall followed by the package name. I'll leave it to you to give it a try.

Now back to virtualenv. Using virtualenv is very simple. Let's use it to create an environment and install the code from GitHub. Let's walk through the steps:

Create a directory to represent the project and enter the directory.

~ $ mkdir pywscb
~ $ cd pywscb

Initialize a virtual environment folder named env:

pywscb $ virtualenv env
Using base prefix '/Users/michaelheydt/anaconda'
New python executable in /Users/michaelheydt/pywscb/env/bin/python
copying /Users/michaelheydt/anaconda/bin/python => /Users/michaelheydt/pywscb/env/bin/python
copying /Users/michaelheydt/anaconda/bin/../lib/libpython3.6m.dylib => /Users/michaelheydt/pywscb/env/lib/libpython3.6m.dylib
Installing setuptools, pip, wheel...done.

This creates an env folder.  Let's take a look at what was installed.

pywscb $ ls -la env
total 8
drwxr-xr-x 6 michaelheydt staff 204 Jan 18 15:38 .
drwxr-xr-x 3 michaelheydt staff 102 Jan 18 15:35 ..
drwxr-xr-x 16 michaelheydt staff 544 Jan 18 15:38 bin
drwxr-xr-x 3 michaelheydt staff 102 Jan 18 15:35 include
drwxr-xr-x 4 michaelheydt staff 136 Jan 18 15:38 lib
-rw-r--r-- 1 michaelheydt staff 60 Jan 18 15:38 pip-selfcheck.json

Now we activate the virtual environment. This command uses the contents of the env folder to configure Python. After this, all Python activities are performed relative to this virtual environment.

pywscb $ source env/bin/activate
(env) pywscb $

We can check that python is indeed using this virtual environment with the following command:

(env) pywscb $ which python
/Users/michaelheydt/pywscb/env/bin/python

With our virtual environment created, let's clone the book's sample code and take a look at its structure.

(env) pywscb $ git clone https://github.com/PacktBooks/PythonWebScrapingCookbook.git
Cloning into 'PythonWebScrapingCookbook'...
remote: Counting objects: 420, done.
remote: Compressing objects: 100% (316/316), done.
remote: Total 420 (delta 164), reused 344 (delta 88), pack-reused 0
Receiving objects: 100% (420/420), 1.15 MiB | 250.00 KiB/s, done.
Resolving deltas: 100% (164/164), done.
Checking connectivity... done.

This created a PythonWebScrapingCookbook directory.  

(env) pywscb $ ls -l
total 0
drwxr-xr-x 9 michaelheydt staff 306 Jan 18 16:21 PythonWebScrapingCookbook
drwxr-xr-x 6 michaelheydt staff 204 Jan 18 15:38 env

Let's change into it and examine the content.

(env) PythonWebScrapingCookbook $ ls -l
total 0
drwxr-xr-x 15 michaelheydt staff 510 Jan 18 16:21 py
drwxr-xr-x 14 michaelheydt staff 476 Jan 18 16:21 www

There are two directories. Most of the Python code is in the py directory. www contains some web content that we will use from time to time with a local web server. Let's look at the contents of the py directory:

(env) py $ ls -l
total 0
drwxr-xr-x 9 michaelheydt staff 306 Jan 18 16:21 01
drwxr-xr-x 25 michaelheydt staff 850 Jan 18 16:21 03
drwxr-xr-x 21 michaelheydt staff 714 Jan 18 16:21 04
drwxr-xr-x 10 michaelheydt staff 340 Jan 18 16:21 05
drwxr-xr-x 14 michaelheydt staff 476 Jan 18 16:21 06
drwxr-xr-x 25 michaelheydt staff 850 Jan 18 16:21 07
drwxr-xr-x 14 michaelheydt staff 476 Jan 18 16:21 08
drwxr-xr-x 7 michaelheydt staff 238 Jan 18 16:21 09
drwxr-xr-x 7 michaelheydt staff 238 Jan 18 16:21 10
drwxr-xr-x 9 michaelheydt staff 306 Jan 18 16:21 11
drwxr-xr-x 8 michaelheydt staff 272 Jan 18 16:21 modules

Code for each chapter is in the numbered folder matching the chapter (there is no code for chapter 2 as it is all interactive Python).

Note that there is a modules folder. Some of the recipes throughout the book use code in those modules. Make sure that your Python path points to this folder. On Mac and Linux, you can set this in your .bash_profile file (and in the environment variables dialog on Windows):

export PYTHONPATH="/users/michaelheydt/dropbox/packt/books/pywebscrcookbook/code/py/modules"
export PYTHONPATH

The contents of each folder generally follow a numbering scheme matching the sequence of the recipes in the chapter. The following is the contents of the chapter 6 folder:

(env) py $ ls -la 06
total 96
drwxr-xr-x 14 michaelheydt staff 476 Jan 18 16:21 .
drwxr-xr-x 14 michaelheydt staff 476 Jan 18 16:26 ..
-rw-r--r-- 1 michaelheydt staff 902 Jan 18 16:21 01_scrapy_retry.py
-rw-r--r-- 1 michaelheydt staff 656 Jan 18 16:21 02_scrapy_redirects.py
-rw-r--r-- 1 michaelheydt staff 1129 Jan 18 16:21 03_scrapy_pagination.py
-rw-r--r-- 1 michaelheydt staff 488 Jan 18 16:21 04_press_and_wait.py
-rw-r--r-- 1 michaelheydt staff 580 Jan 18 16:21 05_allowed_domains.py
-rw-r--r-- 1 michaelheydt staff 826 Jan 18 16:21 06_scrapy_continuous.py
-rw-r--r-- 1 michaelheydt staff 704 Jan 18 16:21 07_scrape_continuous_twitter.py
-rw-r--r-- 1 michaelheydt staff 1409 Jan 18 16:21 08_limit_depth.py
-rw-r--r-- 1 michaelheydt staff 526 Jan 18 16:21 09_limit_length.py
-rw-r--r-- 1 michaelheydt staff 1537 Jan 18 16:21 10_forms_auth.py
-rw-r--r-- 1 michaelheydt staff 597 Jan 18 16:21 11_file_cache.py
-rw-r--r-- 1 michaelheydt staff 1279 Jan 18 16:21 12_parse_differently_based_on_rules.py

In the recipes I'll state that we'll be using the script in <chapter directory>/<recipe filename>.

Congratulations, you've now got a Python environment configured with the book's code!

Now, just to be complete, if you want to get out of the Python virtual environment, you can exit it using the following command:

(env) py $ deactivate
py $

Checking which python again, we can see that it has switched back:

py $ which python
/Users/michaelheydt/anaconda/bin/python

I won't be using the virtual environment for the rest of the book. When you see command prompts, they will be either of the form "<directory> $" or simply "$".

Now let's move onto doing some scraping.

Scraping Python.org with Requests and Beautiful Soup

In this recipe, we will install Requests and Beautiful Soup and scrape some content from www.python.org. We'll install both of the libraries and get some basic familiarity with them; we'll come back to both in subsequent chapters and dive deeper into each.

Getting ready...

In this recipe, we will scrape the upcoming Python events from https://www.python.org/events/pythonevents. The following is an example of the Python.org events page (it changes frequently, so your experience will differ):

We will need to ensure that Requests and Beautiful Soup are installed.  We can do that with the following:

pywscb $ pip install requests
Downloading/unpacking requests
  Downloading requests-2.18.4-py2.py3-none-any.whl (88kB): 88kB downloaded
Downloading/unpacking certifi>=2017.4.17 (from requests)
  Downloading certifi-2018.1.18-py2.py3-none-any.whl (151kB): 151kB downloaded
Downloading/unpacking idna>=2.5,<2.7 (from requests)
  Downloading idna-2.6-py2.py3-none-any.whl (56kB): 56kB downloaded
Downloading/unpacking chardet>=3.0.2,<3.1.0 (from requests)
  Downloading chardet-3.0.4-py2.py3-none-any.whl (133kB): 133kB downloaded
Downloading/unpacking urllib3>=1.21.1,<1.23 (from requests)
  Downloading urllib3-1.22-py2.py3-none-any.whl (132kB): 132kB downloaded
Installing collected packages: requests, certifi, idna, chardet, urllib3
Successfully installed requests certifi idna chardet urllib3
Cleaning up...

pywscb $ pip install bs4
Downloading/unpacking bs4
  Downloading bs4-0.0.1.tar.gz
  Running setup.py (path:/Users/michaelheydt/pywscb/env/build/bs4/setup.py) egg_info for package bs4

How it works...

We will dive into the details of both Requests and Beautiful Soup in the next chapter, but for now let's just summarize a few key points about how this works. The following are the important points about Requests:

Requests is used to execute HTTP requests.  We used it to make a GET verb request of the URL for the events page.

The response object returned by Requests holds the results of the request. This is not only the page content, but also many other items about the result, such as HTTP status codes and headers.

Requests is used only to get the page; it does not do any parsing.

We use Beautiful Soup to do the parsing of the HTML and also the finding of content within the HTML. 

To understand how this worked, the content of the page has the following HTML to start the Upcoming Events section:
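
The recipe's full code listing and the page's actual HTML are not reproduced in this excerpt, so the following is only a minimal sketch of the approach described here. The HTML fragment is a simplified, hypothetical stand-in for the Upcoming Events markup (the real markup on python.org will differ), and the commented-out lines show how the live page would be fetched with Requests:

from bs4 import BeautifulSoup

# Fetching the live page would look roughly like this:
# import requests
# response = requests.get('https://www.python.org/events/pythonevents')
# soup = BeautifulSoup(response.text, 'html.parser')

# Simplified, hypothetical fragment of the Upcoming Events section:
sample_html = """
<ul class="list-recent-events menu">
  <li>
    <h3 class="event-title"><a href="/events/example">PyCon Example 2018</a></h3>
    <p><time datetime="2018-02-01">01 Feb. 2018</time>, Somewhere, Earth</p>
  </li>
</ul>
"""
soup = BeautifulSoup(sample_html, 'html.parser')

# Find the <ul> whose class attribute includes list-recent-events,
# then iterate over its <li> elements, one per event.
events_list = soup.find('ul', class_='list-recent-events')
for li in events_list.find_all('li'):
    print(li.find('a').text, '-', li.find('time').text)

Running this against the sample fragment prints the event name and date; against the live page, the same find/find_all pattern walks the real event list.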

We used the power of Beautiful Soup to:

Find the <ul> element representing the section, which is found by looking for a <ul> with a class attribute that has a value of list-recent-events.

From that object, we find all the <li> elements.

Each of these <li> elements represents an individual event.