Untangle your web scraping complexities and access web data with ease using Python scripts
Python Web Scraping Cookbook is a solution-focused book that will teach you techniques to develop high-performance scrapers and deal with cookies, hidden form fields, Ajax-based sites, and proxies. You'll explore a number of real-world scenarios in which every part of the development or product life cycle is fully covered. You will not only develop the skills to design reliable, high-performing data flows, but also deploy your codebase to Amazon Web Services (AWS). If you are involved in software engineering, product development, or data mining, or are building data-driven products, you will find this book useful, as each recipe has a clear purpose and objective.
Right from extracting data from websites to writing a sophisticated web crawler, the book's independent recipes will be extremely helpful on the job. This book covers Python libraries such as Requests and Beautiful Soup. You will learn about crawling, web spidering, working with AJAX websites, and handling paginated items. You will also learn how to tackle problems such as 403 errors, working with proxies, scraping images, and using LXML.
By the end of this book, you will be able to scrape websites more efficiently and deploy and operate your scraper in the cloud.
This book is ideal for Python programmers, web administrators, security professionals, and anyone who wants to perform web analytics. Familiarity with Python and basic understanding of web scraping will be useful to make the best of this book.
Copyright © 2018 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author(s), nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Commissioning Editor: Veena Pagare
Acquisition Editor: Tushar Gupta
Content Development Editor: Tejas Limkar
Technical Editor: Danish Shaikh
Copy Editor: Safis Editing
Project Coordinator: Manthan Patel
Proofreader: Safis Editing
Indexer: Rekha Nair
Graphics: Tania Dutta
Production Coordinator: Shraddha Falebhai
First published: February 2018
Production reference: 1070218
Published by Packt Publishing Ltd. Livery Place 35 Livery Street Birmingham B3 2PB, UK.
ISBN 978-1-78728-521-7
www.packtpub.com
Michael Heydt is an independent consultant specializing in social, mobile, analytics, and cloud technologies, with an emphasis on cloud native 12-factor applications. Michael has been a software developer and trainer for over 30 years and is the author of books such as D3.js By Example, Learning Pandas, Mastering Pandas for Finance, and Instant Lucene.NET. You can find more information about him on LinkedIn at michaelheydt.
Mei Lu is the founder and CEO of Jobfully, providing career coaching for software developers and engineering leaders. She is also a Career/Executive Coach for Carnegie Mellon University Alumni Association, specializing in the software / high-tech industry. Previously, Mei was a software engineer and an engineering manager at Qpass, M.I.T., and MicroStrategy. She received her MS in Computer Science from the University of Pennsylvania and her MS in Engineering from Carnegie Mellon University.
Lazar Telebak is a freelance web developer specializing in web scraping, crawling, and indexing web pages using Python libraries and frameworks. He has worked mostly on projects involving automation, website scraping, crawling, and exporting data to various formats (CSV, JSON, XML, and TXT) and databases (MongoDB, SQLAlchemy, and Postgres). Lazar also has experience with frontend technologies and languages such as HTML, CSS, JavaScript, and jQuery.
If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.
Mapt is an online digital library that gives you full access to over 5,000 books and videos, as well as industry leading tools to help you plan your personal development and advance your career. For more information, please visit our website.
Spend less time learning and more time coding with practical eBooks and Videos from over 4,000 industry professionals
Improve your learning with Skill Plans built especially for you
Get a free eBook or video every month
Mapt is fully searchable
Copy and paste, print, and bookmark content
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.
Title Page
Copyright and Credits
Python Web Scraping Cookbook
Contributors
About the author
About the reviewers
Packt is searching for authors like you
Packt Upsell
Why subscribe?
PacktPub.com
Preface
Who this book is for
What this book covers
To get the most out of this book
Download the example code files
Conventions used
Get in touch
Reviews
Getting Started with Scraping
Introduction
Setting up a Python development environment 
Getting ready
How to do it...
Scraping Python.org with Requests and Beautiful Soup
Getting ready...
How to do it...
How it works...
Scraping Python.org with urllib3 and Beautiful Soup
Getting ready...
How to do it...
How it works
There's more...
Scraping Python.org with Scrapy
Getting ready...
How to do it...
How it works
Scraping Python.org with Selenium and PhantomJS
Getting ready
How to do it...
How it works
There's more...
Data Acquisition and Extraction
Introduction
How to parse websites and navigate the DOM using BeautifulSoup
Getting ready
How to do it...
How it works
There's more...
Searching the DOM with Beautiful Soup's find methods
Getting ready
How to do it...
Querying the DOM with XPath and lxml
Getting ready
How to do it...
How it works
There's more...
Querying data with XPath and CSS selectors
Getting ready
How to do it...
How it works
There's more...
Using Scrapy selectors
Getting ready
How to do it...
How it works
There's more...
Loading data in unicode / UTF-8
Getting ready
How to do it...
How it works
There's more...
Processing Data
Introduction
Working with CSV and JSON data
Getting ready
How to do it
How it works
There's more...
Storing data using AWS S3
Getting ready
How to do it
How it works
There's more...
Storing data using MySQL
Getting ready
How to do it
How it works
There's more...
Storing data using PostgreSQL
Getting ready
How to do it
How it works
There's more...
Storing data in Elasticsearch
Getting ready
How to do it
How it works
There's more...
How to build robust ETL pipelines with AWS SQS
Getting ready
How to do it - posting messages to an AWS queue
How it works
How to do it - reading and processing messages
How it works
There's more...
Working with Images, Audio, and other Assets
Introduction
Downloading media content from the web
Getting ready
How to do it
How it works
There's more...
 Parsing a URL with urllib to get the filename
Getting ready
How to do it
How it works
There's more...
Determining the type of content for a URL 
Getting ready
How to do it
How it works
There's more...
Determining the file extension from a content type
Getting ready
How to do it
How it works
There's more...
Downloading and saving images to the local file system
How to do it
How it works
There's more...
Downloading and saving images to S3
Getting ready
How to do it
How it works
There's more...
 Generating thumbnails for images
Getting ready
How to do it
How it works
Taking a screenshot of a website
Getting ready
How to do it
How it works
Taking a screenshot of a website with an external service
Getting ready
How to do it
How it works
There's more...
Performing OCR on an image with pytesseract
Getting ready
How to do it
How it works
There's more...
Creating a Video Thumbnail
Getting ready
How to do it
How it works
There's more...
Ripping an MP4 video to an MP3
Getting ready
How to do it
There's more...
Scraping - Code of Conduct
Introduction
Scraping legality and scraping politely
Getting ready
How to do it
Respecting robots.txt
Getting ready
How to do it
How it works
There's more...
Crawling using the sitemap
Getting ready
How to do it
How it works
There's more...
Crawling with delays
Getting ready
How to do it
How it works
There's more...
Using identifiable user agents 
How to do it
How it works
There's more...
Setting the number of concurrent requests per domain
How it works
Using auto throttling
How to do it
How it works
There's more...
Using an HTTP cache for development
How to do it
How it works
There's more...
Scraping Challenges and Solutions
Introduction
Retrying failed page downloads
How to do it
How it works
Supporting page redirects
How to do it
How it works
Waiting for content to be available in Selenium
How to do it
How it works
Limiting crawling to a single domain
How to do it
How it works
Processing infinitely scrolling pages
Getting ready
How to do it
How it works
There's more...
Controlling the depth of a crawl
How to do it
How it works
Controlling the length of a crawl
How to do it
How it works
Handling paginated websites
Getting ready
How to do it
How it works
There's more...
Handling forms and forms-based authorization
Getting ready
How to do it
How it works
There's more...
Handling basic authorization
How to do it
How it works
There's more...
Preventing bans by scraping via proxies
Getting ready
How to do it
How it works
Randomizing user agents
How to do it
Caching responses
How to do it
There's more...
Text Wrangling and Analysis
Introduction
Installing NLTK
How to do it
Performing sentence splitting
How to do it
There's more...
Performing tokenization
How to do it
Performing stemming
How to do it
Performing lemmatization
How to do it
Determining and removing stop words
How to do it
There's more...
Calculating the frequency distributions of words
How to do it
There's more...
Identifying and removing rare words
How to do it
Removing punctuation marks
How to do it
There's more...
Piecing together n-grams
How to do it
There's more...
Scraping a job listing from StackOverflow 
Getting ready
How to do it
There's more...
Reading and cleaning the description in the job listing
Getting ready
How to do it...
Searching, Mining, and Visualizing Data
Introduction
Geocoding an IP address
Getting ready
How to do it
How to collect IP addresses of Wikipedia edits
Getting ready
How to do it
How it works
There's more...
Visualizing contributor location frequency on Wikipedia
How to do it
Creating a word cloud from a StackOverflow job listing
Getting ready
How to do it
Crawling links on Wikipedia
Getting ready
How to do it
How it works
There's more...
Visualizing page relationships on Wikipedia
Getting ready
How to do it
How it works
There's more...
Calculating degrees of separation
How to do it
How it works
There's more...
Creating a Simple Data API
Introduction
Creating a REST API with Flask-RESTful
Getting ready
How to do it
How it works
There's more...
Integrating the REST API with scraping code
Getting ready
How to do it
Adding an API to find the skills for a job listing
Getting ready
How to do it
Storing data in Elasticsearch as the result of a scraping request
Getting ready
How to do it
How it works
There's more...
Checking Elasticsearch for a listing before scraping
How to do it
There's more...
Creating Scraper Microservices with Docker
Introduction
Installing Docker
Getting ready
How to do it
Installing a RabbitMQ container from Docker Hub
Getting ready
How to do it
Running a Docker container (RabbitMQ)
Getting ready
How to do it
There's more...
Creating and running an Elasticsearch container
How to do it
Stopping/restarting a container and removing the image
How to do it
There's more...
Creating a generic microservice with Nameko
Getting ready
How to do it
How it works
There's more...
Creating a scraping microservice
How to do it
There's more...
Creating a scraper container
Getting ready
How to do it
How it works
Creating an API container
Getting ready
How to do it
There's more...
Composing and running the scraper locally with docker-compose
Getting ready
How to do it
There's more...
Making the Scraper as a Service Real
Introduction
Creating and configuring an Elastic Cloud trial account
How to do it
Accessing the Elastic Cloud cluster with curl
How to do it
Connecting to the Elastic Cloud cluster with Python
Getting ready
How to do it
There's more...
Performing an Elasticsearch query with the Python API 
Getting ready
How to do it
There's more...
Using Elasticsearch to query for jobs with specific skills
Getting ready
How to do it
Modifying the API to search for jobs by skill
How to do it
How it works
There's more...
Storing configuration in the environment 
How to do it
Creating an AWS IAM user and a key pair for ECS
Getting ready
How to do it
Configuring Docker to authenticate with ECR
Getting ready
How to do it
Pushing containers into ECR
Getting ready
How to do it
Creating an ECS cluster
How to do it
Creating a task to run our containers
Getting ready
How to do it
How it works
Starting and accessing the containers in AWS
Getting ready
How to do it
There's more...
Other Books You May Enjoy
Leave a review - let other readers know what you think
The internet contains a wealth of data. This data is provided both through structured APIs and as content delivered directly through websites. While the data in APIs is highly structured, information found in web pages is often unstructured and requires collection, extraction, and processing to be of value. Collecting data is also just the start of the journey, as that data must be stored, mined, and then exposed to others in a value-added form.
With this book, you will learn many of the core tasks needed to collect various forms of information from websites. We will cover how to collect it, how to perform several common data operations (including storage in local and remote databases), how to perform common media-based tasks such as converting images and videos to thumbnails, how to clean unstructured data with NLTK, how to examine several data mining and visualization tools, and finally the core skills of building a microservices-based scraper and API that can, and will, be run on the cloud.
Through a recipe-based approach, we will learn independent techniques to solve specific tasks involved in not only scraping but also data manipulation and management, data mining, visualization, microservices, containers, and cloud operations. These recipes will build skills in a progressive and holistic manner, not only teaching how to perform the fundamentals of scraping but also taking you from the results of scraping to a service offered to others through the cloud. We will be building an actual web-scraper-as-a-service using common tools in the Python, container, and cloud ecosystems.
This book is for those who want to learn to extract data from websites using the process of scraping and also how to work with various data management tools and cloud services. The coding will require basic skills in the Python programming language.
The book is also for those who wish to learn about a larger ecosystem of tools for retrieving, storing, and searching data, as well as using modern tools and Pythonic libraries to create data APIs and cloud services. You will also use Docker and Amazon Web Services to package and deploy a scraper to the cloud.
Chapter 1, Getting Started with Scraping, introduces several concepts and tools for web scraping. We will examine how to install and do basic tasks with tools such as requests, urllib, BeautifulSoup, Scrapy, PhantomJS and Selenium.
Chapter 2, Data Acquisition and Extraction, is based on an understanding of the structure of HTML and how to find and extract embedded data. We will cover many of the concepts in the DOM and how to find and extract data using BeautifulSoup, XPath, LXML, and CSS selectors. We also briefly examine working with Unicode / UTF8.
Chapter 3, Processing Data, teaches you to load and manipulate data in many formats, and then how to store that data in various data stores (S3, MySQL, PostgreSQL, and Elasticsearch). Data in web pages is represented in various formats, the most common being HTML, JSON, CSV, and XML. We will also examine the use of message queue systems, primarily AWS SQS, to help build robust data processing pipelines.
Chapter 4, Working with Images, Audio and other Assets, examines the means of retrieving multimedia items, storing them locally, and also performing several tasks such as OCR, generating thumbnails, making web page screenshots, audio extraction from videos, and finding all video URLs in a YouTube playlist.
Chapter 5, Scraping – Code of Conduct, covers several concepts involved in the legality of scraping, and practices for performing polite scraping. We will examine tools for processing robots.txt and sitemaps to respect the web host's desire for acceptable behavior. We will also examine the control of several facets of crawling, such as using delays, containing the depth and length of crawls, using user agents, and implementing caching to prevent repeated requests.
Chapter 6, Scraping Challenges and Solutions, covers many of the challenges involved in writing a robust scraper and how to handle many scenarios, including pagination, redirects, login forms, keeping the crawler within the same domain, retrying requests upon failure, and handling captchas.
Chapter 7, Text Wrangling and Analysis, examines various tools such as using NLTK for natural language processing and how to remove common noise words and punctuation. We often need to process the textual content of a web page to find information on the page that is part of the text and neither structured/embedded data nor multimedia. This requires knowledge of using various concepts and tools to clean and understand text.
Chapter 8, Searching, Mining, and Visualizing Data, covers several means of searching for data on the Web, storing and organizing data, and deriving results from the identified relationships. We will see how to understand the geographic locations of contributors to Wikipedia, finding relationships between actors on IMDB, and finding jobs on Stack Overflow that match specific technologies.
Chapter 9, Creating a Simple Data API, teaches us how to create a scraper as a service. We will create a REST API for a scraper using Flask. We will run the scraper as a service behind this API and be able to submit requests to scrape specific pages, in order to dynamically query data from a scrape as well as a local ElasticSearch instance.
Chapter 10, Creating Scraper Microservices with Docker, continues the growth of our scraper as a service by packaging the service and API in a Docker swarm and distributing requests across scrapers via a message queuing system (AWS SQS). We will also cover scaling of scraper instances up and down using Docker swarm tools.
Chapter 11, Making the Scraper as a Service Real, concludes by fleshing out the services created in the previous chapter to add a scraper that pulls together various concepts covered earlier. This scraper can assist in analyzing job posts on StackOverflow to find and compare employers using specified technologies. The service will collect posts and allow a query to find and compare those companies.
The primary tool required for the recipes in this book is a Python 3 interpreter. The recipes have been written using the free version of the Anaconda Python distribution, specifically version 3.6.1. Other Python version 3 distributions should work well but have not been tested.
The code in the recipes will often require the use of various Python libraries. These are all available for installation with pip (pip install). Wherever required, these installations are detailed in the recipes.
Several recipes require an Amazon AWS account. AWS accounts are available for the first year for free-tier access. The recipes will not require anything more than free-tier services. A new account can be created at https://portal.aws.amazon.com/billing/signup.
Several recipes will utilize Elasticsearch. There is a free, open source version available on GitHub at https://github.com/elastic/elasticsearch, with installation instructions on that page. Elastic.co also offers a fully capable version (also with Kibana and Logstash) hosted on the cloud with a 14-day free trial available at http://info.elastic.co (which we will utilize). There is a version for docker-compose with all x-pack features available at https://github.com/elastic/stack-docker, all of which can be started with a simple docker-compose up command.
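If you run Elasticsearch locally (either the GitHub build or the docker-compose stack mentioned above), a quick optional check that the cluster is reachable is a plain HTTP request against the default port. This is a supplementary sketch, assuming a default local installation on port 9200 with security disabled:

import requests

# Supplementary check, assuming Elasticsearch is running locally on the
# default port 9200 with no authentication required.
response = requests.get('http://localhost:9200')
print(response.json())   # prints cluster name, version, and other basic info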
Finally, several of the recipes use MySQL and PostgreSQL as database examples and several common clients for those databases. For those recipes, these will need to be installed locally. MySQL Community Server is available at https://dev.mysql.com/downloads/mysql/, and PostgreSQL can be found at https://www.postgresql.org/.
We will also look at creating and using docker containers for several of the recipes. Docker CE is free and is available at https://www.docker.com/community-edition.
You can download the example code files for this book from your account at www.packtpub.com. If you purchased this book elsewhere, you can visit www.packtpub.com/support and register to have the files emailed directly to you.
You can download the code files by following these steps:
Log in or register at www.packtpub.com.
Select the SUPPORT tab.
Click on Code Downloads & Errata.
Enter the name of the book in the Search box and follow the onscreen instructions.
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
WinRAR/7-Zip for Windows
Zipeg/iZip/UnRarX for Mac
7-Zip/PeaZip for Linux
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Python-Web-Scraping-Cookbook. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
Feedback from our readers is always welcome.
General feedback: Email [email protected] and mention the book title in the subject of your message. If you have questions about any aspect of this book, please email us at [email protected].
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details.
Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.
Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!
For more information about Packt, please visit packtpub.com.
In this chapter, we will cover the following topics:
Setting up a Python development environment
Scraping Python.org with Requests and Beautiful Soup
Scraping Python.org with urllib3 and Beautiful Soup
Scraping Python.org with Scrapy
Scraping Python.org with Selenium and PhantomJS
The amount of data available on the web is consistently growing, both in quantity and in form. Businesses require this data to make decisions, particularly with the explosive growth of machine learning tools, which require large amounts of data for training. Much of this data is available via Application Programming Interfaces (APIs), but at the same time a lot of valuable data is still only available through the process of web scraping.
This chapter will focus on several fundamentals of setting up a scraping environment and performing basic requests for data with several of the tools of the trade. Python is the programming language of choice for this book, as well as among many who build systems to perform scraping. It is an easy-to-use programming language with a very rich ecosystem of tools for many tasks. If you program in other languages, you will find it easy to pick up and you may never go back!
If you have not used Python before, it is important to have a working development environment. The recipes in this book are all in Python and are a mix of interactive examples and, primarily, scripts to be interpreted by the Python interpreter. This recipe will show you how to set up an isolated development environment with virtualenv and manage project dependencies with pip. We will also get the code for the book and install it into the Python virtual environment.
We will exclusively be using Python 3.x, specifically 3.6.1 in my case. Mac and Linux normally have Python version 2 installed, while Windows systems do not, so it is likely that Python 3 will need to be installed in any case. You can find references for Python installers at www.python.org.
You can check Python's version with python --version
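As a supplementary check (not part of the recipe), you can also confirm the interpreter and version from within Python itself:

# Supplementary sketch, not from the book: confirm which interpreter
# and version are currently in use from within Python.
import sys

print(sys.executable)    # path of the interpreter currently running
print(sys.version_info)  # e.g. sys.version_info(major=3, minor=6, ...)
assert sys.version_info >= (3, 6), "The recipes assume Python 3.6 or later"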
We will be installing a number of packages with pip. These packages are installed into a Python environment. There often can be version conflicts with other packages, so a good practice for following along with the recipes in the book will be to create a new virtual Python environment where the packages we will use will be ensured to work properly.
Virtual Python environments are managed with the virtualenv tool. This can be installed with the following command:
~ $ pip install virtualenv
Collecting virtualenv
  Using cached virtualenv-15.1.0-py2.py3-none-any.whl
Installing collected packages: virtualenv
Successfully installed virtualenv-15.1.0
Now we can use virtualenv. But before that, let's briefly look at pip. This command installs Python packages from PyPI, a package repository with tens of thousands of packages. We just used pip's install subcommand, which ensures a package is installed. We can also see all currently installed packages with pip list:
~ $ pip list
alabaster (0.7.9)
amqp (1.4.9)
anaconda-client (1.6.0)
anaconda-navigator (1.5.3)
anaconda-project (0.4.1)
aniso8601 (1.3.0)
I've truncated to the first few lines as there are quite a few. For me there are 222 packages installed.
Packages can also be uninstalled using pip uninstall followed by the package name. I'll leave it to you to give it a try. Now back to virtualenv. Using virtualenv is very simple. Let's use it to create an environment and install the code from GitHub. Let's walk through the steps:
Create a directory to represent the project and enter the directory.
~ $ mkdir pywscb
~ $ cd pywscb
Initialize a virtual environment folder named env:
pywscb $ virtualenv env
Using base prefix '/Users/michaelheydt/anaconda'
New python executable in /Users/michaelheydt/pywscb/env/bin/python
copying /Users/michaelheydt/anaconda/bin/python => /Users/michaelheydt/pywscb/env/bin/python
copying /Users/michaelheydt/anaconda/bin/../lib/libpython3.6m.dylib => /Users/michaelheydt/pywscb/env/lib/libpython3.6m.dylib
Installing setuptools, pip, wheel...done.
This creates an env folder. Let's take a look at what was installed.
pywscb $ ls -la env
total 8
drwxr-xr-x 6 michaelheydt staff 204 Jan 18 15:38 .
drwxr-xr-x 3 michaelheydt staff 102 Jan 18 15:35 ..
drwxr-xr-x 16 michaelheydt staff 544 Jan 18 15:38 bin
drwxr-xr-x 3 michaelheydt staff 102 Jan 18 15:35 include
drwxr-xr-x 4 michaelheydt staff 136 Jan 18 15:38 lib
-rw-r--r-- 1 michaelheydt staff 60 Jan 18 15:38 pip-selfcheck.json
Now we activate the virtual environment. This command uses the contents of the env folder to configure Python. After this, all Python activities are relative to this virtual environment.
pywscb $ source env/bin/activate
(env) pywscb $
We can check that python is indeed using this virtual environment with the following command:
(env) pywscb $ which python
/Users/michaelheydt/pywscb/env/bin/python
With our virtual environment created, let's clone the book's sample code and take a look at its structure.
(env) pywscb $ git clone https://github.com/PacktBooks/PythonWebScrapingCookbook.git
Cloning into 'PythonWebScrapingCookbook'...
remote: Counting objects: 420, done.
remote: Compressing objects: 100% (316/316), done.
remote: Total 420 (delta 164), reused 344 (delta 88), pack-reused 0
Receiving objects: 100% (420/420), 1.15 MiB | 250.00 KiB/s, done.
Resolving deltas: 100% (164/164), done.
Checking connectivity... done.
This created a PythonWebScrapingCookbook directory.
(env) pywscb $ ls -l
total 0
drwxr-xr-x 9 michaelheydt staff 306 Jan 18 16:21 PythonWebScrapingCookbook
drwxr-xr-x 6 michaelheydt staff 204 Jan 18 15:38 env
Let's change into it and examine the content.
(env) PythonWebScrapingCookbook $ ls -l
total 0
drwxr-xr-x 15 michaelheydt staff 510 Jan 18 16:21 py
drwxr-xr-x 14 michaelheydt staff 476 Jan 18 16:21 www
There are two directories. Most of the Python code is in the py directory. www contains some web content that we will use from time to time, served by a local web server. Let's look at the contents of the py directory:
(env) py $ ls -l
total 0
drwxr-xr-x 9 michaelheydt staff 306 Jan 18 16:21 01
drwxr-xr-x 25 michaelheydt staff 850 Jan 18 16:21 03
drwxr-xr-x 21 michaelheydt staff 714 Jan 18 16:21 04
drwxr-xr-x 10 michaelheydt staff 340 Jan 18 16:21 05
drwxr-xr-x 14 michaelheydt staff 476 Jan 18 16:21 06
drwxr-xr-x 25 michaelheydt staff 850 Jan 18 16:21 07
drwxr-xr-x 14 michaelheydt staff 476 Jan 18 16:21 08
drwxr-xr-x 7 michaelheydt staff 238 Jan 18 16:21 09
drwxr-xr-x 7 michaelheydt staff 238 Jan 18 16:21 10
drwxr-xr-x 9 michaelheydt staff 306 Jan 18 16:21 11
drwxr-xr-x 8 michaelheydt staff 272 Jan 18 16:21 modules
Code for each chapter is in the numbered folder matching the chapter (there is no code for Chapter 2, as it is all interactive Python).
Note that there is a modules folder. Some of the recipes throughout the book use code in those modules. Make sure that your Python path points to this folder. On Mac and Linux, you can set this in your .bash_profile file (and in the environment variables dialog on Windows):
export PYTHONPATH="/users/michaelheydt/dropbox/packt/books/pywebscrcookbook/code/py/modules"
export PYTHONPATH
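To confirm that the modules folder is actually visible to Python, a quick check along the following lines can help; the path shown is a hypothetical placeholder, so point it at your own clone of the repository:

# Supplementary sketch: verify that the modules folder is on the Python path.
# The path below is a placeholder for illustration; adjust it to your clone.
import sys
from pathlib import Path

modules_dir = Path.home() / "pywscb" / "PythonWebScrapingCookbook" / "py" / "modules"
print("modules on sys.path:", str(modules_dir) in sys.path)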
The contents of each folder generally follow a numbering scheme matching the sequence of recipes in the chapter. The following is the contents of the Chapter 6 folder:
(env) py $ ls -la 06
total 96
drwxr-xr-x 14 michaelheydt staff 476 Jan 18 16:21 .
drwxr-xr-x 14 michaelheydt staff 476 Jan 18 16:26 ..
-rw-r--r-- 1 michaelheydt staff 902 Jan 18 16:21 01_scrapy_retry.py
-rw-r--r-- 1 michaelheydt staff 656 Jan 18 16:21 02_scrapy_redirects.py
-rw-r--r-- 1 michaelheydt staff 1129 Jan 18 16:21 03_scrapy_pagination.py
-rw-r--r-- 1 michaelheydt staff 488 Jan 18 16:21 04_press_and_wait.py
-rw-r--r-- 1 michaelheydt staff 580 Jan 18 16:21 05_allowed_domains.py
-rw-r--r-- 1 michaelheydt staff 826 Jan 18 16:21 06_scrapy_continuous.py
-rw-r--r-- 1 michaelheydt staff 704 Jan 18 16:21 07_scrape_continuous_twitter.py
-rw-r--r-- 1 michaelheydt staff 1409 Jan 18 16:21 08_limit_depth.py
-rw-r--r-- 1 michaelheydt staff 526 Jan 18 16:21 09_limit_length.py
-rw-r--r-- 1 michaelheydt staff 1537 Jan 18 16:21 10_forms_auth.py
-rw-r--r-- 1 michaelheydt staff 597 Jan 18 16:21 11_file_cache.py
-rw-r--r-- 1 michaelheydt staff 1279 Jan 18 16:21 12_parse_differently_based_on_rules.py
In the recipes I'll state that we'll be using the script in <chapter directory>/<recipe filename>.
Now, just to be complete, if you want to get out of the Python virtual environment, you can exit using the following command:
(env) py $ deactivate
py $
And checking which python, we can see that it has switched back:
py $ which python
/Users/michaelheydt/anaconda/bin/python
Now let's move on to doing some scraping.
In this recipe we will install Requests and Beautiful Soup and scrape some content from www.python.org. We'll get some basic familiarity with both libraries here, and we'll come back to them in subsequent chapters to dive deeper into each.
In this recipe, we will scrape the upcoming Python events from https://www.python.org/events/pythonevents. The following is an example of the Python.org events page (it changes frequently, so your experience will differ):
We will need to ensure that Requests and Beautiful Soup are installed. We can do that with the following:
pywscb $ pip install requests
Downloading/unpacking requests
  Downloading requests-2.18.4-py2.py3-none-any.whl (88kB): 88kB downloaded
Downloading/unpacking certifi>=2017.4.17 (from requests)
  Downloading certifi-2018.1.18-py2.py3-none-any.whl (151kB): 151kB downloaded
Downloading/unpacking idna>=2.5,<2.7 (from requests)
  Downloading idna-2.6-py2.py3-none-any.whl (56kB): 56kB downloaded
Downloading/unpacking chardet>=3.0.2,<3.1.0 (from requests)
  Downloading chardet-3.0.4-py2.py3-none-any.whl (133kB): 133kB downloaded
Downloading/unpacking urllib3>=1.21.1,<1.23 (from requests)
  Downloading urllib3-1.22-py2.py3-none-any.whl (132kB): 132kB downloaded
Installing collected packages: requests, certifi, idna, chardet, urllib3
Successfully installed requests certifi idna chardet urllib3
Cleaning up...
pywscb $ pip install bs4
Downloading/unpacking bs4
  Downloading bs4-0.0.1.tar.gz
  Running setup.py (path:/Users/michaelheydt/pywscb/env/build/bs4/setup.py) egg_info for package bs4
We will dive into the details of both Requests and Beautiful Soup in the next chapter, but for now let's just summarize a few key points about how this works. The following are the important points about Requests:
Requests is used to execute HTTP requests. We used it to make a GET request of the URL for the events page.
The response object returned by Requests holds the results of the request. This is not only the page content, but also many other items about the result, such as HTTP status codes and headers.
Requests is used only to get the page; it does not do any parsing.
We use Beautiful Soup to do the parsing of the HTML and also the finding of content within the HTML.
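The full script is in the chapter's code folder; a minimal sketch of the approach, not the book's exact listing, looks roughly like the following. The element and class names it relies on (list-recent-events, event-location, and the nested h3, a, and time tags) are assumptions about python.org's markup at the time of writing and may have changed.

import requests
from bs4 import BeautifulSoup

def get_upcoming_events(url):
    response = requests.get(url)                        # fetch the page with an HTTP GET
    soup = BeautifulSoup(response.text, 'html.parser')  # parse the returned HTML
    # find the <ul> holding the events, then each <li> within it
    events = soup.find('ul', {'class': 'list-recent-events'}).find_all('li')
    for event in events:
        print({
            'name': event.find('h3').find('a').text,
            'location': event.find('span', {'class': 'event-location'}).text,
            'time': event.find('time').text,
        })

# the events listing URL; the exact path may differ from the one quoted above
get_upcoming_events('https://www.python.org/events/python-events/')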
To understand how this worked, the content of the page has the following HTML to start the Upcoming Events section:
We used the power of Beautiful Soup to:
Find the <ul> element representing the section, which is found by looking for a <ul> with a class attribute that has a value of list-recent-events.
From that object, we find all the <li> elements.
Each of these