


Hands-On Web Scraping with Python

Perform advanced scraping operations using various Python libraries and tools such as Selenium, Regex, and others

Anish Chapagain

BIRMINGHAM - MUMBAI

Hands-On Web Scraping with Python

Copyright © 2019 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Commissioning Editor: Sunith Shetty
Acquisition Editor: Aniruddha Patil
Content Development Editor: Roshan Kumar
Senior Editor: Ayaan Hoda
Technical Editor: Sushmeeta Jena
Copy Editor: Safis Editing
Project Coordinator: Namrata Swetta
Proofreader: Safis Editing
Indexer: Tejal Daruwale Soni
Production Designer: Alishon Mendonsa

First published: June 2019

Production reference: 2120619

Published by Packt Publishing Ltd. Livery Place 35 Livery Street Birmingham B3 2PB, UK.

ISBN 978-1-78953-339-2

www.packtpub.com

To my daughter, Aasira, and my family and friends. Special thanks to Ashish Chapagain, Peter, and Prof. W.J. Teahan. This book is dedicated to you all.
 

Packt.com

Subscribe to our online digital library for full access to over 7,000 books and videos, as well as industry-leading tools to help you plan your personal development and advance your career. For more information, please visit our website.

Why subscribe?

Spend less time learning and more time coding with practical eBooks and Videos from over 4,000 industry professionals

Improve your learning with Skill Plans built especially for you

Get a free eBook or video every month

Fully searchable for easy access to vital information

Copy and paste, print, and bookmark content

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.packt.com and, as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.

At www.packt.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks. 

Contributors

About the author

Anish Chapagain is a software engineer with a passion for data science, its processes, and Python programming, which began around 2007. He has been working with web scraping and analysis-related tasks for more than 5 years, and is currently pursuing freelance projects in the web scraping domain. Anish previously worked as a trainer, web/software developer, and as a banker, where he was exposed to data and gained further insights into topics including data analysis, visualization, data mining, information processing, and knowledge discovery. He has an MSc in computer systems from Bangor University (University of Wales), United Kingdom, and an Executive MBA from Himalayan Whitehouse International College, Kathmandu, Nepal.

About the reviewers

Radhika Datar has more than 5 years' experience in software development and content writing. She is well versed in frameworks such as Python, PHP, and Java, and regularly provides training on them. She has been working with Educba and Eduonix as a training consultant since June 2016, while also working as a freelance academic writer in data science and data analytics. She obtained her master's degree from the Symbiosis Institute of Computer Studies and Research and her bachelor's degree from K. J. Somaiya College of Science and Commerce.

Rohit Negi completed his bachelor of technology in computer science from Uttarakhand Technical University, Dehradun. His bachelor's curriculum included a specialization in computer science and applied engineering. Currently, he is working as a senior test consultant at Orbit Technologies and provides test automation solutions to LAM Research (USA clients). He has extensive quality assurance proficiency working with the following tools: Microsoft Azure VSTS, Selenium, Cucumber/BDD, MS SQL/MySQL, Java, and web scraping using Selenium. Additionally, he has a good working knowledge of how to automate workflows using Selenium, Protractor for AngularJS-based applications, Python for exploratory data analysis, and machine learning.

Packt is searching for authors like you

If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.

Table of Contents

Title Page

Copyright and Credits

Hands-On Web Scraping with Python

Dedication

About Packt

Why subscribe?

Contributors

About the author

About the reviewers

Packt is searching for authors like you

Preface

Who this book is for

What this book covers

To get the most out of this book

Download the example code files

Download the color images

Conventions used

Get in touch

Reviews

Section 1: Introduction to Web Scraping

Web Scraping Fundamentals

Introduction to web scraping

Understanding web development and technologies

HTTP

HTML 

HTML elements and attributes

Global attributes

XML

JavaScript

JSON

CSS

AngularJS

Data finding techniques for the web

HTML page source

Case 1

Case 2

Developer tools

Sitemaps

The robots.txt file

Summary

Further reading

Section 2: Beginning Web Scraping

Python and the Web – Using urllib and Requests

Technical requirements

Accessing the web with Python

Setting things up

Loading URLs

URL handling and operations with urllib and requests

urllib

requests

Implementing HTTP methods

GET

POST

Summary

Further reading

Using LXML, XPath, and CSS Selectors

Technical requirements

Introduction to XPath and CSS selector

XPath

CSS selectors

Element selectors

ID and class selectors

Attribute selectors

Pseudo selectors

Using web browser developer tools for accessing web content

HTML elements and DOM navigation

XPath and CSS selectors using DevTools

Scraping using lxml, a Python library

lxml by examples

Example 1 – reading XML from file and traversing through its elements

Example 2 – reading HTML documents using lxml.html

Example 3 – reading and parsing HTML for retrieving HTML form type element attributes

Web scraping using lxml

Example 1 – extracting selected data from a single page using lxml.html.xpath

Example 2 – looping with XPath and scraping data from multiple pages

Example 3 – using lxml.cssselect to scrape content from a page

Summary

Further reading

Scraping Using pyquery – a Python Library

Technical requirements

Introduction to pyquery

Exploring pyquery

Loading documents

Element traversing, attributes, and pseudo-classes

Iterating

Web scraping using pyquery

Example 1 – scraping data science announcements

Example 2 – scraping information from nested links

Example 3 – extracting AHL Playoff results

Example 4 – collecting URLs from sitemap.xml

Case 1 – using the HTML parser

Case 2 – using the XML parser

Summary

Further reading

Web Scraping Using Scrapy and Beautiful Soup

Technical requirements

Web scraping using Beautiful Soup

Introduction to Beautiful Soup

Exploring Beautiful Soup

Searching, traversing, and iterating

Using children and parents

Using next and previous

Using CSS Selectors

Example 1 – listing <li> elements with the data-id attribute 

Example 2 – traversing through elements

Example 3 – searching elements based on attribute values

Building a web crawler

Web scraping using Scrapy

Introduction to Scrapy

Setting up a project

Generating a Spider

Creating an item

Extracting data

Using XPath

Using CSS Selectors

Data from multiple pages

Running and exporting

Deploying a web crawler

Summary

Further reading

Section 3: Advanced Concepts

Working with Secure Web

Technical requirements

Introduction to secure web

Form processing

Cookies and sessions

Cookies

Sessions

User authentication

HTML <form> processing

Handling user authentication

Working with cookies and sessions

Summary

Further reading

Data Extraction Using Web-Based APIs

Technical requirements

Introduction to web APIs

REST and SOAP

REST 

SOAP 

Benefits of web APIs

Accessing web API and data formats

Making requests to the web API using a web browser

Case 1 – accessing a simple API (request and response)

Case 2 – demonstrating status codes and informative responses from the API

Case 3 – demonstrating RESTful API cache functionality

Web scraping using APIs

Example 1 – searching and collecting university names and URLs

Example 2 – scraping information from GitHub events

Summary

Further reading

Using Selenium to Scrape the Web

Technical requirements

Introduction to Selenium

Selenium projects

Selenium WebDriver

Selenium RC

Selenium Grid

Selenium IDE

Setting things up

Exploring Selenium

Accessing browser properties

Locating web elements

Using Selenium for web scraping

Example 1 – scraping product information

Example 2 – scraping book information

Summary

Further reading

Using Regex to Extract Data

Technical requirements

Overview of regular expressions

Regular expressions and Python

Using regular expressions to extract data

Example 1 – extracting HTML-based content

Example 2 – extracting dealer locations

Example 3 – extracting XML content

Summary

Further reading

Section 4: Conclusion

Next Steps

Technical requirements

Managing scraped data

Writing to files

Analysis and visualization using pandas and matplotlib

Machine learning 

ML and AI

Python and ML

Types of ML algorithms

Supervised learning

Classification

Regression

Unsupervised learning

Association

Clustering

Reinforcement learning

Data mining 

Tasks of data mining

Predictive

Classification

Regression

Prediction 

Descriptive

Clustering

Summarization

Association rules

What's next?

Summary 

Further reading

Other Books You May Enjoy

Leave a review - let other readers know what you think

Preface

Web scraping is an essential technique used in many organizations to gather valuable data from web pages. Web scraping, or web harvesting, is done with a view to extracting and collecting data from websites. Web scraping comes in handy with model development, which requires data to be collected on the fly. It is also useful when the data required must be current, relevant to the topic, and accurate over the short term, as opposed to working from pre-existing datasets. Collected data is stored in files such as JSON, CSV, and XML, written to a database for later use, or made available online as datasets. This book will open the gates for you in terms of delving deep into web scraping techniques and methodologies using Python libraries and other popular tools, such as Selenium. By the end of this book, you will have learned how to efficiently scrape different websites.

Who this book is for

This book is intended for Python programmers, data analysts, web scraping newbies, and anyone who wants to learn how to perform web scraping from scratch. If you want to begin your journey in applying web scraping techniques to a range of web pages, then this book is what you need!

What this book covers

Chapter 1, Web Scraping Fundamentals, explores some core technologies and tools that are relevant to the WWW and that are required for web scraping.

Chapter 2, Python and the Web – Using urllib and Requests, demonstrates some of the core features available through Python libraries such as requests and urllib, in addition to exploring page contents in various formats and structures.

Chapter 3, Using LXML, XPath, and CSS Selectors, describes various examples using LXML, implementing a variety of techniques and library features to deal with elements and ElementTree. 

Chapter 4, Scraping Using pyquery – a Python Library, goes into more detail regarding web scraping techniques and a number of new Python libraries that deploy these techniques.

Chapter 5, Web Scraping Using Scrapy and Beautiful Soup, examines various aspects of traversing web documents using Beautiful Soup, while also exploring a framework that was built for crawling activities using spiders, in other words, Scrapy.

Chapter 6, Working with Secure Web, covers a number of basic security-related measures and techniques that are often encountered and that pose a challenge to web scraping.

Chapter 7, Data Extraction Using Web-Based APIs, covers the Python programming language and how to interact with the web APIs with regard to data extraction.

Chapter 8, Using Selenium to Scrape the Web, covers Selenium and how to use it to scrape data from the web.

Chapter 9, Using Regex to Extract Data, goes into more detail regarding web scraping techniques using regular expressions.

Chapter 10, Next Steps, introduces and examines basic concepts regarding data management using files, and analysis and visualization using pandas and matplotlib, while also providing an introduction to machine learning and data mining and exploring a number of related resources that can be helpful in terms of further learning and career development. 

To get the most out of this book

Readers should have some working knowledge of the Python programming language.

Download the example code files

You can download the example code files for this book from your account at www.packt.com. If you purchased this book elsewhere, you can visit www.packt.com/support and register to have the files emailed directly to you.

You can download the code files by following these steps:

1. Log in or register at www.packt.com.
2. Select the SUPPORT tab.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box and follow the onscreen instructions.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

WinRAR/7-Zip for Windows

Zipeg/iZip/UnRarX for Mac

7-Zip/PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Hands-On-Web-Scraping-with-Python. In case there's an update to the code, it will be updated on the existing GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Download the color images

We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: https://www.packtpub.com/sites/default/files/downloads/9781789533392_ColorImages.pdf.

Conventions used

There are a number of text conventions used throughout this book.

CodeInText: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "The <p> and <h1> HTML elements contain general text information (element content) with them."

A block of code is set as follows:

import requests
link = "http://localhost:8080/~cache"
queries = {'id': '123456', 'display': 'yes'}
addedheaders = {'user-agent': ''}

When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

import requests
link = "http://localhost:8080/~cache"
queries = {'id': '123456', 'display': 'yes'}
addedheaders = {'user-agent': ''}

Any command-line input or output is written as follows:

C:\> pip --version

pip 18.1 from c:\python37\lib\site-packages\pip (python 3.7)

Bold: Indicates a new term, an important word, or words that you see on screen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: "If accessing Developer tools through the Chrome menu, click More tools | Developer tools."

Warnings or important notes appear like this.
Tips and tricks appear like this.

Get in touch

Feedback from our readers is always welcome.

General feedback: If you have questions about any aspect of this book, mention the book title in the subject of your message and email us at [email protected].

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packt.com/submit-errata, select your book, click on the Errata Submission Form link, and enter the details.

Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in, and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Reviews

Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!

For more information about Packt, please visit packt.com.

Section 1: Introduction to Web Scraping

In this section, you will be given an overview of web scraping (scraping requirements, the importance of data), web contents (patterns and layouts), Python programming and libraries (the basics and advanced), and data managing techniques (file handling and databases).

This section consists of the following chapter:

Chapter 1, Web Scraping Fundamentals

Web Scraping Fundamentals

In this chapter, we will learn about and explore certain fundamental concepts related to web scraping and web-based technologies, assuming that you have no prior experience of web scraping. 

So, to start with, let's begin by asking a number of questions: 

Why is there a growing need or demand for data?

How are we going to manage and fulfill the requirement for data with resources from the World Wide Web (WWW)?

Web scraping addresses both these questions, as it provides various tools and technologies that can be deployed to extract data or assist with information retrieval. Whether it's web-based structured or unstructured data, we can use the web scraping process to extract it and use it for research, analysis, personal collections, information extraction, knowledge discovery, and many more purposes.

We will learn general techniques that are deployed to find data from the web and explore those techniques in depth using the Python programming language in the chapters ahead.

In this chapter, we will cover the following topics:

Introduction to web scraping

Understanding web development and technologies

Data finding techniques

Introduction to web scraping

Scraping is the process of extracting, copying, screening, or collecting data. Scraping or extracting data from the web (commonly known as websites or web pages, or internet-related resources) is normally termed web scraping.

Web scraping is a process of data extraction from the web that is suited to certain requirements. Data collection and analysis, and its involvement in information and decision making, plus research-related activities, make the scraping process significant for all types of industry.

The popularity of the internet and its resources is causing information domains to evolve every day, which is also causing a growing demand for raw data. Data is the basic requirement in the fields of science, technology, and management. Collected or organized data is processed with varying degrees of logic to obtain information and gain further insights.

Web scraping provides the tools and techniques used to collect data from websites as appropriate for either personal or business-related needs, but with a number of legal considerations. 

There are a number of legal factors to consider before performing scraping tasks. Most websites contain pages such as Privacy Policy, About Us, and Terms and Conditions, where legal terms, prohibited content policies, and general information are available. It's a developer's ethical duty to follow those policies before planning any crawling and scraping activities from websites.

Scraping and crawling are both used quite interchangeably throughout the chapters in this book. Crawling, also known as spidering, is a process used to browse through the links on websites and is often used by search engines for indexing purposes, whereas scraping is mostly related to content extraction from websites. 

Understanding web development and technologies

A web page is not only a document container. Today's rapid developments in computing and web technologies have transformed the web into a dynamic and real-time source of information.

At our end, we (the users) use web browsers (such as Google Chrome, Mozilla Firefox, Internet Explorer, and Safari) to access information from the web. Web browsers provide various document-based functionalities to users and contain application-level features that are often useful to web developers.

Web pages that users view or explore through their browsers are not only single documents. Various technologies exist that can be used to develop websites or web pages. A web page is a document that contains blocks of HTML tags. Most of the time, it is built with various sub-blocks linked as dependent or independent components from various interlinked technologies, including JavaScript and CSS. 

An understanding of the general concepts of web pages and the techniques of web development, along with the technologies found inside web pages, will provide more flexibility and control in the scraping process. A lot of the time, a developer can also employ reverse engineering techniques.

Reverse engineering is an activity that involves breaking down and examining the concepts that were required to build certain products. For more information on reverse engineering, please refer to the GlobalSpec article, How Does Reverse Engineering Work?, available at https://insights.globalspec.com/article/7367/how-does-reverse-engineering-work.

Here, we will introduce and explore a few of the techniques that can help and guide us in the process of data extraction.

HTTP

Hypertext Transfer Protocol (HTTP) is an application protocol that transfers resources, such as HTML documents, between a client and a web server. HTTP is a stateless protocol that follows the client-server model. Clients (web browsers) and web servers communicate or exchange information using HTTP requests and HTTP responses:

HTTP (client-server communication)

With HTTP requests or HTTP methods, a client or browser submits requests to the server. There are various methods (also known as HTTP request methods) for submitting requests, such as GET, POST, and PUT:

GET: This is a common method for requesting information. It is considered a safe method, as the resource state is not altered. It is also used to provide query strings, such as http://www.test-domain.com/, requesting information from servers based on the id and display parameters sent with the request.

POST: This is used to make a secure request to the server. The requested resource state can be altered. Data posted or sent to the requested URL is not visible in the URL, but is transferred with the request body. It is used to submit information to the server in a secure way, such as for login and user registration.
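To make this concrete, here is a minimal sketch using the requests library (covered in Chapter 2, Python and the Web – Using urllib and Requests); the URL and the id and display parameters are the placeholder values from the text, and the /login path and credentials are hypothetical:

import requests

# GET: parameters travel in the query string and are visible in the URL.
response = requests.get('http://www.test-domain.com/',
                        params={'id': '123456', 'display': 'yes'})
print(response.url)           # http://www.test-domain.com/?id=123456&display=yes
print(response.status_code)

# POST: data travels in the request body and is not visible in the URL.
# The path and form fields below are hypothetical examples.
response = requests.post('http://www.test-domain.com/login',
                         data={'username': 'user', 'password': 'pass'})
print(response.status_code)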

Using the browser developer tools shown in the following screenshot, the Request Method can be revealed, along with other HTTP-related information:

General HTTP headers (accessed using the browser developer tools)

We will explore more about HTTP methods in Chapter 2, Python and the Web – Using urllib and Requests, in the Implementing HTTP methods section.

HTTP headers pass additional information to a client or server while performing a request or response. Headers are generally name-value pairs of information transferred between a client and a server during their communication, and are generally grouped into request and response headers:

Request Headers: These are headers used for making requests. Information such as language and encoding preferences, referrers, cookies, browser-related information, and so on is provided to the server while making the request. The following screenshot displays the Request Headers obtained from the browser developer tools while making a request to https://www.python.org:

Request headers (accessed using the browser developer tools)

Response Headers: These headers contain information about the server's response. Information regarding the response (including size, type, and date) and the server status is generally found in Response Headers. The following screenshot displays the Response Headers obtained from the browser developer tools after making a request to https://www.python.org:

Response headers (accessed using the browser developer tools)

The information seen in the previous screenshots was captured during the request made to https://www.python.org. 

HTTP Requests can also be provided with the required HTTP Headers while making requests to the server. Information related to the request URL, request method, status code, request headers, query string parameters, cookies, POST parameters, and server details can generally be explored using HTTP Headers information.
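As a quick illustration, here is a minimal sketch with the requests library that attaches a custom request header (the user-agent string is an arbitrary example) and inspects the server's response headers:

import requests

headers = {'user-agent': 'my-scraper/0.1'}   # arbitrary example value
response = requests.get('https://www.python.org', headers=headers)
print(response.request.headers['user-agent'])   # the request header we sent
print(response.headers.get('Content-Type'))     # a response header from the server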

With HTTP responses, the server processes the requests, and sometimes the specified HTTP headers, that are sent to it. Once a request is received and processed, the server returns its response to the browser.

A response contains status codes, the meaning of which can be revealed using developer tools, as seen in the previous screenshots. The following list contains a few status codes along with some brief information:

200 (OK, request succeeded)

404 (Not found; requested resource cannot be found)

500 (Internal server error)

204 (No content to be sent)

401 (Unauthorized request was made to the server)
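In code, a response's status code can be checked directly; here is a minimal sketch using the requests library:

import requests

response = requests.get('https://www.python.org')
print(response.status_code)   # 200 when the request succeeds
response.raise_for_status()   # raises an HTTPError for 4xx/5xx responses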

For more information on HTTP, HTTP responses, and status codes, please consult the official documentation at https://www.w3.org/Protocols/ and https://developer.mozilla.org/en-US/docs/Web/HTTP/Status.

HTTP cookies are data sent by the server to the browser. Cookies are generated and stored by websites on your system or computer. Data in cookies helps to identify HTTP requests from the user to the website. Cookies contain information regarding session management, user preferences, and user behavior.

The server identifies and communicates with the browser based on the information stored in the cookie. Data stored in cookies helps a website to access and transfer certain saved values such as session ID, expiration date and time, and so on, providing quick interaction between the web request and the response:

Cookies set by a website (accessed using the browser developer tools)
For more information on cookies, please visit allaboutcookies at http://www.allaboutcookies.org/.
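As a brief sketch in Python (one common approach, using the requests library's Session object), cookies set by the server can be captured and resent automatically:

import requests

# A Session stores cookies the server sets and resends them
# automatically on subsequent requests to the same site.
session = requests.Session()
response = session.get('https://www.python.org')
print(response.status_code)
print(session.cookies.get_dict())   # cookies the server has set, if any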

With HTTP proxies, a proxy server acts as an intermediate server between a client and the main web server. The web browser sends requests to the server that are actually passed through the proxy, and the proxy returns the response from the server to the client.

Proxies are often used for monitoring/filtering, performance improvement, translation, and security for internet-related resources. Proxies can also be bought as a service, which may also be used to deal with cross-domain resources. There are also various forms of proxy implementation, such as web proxies (which can be used to bypass IP blocking), CGI proxies, and DNS proxies.
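For illustration, here is a minimal sketch that routes a requests call through a proxy; the proxy addresses below are placeholders rather than real servers:

import requests

# Placeholder proxy addresses; substitute a proxy you actually have access to.
proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:1080',
}
response = requests.get('https://www.python.org', proxies=proxies)
print(response.status_code)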

Cookie-based parameters that are passed in using GET requests, HTML form-related POST requests, and modifying or adapting headers will be crucial in managing code (that is, scripts) and accessing content during the web scraping process.

Details on HTTP, headers, cookies, and so on will be explored more in the upcoming Data finding techniques for the web section. Please visit MDN web docs-HTTP (https://developer.mozilla.org/en-US/docs/Web/HTTP) for more detailed information on HTTP.

HTML 

Websites are made up of pages or documents containing text, images, style sheets, and scripts, among other things. They are often built with markup languages such as Hypertext Markup Language (HTML) and Extensible Hypertext Markup Language (XHTML). 

HTML is often termed the standard markup language used for building a web page. Since the early 1990s, HTML has been used independently, as well as in conjunction with server-based scripting languages such as PHP, ASP, and JSP.

XHTML is an advanced and extended version of HTML, which is the primary markup language for web documents. XHTML is also stricter than HTML, and from the coding perspective, is an XML application. 

HTML defines and contains the contents of a web page. Data that can be extracted, and any information-revealing data sources, can be found inside HTML pages within a predefined instruction set or markup elements called tags. HTML tags are normally named placeholders carrying certain predefined attributes.

Global attributes

HTML elements can contain additional information in the form of key/value pairs, known as HTML element attributes. Attributes hold values and provide identification, or carry additional information that can be helpful in many aspects of scraping, such as identifying exact web elements and extracting values or text from them, or traversing through elements.

There are certain attributes that are common to, or can be applied to, all HTML elements, as listed here. These attributes are identified as global attributes (https://developer.mozilla.org/en-US/docs/Web/HTML/Global_attributes):

id

class

style

lang

 

HTML element attributes such as id and class are mostly used to identify or format individual elements, or groups of elements. These attributes can also be managed by CSS and other scripting languages.

id attribute values should be unique to the element they're applied to. class attribute values are mostly used with CSS, applying the same formatting to every matching element, and can be used with multiple elements.

When used with CSS, and in traversing and parsing techniques, attributes such as id and class are identified by placing # and ., respectively, in front of the attribute value.

HTML element attributes can also be overwritten or implemented dynamically using scripting languages.

As displayed in the following example, itemprop attributes are used to add properties to an element, whereas data-* attributes are used to store data that is native to the element itself:

<div itemscope itemtype="http://schema.org/Place">
  <h1 itemprop="university">University of Helsinki</h1>
  <span>Subject: <span itemprop="subject1">Artificial Intelligence</span></span>
  <span itemprop="subject2">Data Science</span>
</div>
<img class="dept" src="logo.png" data-course-id="324" data-title="Predictive Analysis"
  data-x="12345" data-y="54321" data-z="56743" onclick="schedule.load()">

HTML tags and attributes are a major source of data when it comes to extraction.
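For example, here is a minimal sketch with the lxml library (explored in Chapter 3, Using LXML, XPath, and CSS Selectors) that parses the <img> element from the preceding snippet and reads its attribute values:

from lxml import html

# Parse a fragment of the snippet above and read its attributes.
snippet = ('<img class="dept" src="logo.png" data-course-id="324" '
           'data-title="Predictive Analysis">')
img = html.fromstring(snippet)
print(img.get('class'))            # dept
print(img.get('data-course-id'))   # 324
print(img.get('data-title'))       # Predictive Analysis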

Please visit https://www.w3.org/html/ and https://www.w3schools.com/html/ for more information on HTML.

In the chapters ahead, we will explore these attributes using different tools. We will also perform various logical operations and use them to extract content.

XML

Extensible Markup Language (XML) is a markup language used for distributing data over the internet, with a set of rules for encoding documents that are readable and easily exchangeable between machines and documents. 

XML can use textual data across various formats and systems. XML is designed to carry portable data stored in tags that are not predefined, as they are with HTML tags. In XML documents, tags are created by the document developer or an automated program to describe the content they are carrying.

The following code displays some example XML content. The <employees> parent node has three <employee> child nodes, which in turn contain the other child nodes <firstName>, <lastName>, and <gender>:

<employees>
  <employee>
    <firstName>Rahul</firstName>
    <lastName>Reddy</lastName>
    <gender>Male</gender>
  </employee>
  <employee>
    <firstName>Aasira</firstName>
    <lastName>Chapagain</lastName>
    <gender>Female</gender>
  </employee>
  <employee>
    <firstName>Peter</firstName>
    <lastName>Lara</lastName>
    <gender>Male</gender>
  </employee>
</employees>

XML is an open standard, using the Unicode character set. XML is used for sharing data across various platforms and has been adopted by various web applications. Many websites use XML data, implementing its contents with the use of scripting languages and presenting it in HTML or other document formats for the end user to view.

Extraction tasks from XML documents can also be performed to obtain the contents in the desired format, or by filtering the content with respect to a specific need for data. In addition, behind-the-scenes data may be obtainable from certain websites only in this form.
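As a short sketch using Python's standard library (the book later uses lxml for similar tasks), the employee records shown above can be traversed as follows:

import xml.etree.ElementTree as ET

xml_content = '''<employees>
<employee><firstName>Rahul</firstName><lastName>Reddy</lastName><gender>Male</gender></employee>
<employee><firstName>Aasira</firstName><lastName>Chapagain</lastName><gender>Female</gender></employee>
<employee><firstName>Peter</firstName><lastName>Lara</lastName><gender>Male</gender></employee>
</employees>'''

root = ET.fromstring(xml_content)
for employee in root.findall('employee'):   # iterate the <employee> child nodes
    print(employee.find('firstName').text, employee.find('gender').text)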

Please visit https://www.w3.org/XML/ and https://www.w3schools.com/xml/ for more information on XML. 

JSON

JavaScript Object Notation (JSON) is a format used for storing and transporting data from a server to a web page. It is language independent and is popular in web-based data-interchange actions due to its size and readability.

JSON data is normally a name/value pair that is evaluated as a JavaScript object and follows JavaScript operations. JSON and XML are often compared, as they both carry and exchange data between various web resources. JSON is also ranked higher than XML for its structure, which is simple, readable, self-descriptive, understandable, and easy to process. For web applications using JavaScript, AJAX, or RESTful services, JSON is preferred over XML due to its fast and easy operation. 

JSON and JavaScript objects are interchangeable. JSON is not a markup language, and it doesn't contain any tags or attributes. Instead, it is a text-only format that can be sent to or accessed through a server, as well as being managed by any programming language. JSON objects can also be expressed as arrays, dictionaries, and lists, as seen in the following code:

{"mymembers":[
  {"firstName":"Aasira", "lastName":"Chapagain", "cityName":"Kathmandu"},
  {"firstName":"Rakshya", "lastName":"Dhungel", "cityName":"New Delhi"},
  {"firstName":"Shiba", "lastName":"Paudel", "cityName":"Biratnagar"},
  {"firstName":"Rahul", "lastName":"Reddy", "cityName":"New Delhi"},
  {"firstName":"Peter", "lastName":"Lara", "cityName":"Trinidad"}
]}

JSON Lines: This is a JSON-like format where each line of a record is a valid JSON value. It is also known as newline-delimited JSON, that is, individual JSON records separated by newline (\n) characters. JSON Lines formatting can be very useful when dealing with a large volume of data. 

Data sources in the JSON or JSON Lines formats are preferred to XML because of the easy data pattern and code readability, which can also be managed with minimum programming effort:

{"firstName":"Aasira", "lastName":"Chapagain", "cityName":"Kathmandu"}
{"firstName":"Rakshya", "lastName":"Dhungel", "cityName":"New Delhi"}
{"firstName":"Shiba", "lastName":"Paudel", "cityName":"Biratnagar"}
{"firstName":"Rahul", "lastName":"Reddy", "cityName":"New Delhi"}
{"firstName":"Peter", "lastName":"Lara", "cityName":"Trinidad"}
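Both formats can be parsed with a few lines of Python; here is a minimal sketch using the standard json module:

import json

# A JSON document is parsed in a single call...
document = '{"mymembers": [{"firstName": "Aasira", "cityName": "Kathmandu"}]}'
data = json.loads(document)
for member in data['mymembers']:
    print(member['firstName'], member['cityName'])

# ...while JSON Lines is parsed one line (one record) at a time.
json_lines = '{"firstName": "Rakshya"}\n{"firstName": "Shiba"}'
for line in json_lines.splitlines():
    print(json.loads(line)['firstName'])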

From the perspective of data extraction, because of the lightweight and simple structure of the JSON format, web pages use JSON content with their scripting technologies to add dynamic features. 

Please visit http://www.json.org/, http://jsonlines.org/, and https://www.w3schools.com/js/js_json_intro.asp for more information regarding JSON and JSON Lines.

CSS

The web-based technologies we have introduced so far deal with content, content binding, content development, and processing. Cascading Style Sheets (CSS) describes the display properties of HTML elements and the appearance of web pages. CSS is used for styling and providing the desired appearance and presentation of HTML elements.

Developers/designers can control the layout and presentation of a web document using CSS. CSS can be applied to a distinct element in a page, or it can be embedded through a separate document. Styling details can be described using the <style> tag.

The <style> tag can contain details targeting repeated and various elements in a block. As seen in the following code, multiple <a> elements exist, which also possess the class and id global attributes:

<html>
<head>
  <style>
    a {color:blue;}
    h1 {color:black; text-decoration:underline;}
    #idOne {color:red;}
    .classOne {color:orange;}
  </style>
</head>
<body>
  <h1> Welcome to Web Scraping </h1>
  Links:
  <a href="https://www.google.com"> Google </a>
  <a class='classOne' href="https://www.yahoo.com"> Yahoo </a>
  <a id='idOne' href="https://www.wikipedia.org"> Wikipedia </a>
</body>
</html>

Attributes that are provided with CSS properties or have been styled inside <style> tags in the preceding code block will result in the output seen here:

HTML output (with the elements styled using CSS)
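The same selector syntax matters for scraping. Here is a minimal sketch (using lxml together with the cssselect package, both covered in Chapter 3) that locates the links above by element, class, and id selectors:

from lxml import html

# The links from the example above, as a small document.
doc = html.fromstring('''
<body>
  Links:
  <a href="https://www.google.com"> Google </a>
  <a class="classOne" href="https://www.yahoo.com"> Yahoo </a>
  <a id="idOne" href="https://www.wikipedia.org"> Wikipedia </a>
</body>''')

print([a.text for a in doc.cssselect('a')])   # element selector: every link
print(doc.cssselect('a.classOne')[0].text)    # class selector (.classOne)
print(doc.cssselect('a#idOne')[0].text)       # id selector (#idOne)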

CSS properties can also appear inline with each particular element. Inline CSS properties override external CSS styles. The CSS color property below has been applied inline to an element. This will override the color value defined inside <style>:

<h1 style='color:orange;'> Welcome to Web Scraping </h1> Links: