Web scraping is an essential technique used in many organizations to gather valuable data from web pages. This book will enable you to delve into web scraping techniques and methodologies.
The book will introduce you to the fundamental concepts of web scraping techniques and how they can be applied to multiple sets of web pages. You'll use powerful libraries from the Python ecosystem such as Scrapy, lxml, pyquery, and bs4 to carry out web scraping operations. You will then get up to speed with simple to intermediate scraping operations such as identifying information from web pages and using patterns or attributes to retrieve information. This book adopts a practical approach to web scraping concepts and tools, guiding you through a series of use cases and showing you how to use the best tools and techniques to efficiently scrape web pages. You'll even cover the use of other popular web scraping tools, such as Selenium, Regex, and web-based APIs.
By the end of this book, you will have learned how to efficiently scrape the web using different techniques with Python and other popular tools.
Copyright © 2019 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Commissioning Editor: Sunith Shetty
Acquisition Editor: Aniruddha Patil
Content Development Editor: Roshan Kumar
Senior Editor: Ayaan Hoda
Technical Editor: Sushmeeta Jena
Copy Editor: Safis Editing
Project Coordinator: Namrata Swetta
Proofreader: Safis Editing
Indexer: Tejal Daruwale Soni
Production Designer: Alishon Mendonsa
First published: June 2019
Production reference: 2120619
Published by Packt Publishing Ltd.
Livery Place, 35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78953-339-2
www.packtpub.com
Packt.com
Subscribe to our online digital library for full access to over 7,000 books and videos, as well as industry-leading tools to help you plan your personal development and advance your career. For more information, please visit our website.
Spend less time learning and more time coding with practical eBooks and Videos from over 4,000 industry professionals
Improve your learning with Skill Plans built especially for you
Get a free eBook or video every month
Fully searchable for easy access to vital information
Copy and paste, print, and bookmark content
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.packt.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.
At www.packt.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.
Anish Chapagain is a software engineer with a passion for data science, its processes, and Python programming, which began around 2007. He has been working with web scraping and analysis-related tasks for more than 5 years, and is currently pursuing freelance projects in the web scraping domain. Anish previously worked as a trainer, web/software developer, and as a banker, where he was exposed to data and gained further insights into topics including data analysis, visualization, data mining, information processing, and knowledge discovery. He has an MSc in computer systems from Bangor University (University of Wales), United Kingdom, and an Executive MBA from Himalayan Whitehouse International College, Kathmandu, Nepal.
Radhika Datar has more than 5 years' experience in software development and content writing. She is well versed in languages such as Python, PHP, and Java, and regularly provides training on them. She has been working with Educba and Eduonix as a training consultant since June 2016, while also working as a freelance academic writer in data science and data analytics. She obtained her master's degree from the Symbiosis Institute of Computer Studies and Research and her bachelor's degree from K. J. Somaiya College of Science and Commerce.
Rohit Negi completed his bachelor of technology in computer science from Uttarakhand Technical University, Dehradun. His bachelor's curriculum included a specialization in computer science and applied engineering. Currently, he is working as a senior test consultant at Orbit Technologies and provides test automation solutions to LAM Research (USA clients). He has extensive quality assurance proficiency working with the following tools: Microsoft Azure VSTS, Selenium, Cucumber/BDD, MS SQL/MySQL, Java, and web scraping using Selenium. Additionally, he has a good working knowledge of how to automate workflows using Selenium, Protractor for AngularJS-based applications, Python for exploratory data analysis, and machine learning.
If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.
Title Page
Copyright and Credits
Hands-On Web Scraping with Python
Dedication
About Packt
Why subscribe?
Contributors
About the author
About the reviewers
Packt is searching for authors like you
Preface
Who this book is for
What this book covers
To get the most out of this book
Download the example code files
Download the color images
Conventions used
Get in touch
Reviews
Section 1: Introduction to Web Scraping
Web Scraping Fundamentals
Introduction to web scraping
Understanding web development and technologies
HTTP
HTML 
HTML elements and attributes
Global attributes
XML
JavaScript
JSON
CSS
AngularJS
Data finding techniques for the web
HTML page source
Case 1
Case 2
Developer tools
Sitemaps
The robots.txt file
Summary
Further reading
Section 2: Beginning Web Scraping
Python and the Web – Using urllib and Requests
Technical requirements
Accessing the web with Python
Setting things up
Loading URLs
URL handling and operations with urllib and requests
urllib
requests
Implementing HTTP methods
GET
POST
Summary
Further reading
Using LXML, XPath, and CSS Selectors
Technical requirements
Introduction to XPath and CSS selector
XPath
CSS selectors
Element selectors
ID and class selectors
Attribute selectors
Pseudo selectors
Using web browser developer tools for accessing web content
HTML elements and DOM navigation
XPath and CSS selectors using DevTools
Scraping using lxml, a Python library
lxml by examples
Example 1 – reading XML from file and traversing through its elements
Example 2 – reading HTML documents using lxml.html
Example 3 – reading and parsing HTML for retrieving HTML form type element attributes
Web scraping using lxml
Example 1 – extracting selected data from a single page using lxml.html.xpath
Example 2 – looping with XPath and scraping data from multiple pages
Example 3 – using lxml.cssselect to scrape content from a page
Summary
Further reading
Scraping Using pyquery – a Python Library
Technical requirements
Introduction to pyquery
Exploring pyquery
Loading documents
Element traversing, attributes, and pseudo-classes
Iterating
Web scraping using pyquery
Example 1 – scraping data science announcements
Example 2 – scraping information from nested links
Example 3 – extracting AHL Playoff results
Example 4 – collecting URLs from sitemap.xml
Case 1 – using the HTML parser
Case 2 – using the XML parser
Summary
Further reading
Web Scraping Using Scrapy and Beautiful Soup
Technical requirements
Web scraping using Beautiful Soup
Introduction to Beautiful Soup
Exploring Beautiful Soup
Searching, traversing, and iterating
Using children and parents
Using next and previous
Using CSS Selectors
Example 1 – listing <li> elements with the data-id attribute 
Example 2 – traversing through elements
Example 3 – searching elements based on attribute values
Building a web crawler
Web scraping using Scrapy
Introduction to Scrapy
Setting up a project
Generating a Spider
Creating an item
Extracting data
Using XPath
Using CSS Selectors
Data from multiple pages
Running and exporting
Deploying a web crawler
Summary
Further reading
Section 3: Advanced Concepts
Working with Secure Web
Technical requirements
Introduction to secure web
Form processing
Cookies and sessions
Cookies
Sessions
User authentication
HTML <form> processing
Handling user authentication
Working with cookies and sessions
Summary
Further reading
Data Extraction Using Web-Based APIs
Technical requirements
Introduction to web APIs
REST and SOAP
REST 
SOAP 
Benefits of web APIs
Accessing web API and data formats
Making requests to the web API using a web browser
Case 1 – accessing a simple API (request and response)
Case 2 – demonstrating status codes and informative responses from the API
Case 3 – demonstrating RESTful API cache functionality
Web scraping using APIs
Example 1 – searching and collecting university names and URLs
Example 2 – scraping information from GitHub events
Summary
Further reading
Using Selenium to Scrape the Web
Technical requirements
Introduction to Selenium
Selenium projects
Selenium WebDriver
Selenium RC
Selenium Grid
Selenium IDE
Setting things up
Exploring Selenium
Accessing browser properties
Locating web elements
Using Selenium for web scraping
Example 1 – scraping product information
Example 2 – scraping book information
Summary
Further reading
Using Regex to Extract Data
Technical requirements
Overview of regular expressions
Regular expressions and Python
Using regular expressions to extract data
Example 1 – extracting HTML-based content
Example 2 – extracting dealer locations
Example 3 – extracting XML content
Summary
Further reading
Section 4: Conclusion
Next Steps
Technical requirements
Managing scraped data
Writing to files
Analysis and visualization using pandas and matplotlib
Machine learning 
ML and AI
Python and ML
Types of ML algorithms
Supervised learning
Classification
Regression
Unsupervised learning
Association
Clustering
Reinforcement learning
Data mining 
Tasks of data mining
Predictive
Classification
Regression
Prediction 
Descriptive
Clustering
Summarization
Association rules
What's next?
Summary 
Further reading
Other Books You May Enjoy
Leave a review - let other readers know what you think
Web scraping is an essential technique used in many organizations to gather valuable data from web pages. Web scraping, or web harvesting, is the process of extracting and collecting data from websites. It comes in handy for model development, where data needs to be collected on the fly, and for cases that call for current, topic-relevant data, where short-term accuracy matters more than a prepackaged dataset. Collected data is stored in files such as JSON, CSV, and XML, written to a database for later use, or made available online as a dataset. This book will open the gates for you in terms of delving deep into web scraping techniques and methodologies using Python libraries and other popular tools, such as Selenium. By the end of this book, you will have learned how to efficiently scrape different websites.
This book is intended for Python programmers, data analysts, web scraping newbies, and anyone who wants to learn how to perform web scraping from scratch. If you want to begin your journey in applying web scraping techniques to a range of web pages, then this book is what you need!
Chapter 1, Web Scraping Fundamentals, explores some core technologies and tools that are relevant to WWW and that are required for web scraping.
Chapter 2, Python and the Web – Using urllib and Requests, demonstrates some of the core features available through Python libraries such as requests and urllib, in addition to exploring page contents in various formats and structures.
Chapter 3, Using LXML, XPath, and CSS Selectors, describes various examples using LXML, implementing a variety of techniques and library features to deal with elements and ElementTree.
Chapter 4, Scraping Using pyquery – a Python Library, goes into more detail regarding web scraping techniques and a number of new Python libraries that deploy these techniques.
Chapter 5, Web Scraping Using Scrapy and Beautiful Soup, examines various aspects of traversing web documents using Beautiful Soup, while also exploring a framework that was built for crawling activities using spiders, in other words, Scrapy.
Chapter 6, Working with Secure Web, covers a number of basic security-related measures and techniques that are often encountered and that pose a challenge to web scraping.
Chapter 7, Data Extraction Using Web-Based APIs, covers the Python programming language and how to interact with the web APIs with regard to data extraction.
Chapter 8, Using Selenium to Scrape the Web, covers Selenium and how to use it to scrape data from the web.
Chapter 9, Using Regex to Extract Data, goes into more detail regarding web scraping techniques using regular expressions.
Chapter 10, Next Steps, introduces and examines basic concepts regarding data management using files, and analysis and visualization using pandas and matplotlib, while also providing an introduction to machine learning and data mining and exploring a number of related resources that can be helpful in terms of further learning and career development.
Readers should have some working knowledge of the Python programming language.
You can download the example code files for this book from your account at www.packt.com. If you purchased this book elsewhere, you can visit www.packt.com/support and register to have the files emailed directly to you.
You can download the code files by following these steps:
1. Log in or register at www.packt.com.
2. Select the SUPPORT tab.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box and follow the onscreen instructions.
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
WinRAR/7-Zip for Windows
Zipeg/iZip/UnRarX for Mac
7-Zip/PeaZip for Linux
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Hands-On-Web-Scraping-with-Python. In case there's an update to the code, it will be updated on the existing GitHub repository.
We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: https://www.packtpub.com/sites/default/files/downloads/9781789533392_ColorImages.pdf.
There are a number of text conventions used throughout this book.
CodeInText: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "The <p> and <h1> HTML elements contain general text information (element content) with them."
A block of code is set as follows:
import requests
link = "http://localhost:8080/~cache"
queries = {'id':'123456', 'display':'yes'}
addedheaders = {'user-agent':''}
When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:
import requests
link = "http://localhost:8080/~cache"
queries = {'id':'123456', 'display':'yes'}
addedheaders = {'user-agent':''}
Any command-line input or output is written as follows:
C:\> pip --version
pip 18.1 from c:\python37\lib\site-packages\pip (python 3.7)
Bold: Indicates a new term, an important word, or words that you see on screen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: "If accessing Developer tools through the Chrome menu, click More tools | Developer tools."
Feedback from our readers is always welcome.
General feedback: If you have questions about any aspect of this book, mention the book title in the subject of your message and email us at [email protected].
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packt.com/submit-errata, select your book, click on the Errata Submission Form link, and enter the details.
Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in, and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.
Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!
For more information about Packt, please visit packt.com.
In this section, you will be given an overview of web scraping (scraping requirements, the importance of data), web contents (patterns and layouts), Python programming and libraries (the basics and advanced), and data managing techniques (file handling and databases).
This section consists of the following chapter:
Chapter 1, Web Scraping Fundamentals
In this chapter, we will learn about and explore certain fundamental concepts related to web scraping and web-based technologies, assuming that you have no prior experience of web scraping.
So, to start with, let's begin by asking a number of questions:
Why is there a growing need or demand for data?
How are we going to manage and fulfill the requirement for data with resources from the World Wide Web (WWW)?
Web scraping addresses both these questions, as it provides various tools and technologies that can be deployed to extract data or assist with information retrieval. Whether it's web-based structured or unstructured data, we can use the web scraping process to extract data and use it for research, analysis, personal collections, information extraction, knowledge discovery, and many more purposes.
We will learn general techniques that are deployed to find data from the web and explore those techniques in depth using the Python programming language in the chapters ahead.
In this chapter, we will cover the following topics:
Introduction to web scraping
Understanding web development and technologies
Data finding techniques
Scraping is the process of extracting, copying, screening, or collecting data. Scraping or extracting data from the web (commonly known as websites or web pages, or internet-related resources) is normally termed web scraping.
Web scraping is a process of data extraction from the web that is suited to certain requirements. Data collection and analysis, and their role in information gathering, decision making, and research-related activities, make the scraping process significant to all types of industry.
The popularity of the internet and its resources is causing information domains to evolve every day, which is also causing a growing demand for raw data. Data is the basic requirement in the fields of science, technology, and management. Collected or organized data is processed with varying degrees of logic to obtain information and gain further insights.
Web scraping provides the tools and techniques used to collect data from websites as appropriate for either personal or business-related needs, but with a number of legal considerations.
There are a number of legal factors to consider before performing scraping tasks. Most websites contain pages such as Privacy Policy, About Us, and Terms and Conditions, where legal terms, prohibited content policies, and general information are available. It's a developer's ethical duty to follow those policies before planning any crawling and scraping activities from websites.
A web page is not only a document container. Today's rapid developments in computing and web technologies have transformed the web into a dynamic and real-time source of information.
At our end, we (the users) use web browsers (such as Google Chrome, Mozilla Firefox, Internet Explorer, and Safari) to access information from the web. Web browsers provide various document-based functionalities to users and contain application-level features that are often useful to web developers.
Web pages that users view or explore through their browsers are not only single documents. Various technologies exist that can be used to develop websites or web pages. A web page is a document that contains blocks of HTML tags. Most of the time, it is built with various sub-blocks linked as dependent or independent components from various interlinked technologies, including JavaScript and CSS.
An understanding of the general concepts of web pages and the techniques of web development, along with the technologies found inside web pages, will provide more flexibility and control in the scraping process. A lot of the time, a developer can also employ reverse engineering techniques.
Reverse engineering is an activity that involves breaking down and examining the concepts that were required to build certain products. For more information on reverse engineering, please refer to the GlobalSpec article, How Does Reverse Engineering Work?, available at https://insights.globalspec.com/article/7367/how-does-reverse-engineering-work.
Here, we will introduce and explore a few of the techniques that can help and guide us in the process of data extraction.
Hypertext Transfer Protocol (HTTP) is an application protocol that transfers resources such as HTML documents between a client and a web server. HTTP is a stateless protocol that follows the client-server model. Clients (web browsers) and web servers communicate or exchange information using HTTP Requests and HTTP Responses:
With HTTP requests or HTTP methods, a client or browser submits requests to the server. There are various methods (also known as HTTP request methods) for submitting requests, such as GET, POST, and PUT:
GET: This is a common method for requesting information. It is considered a safe method, as the resource state is not altered. It is also used to provide query strings, such as http://www.test-domain.com/?id=123456&display=yes, requesting information from the server based on the id and display parameters sent with the request.
POST: This is used to make a secure request to the server. The requested resource state can be altered. Data posted or sent to the requested URL is not visible in the URL, but rather transferred with the request body. It's used to submit information to the server in a secure way, such as for logins and user registration.
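To see the difference in practice, here is a minimal sketch using the requests library (set up in Chapter 2, Python and the Web – Using urllib and Requests); the URL and parameters are illustrative placeholders taken from the conventions example, not a live endpoint:

import requests

link = "http://localhost:8080/~cache"         # placeholder URL, not a live endpoint
queries = {'id': '123456', 'display': 'yes'}  # becomes the query string ?id=123456&display=yes

# GET: parameters travel in the URL; the resource state is not altered
get_response = requests.get(link, params=queries)
print(get_response.url)  # final URL, including the query string

# POST: data travels in the request body and is not visible in the URL
post_response = requests.post(link, data={'username': 'user', 'password': 'pass'})
print(post_response.status_code)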
Using the browser developer tools shown in the following screenshot, the Request Method can be revealed, along with other HTTP-related information:
We will explore more about HTTP methods in Chapter 2, Python and the Web – Using urllib and Requests, in the Implementing HTTP methods section.
HTTP headers pass additional information to a client or server while performing a request or response. Headers are generally name-value pairs of information transferred between a client and a server during their communication, and are generally grouped into request and response headers:
Request Headers: These are headers that are used for making requests. Information such as language and encoding requests, referrers, cookies, browser-related information, and so on, is provided to the server while making the request. The following screenshot displays the Request Headers obtained from the browser developer tools while making a request to https://www.python.org:
Response Headers: These headers contain information about the server's response. Information regarding the response (including size, type, and date) and the server status is generally found in Response Headers. The following screenshot displays the Response Headers obtained from the browser developer tools after making a request to https://www.python.org:
The information seen in the previous screenshots was captured during the request made to https://www.python.org.
HTTP Requests can also be provided with the required HTTP Headers while making requests to the server. Information related to the request URL, request method, status code, request headers, query string parameters, cookies, POST parameters, and server details can generally be explored using HTTP Headers information.
With HTTP responses, the server processes the requests (and sometimes the specified HTTP headers) that are sent to it. When requests are received and processed, the server returns its response to the browser.
A response contains status codes, the meaning of which can be revealed using developer tools, as seen in the previous screenshots. The following list contains a few status codes along with some brief information:
200 (OK, request succeeded)
404 (Not found; requested resource cannot be found)
500 (Internal server error)
204 (No content to be sent)
401 (Unauthorized request was made to the server)
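As a rough sketch of how status codes and headers surface in code, assuming the requests library and any reachable URL:

import requests

# custom request headers can be supplied, as in the conventions example
response = requests.get('https://www.python.org', headers={'user-agent': ''})

print(response.status_code)               # for example, 200 when the request succeeded
print(response.reason)                    # short text for the status, such as 'OK'
print(response.headers['Content-Type'])   # an individual response header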
HTTP cookies are data sent by the server to the browser. Cookies are data that's generated and stored by websites on your system or computer. Data in cookies helps to identify HTTP requests from the user to the website. Cookies contain information regarding session management, user preferences, and user behavior.
The server identifies and communicates with the browser based on the information stored in the cookie. Data stored in cookies helps a website to access and transfer certain saved values such as session ID, expiration date and time, and so on, providing quick interaction between the web request and the response:
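As a minimal illustration, again assuming the requests library, cookies returned by a server can be inspected, and a Session object resends them automatically, much like a browser does:

import requests

session = requests.Session()  # a Session keeps cookies across requests
response = session.get('https://www.python.org')

print(response.cookies)             # cookies the server set on this response
print(session.cookies.get_dict())   # all cookies stored in the session so far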
With HTTP proxies, a proxy server acts as an intermediate server between a client and the main web server. The web browser sends requests to the server that are actually passed through the proxy, and the proxy returns the response from the server to the client.
Proxies are often used for monitoring/filtering, performance improvement, translation, and security for internet-related resources. Proxies can also be bought as a service, which may also be used to deal with cross-domain resources. There are also various forms of proxy implementation, such as web proxies (which can be used to bypass IP blocking), CGI proxies, and DNS proxies.
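The following is a short sketch of routing a request through a proxy with the requests library; the proxy addresses here are hypothetical placeholders, not working servers:

import requests

# hypothetical proxy addresses - replace with proxies you control or have bought
proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:1080',
}

response = requests.get('https://www.python.org', proxies=proxies)
print(response.status_code)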
Cookie-based parameters that are passed in using GET requests, HTML form-related POST requests, and modifying or adapting headers will be crucial in managing code (that is, scripts) and accessing content during the web scraping process.
Websites are made up of pages or documents containing text, images, style sheets, and scripts, among other things. They are often built with markup languages such as Hypertext Markup Language (HTML) and Extensible Hypertext Markup Language (XHTML).
HTML is often termed the standard markup language used for building a web page. Since the early 1990s, HTML has been used independently, as well as in conjunction with server-based scripting languages such as PHP, ASP, and JSP.
XHTML is an advanced and extended version of HTML, which is the primary markup language for web documents. XHTML is also stricter than HTML, and from the coding perspective, is an XML application.
HTML defines and contains the contents of a web page. Data that can be extracted, and any information-revealing data sources, can be found inside HTML pages within a predefined instruction set or markup elements called tags. HTML tags are normally named placeholders carrying certain predefined attributes.
HTML elements can contain additional information, such as key/value pairs, known as HTML element attributes. Attributes hold values and provide identification, or contain additional information that can be helpful in many aspects of scraping activities, such as identifying exact web elements and extracting values or text from them, and traversing through elements.
There are certain attributes that are common to HTML elements or can be applied to all HTML elements as follows. These attributes are identified as global attributes (https://developer.mozilla.org/en-US/docs/Web/HTML/Global_attributes):
id
class
style
lang
HTML element attributes such as id and class are mostly used to identify or format individual elements, or groups of elements. These attributes can also be managed by CSS and other scripting languages.
id attribute values should be unique to the element they're applied to. class attribute values are mostly used with CSS to apply the same formatting to one or more elements, and can be shared by multiple elements.
Attributes such as id and class are identified by placing # and . respectively in front of the attribute name when used with CSS, traversing, and parsing techniques.
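As a brief sketch of that convention, assuming the lxml library with the cssselect package (both explored in Chapter 3, Using LXML, XPath, and CSS Selectors) and a small hypothetical piece of HTML:

from lxml import html

content = '<p id="intro">Intro</p><p class="note">One</p><p class="note">Two</p>'
tree = html.fromstring(content)

print(tree.cssselect('#intro')[0].text)           # '#' targets the unique id attribute
print([p.text for p in tree.cssselect('.note')])  # '.' targets every element with the class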
As displayed in the following example, itemprop attributes are used to add properties to an element, whereas data-* is used to store data that is native to the element itself:
<div itemscope itemtype="http://schema.org/Place">
    <h1 itemprop="university">University of Helsinki</h1>
    <span>Subject:
        <span itemprop="subject1">Artificial Intelligence</span>
    </span>
    <span itemprop="subject2">Data Science</span>
</div>
<img class="dept" src="logo.png" data-course-id="324" data-title="Predictive Analysis" data-x="12345" data-y="54321" data-z="56743" onclick="schedule.load()"/>
HTML tags and attributes are a major source of data when it comes to extraction.
In the chapters ahead, we will explore these attributes using different tools. We will also perform various logical operations and use them to extract content.
Extensible Markup Language (XML) is a markup language used for distributing data over the internet, with a set of rules for encoding documents that are readable and easily exchangeable between machines and documents.
XML can use textual data across various formats and systems. XML is designed to carry portable data stored in tags that are not predefined, unlike HTML tags. In XML documents, tags are created by the document developer or an automated program to describe the content they are carrying.
The following code displays some example XML content. The <employees> parent node has three <employee> child nodes, which in turn contain the other child nodes <firstName>, <lastName>, and <gender>:
<employees>
    <employee>
        <firstName>Rahul</firstName>
        <lastName>Reddy</lastName>
        <gender>Male</gender>
    </employee>
    <employee>
        <firstName>Aasira</firstName>
        <lastName>Chapagain</lastName>
        <gender>Female</gender>
    </employee>
    <employee>
        <firstName>Peter</firstName>
        <lastName>Lara</lastName>
        <gender>Male</gender>
    </employee>
</employees>
XML is an open standard, using the Unicode character set. XML is used for sharing data across various platforms and has been adopted by various web applications. Many websites use XML data, implementing its contents with the use of scripting languages and presenting it in HTML or other document formats for the end user to view.
Extraction tasks can also be performed on XML documents to obtain the contents in the desired format, or to filter them with respect to a specific need for data. In addition, behind-the-scenes data may be obtainable only from certain websites.
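As a quick sketch of such a task, the <employees> document shown previously can be traversed with Python's built-in xml.etree.ElementTree module (the book itself works with lxml in the chapters ahead):

import xml.etree.ElementTree as ET

xml_content = '''<employees>
    <employee>
        <firstName>Rahul</firstName>
        <lastName>Reddy</lastName>
        <gender>Male</gender>
    </employee>
</employees>'''

root = ET.fromstring(xml_content)        # parse the string into an element tree
for employee in root.iter('employee'):   # visit every <employee> node
    first = employee.findtext('firstName')
    last = employee.findtext('lastName')
    print(first, last)                   # Rahul Reddy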
JavaScript Object Notation (JSON) is a format used for storing and transporting data from a server to a web page. It is language independent and is popular in web-based data-interchange actions due to its size and readability.
JSON data is normally a name/value pair that is evaluated as a JavaScript object and follows JavaScript operations. JSON and XML are often compared, as they both carry and exchange data between various web resources. JSON is also ranked higher than XML for its structure, which is simple, readable, self-descriptive, understandable, and easy to process. For web applications using JavaScript, AJAX, or RESTful services, JSON is preferred over XML due to its fast and easy operation.
JSON and JavaScript objects are interchangeable. JSON is not a markup language, and it doesn't contain any tags or attributes. Instead, it is a text-only format that can be sent to/accessed through a server, as well as being managed by any programming language. JSON objects can also be expressed as arrays, dictionaries, and lists, as seen in the following code:
{"mymembers":[
    {"firstName":"Aasira", "lastName":"Chapagain", "cityName":"Kathmandu"},
    {"firstName":"Rakshya", "lastName":"Dhungel", "cityName":"New Delhi"},
    {"firstName":"Shiba", "lastName":"Paudel", "cityName":"Biratnagar"},
    {"firstName":"Rahul", "lastName":"Reddy", "cityName":"New Delhi"},
    {"firstName":"Peter", "lastName":"Lara", "cityName":"Trinidad"}
]}
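As a minimal sketch, Python's built-in json module turns content like the preceding example into native dictionaries and lists:

import json

raw = '{"mymembers":[{"firstName":"Aasira","lastName":"Chapagain","cityName":"Kathmandu"}]}'
data = json.loads(raw)            # parse the JSON string into a Python dict

for member in data['mymembers']:  # 'mymembers' maps to a list of dicts
    print(member['firstName'], member['cityName'])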
JSON Lines: This is a JSON-like format where each line of a record is a valid JSON value. It is also known as newline-delimited JSON, that is, individual JSON records separated by newline (\n) characters. JSON Lines formatting can be very useful when dealing with a large volume of data.
Data sources in the JSON or JSON Lines formats are preferred to XML because of the easy data pattern and code readability, which can also be managed with minimum programming effort:
{"firstName":"Aasira", "lastName":"Chapagain", "cityName":"Kathmandu"}
{"firstName":"Rakshya", "lastName":"Dhungel", "cityName":"New Delhi"}
{"firstName":"Shiba", "lastName":"Paudel", "cityName":"Biratnagar"}
{"firstName":"Rahul", "lastName":"Reddy", "cityName":"New Delhi"}
{"firstName":"Peter", "lastName":"Lara", "cityName":"Trinidad"}
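Reading such content is a matter of parsing each line independently, again sketched here with the built-in json module:

import json

lines = '''{"firstName":"Aasira", "lastName":"Chapagain", "cityName":"Kathmandu"}
{"firstName":"Rakshya", "lastName":"Dhungel", "cityName":"New Delhi"}'''

for line in lines.splitlines():   # one valid JSON value per line
    record = json.loads(line)
    print(record['firstName'], record['cityName'])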
From the perspective of data extraction, because of the lightweight and simple structure of the JSON format, web pages use JSON content with their scripting technologies to add dynamic features.
The web-based technologies we have introduced so far deal with content, content binding, content development, and processing. Cascading Style Sheets (CSS) describes the display properties of HTML elements and the appearance of web pages. CSS is used for styling and providing the desired appearance and presentation of HTML elements.
Developers/designers can control the layout and presentation of a web document using CSS. CSS can be applied to a distinct element in a page, or it can be embedded through a separate document. Styling details can be described using the <style> tag.
The <style> tag can contain details targeting repeated and various elements in a block. As seen in the following code, multiple <a> elements exist and also possess the class and id global attributes:
<html>
<head>
    <style>
        a {color:blue;}
        h1 {color:black; text-decoration:underline;}
        #idOne {color:red;}
        .classOne {color:orange;}
    </style>
</head>
<body>
    <h1> Welcome to Web Scraping </h1>
    Links:
    <a href="https://www.google.com"> Google </a>
    <a class='classOne' href="https://www.yahoo.com"> Yahoo </a>
    <a id='idOne' href="https://www.wikipedia.org"> Wikipedia </a>
</body>
</html>
Attributes that are provided with CSS properties or have been styled inside <style> tags in the preceding code block will result in the output seen here:
CSS properties can also appear in an in-line structure with each particular element. In-line CSS properties override external CSS styles. The CSS color property has been applied in-line to the following element; this will override the color value defined inside <style>:
<h1 style='color:orange;'> Welcome to Web Scraping </h1> Links:
