Develop large-scale distributed data processing applications using Spark 2 in Scala and Python
If you are an application developer, data scientist, or big data solutions architect who is interested in combining the data processing power of Spark with R, and consolidating data processing, stream processing, machine learning, and graph processing into one unified and highly interoperable framework with a uniform API using Scala or Python, this book is for you.
Spark is one of the most widely-used large-scale data processing engines and runs extremely fast. It is a framework that has tools that are equally useful for application developers as well as data scientists.
This book starts with the fundamentals of Spark 2 and covers the core data processing framework and API, installation, and application development setup. Then the Spark programming model is introduced through real-world examples followed by Spark SQL programming with DataFrames. An introduction to SparkR is covered next. Later, we cover the charting and plotting features of Python in conjunction with Spark data processing. After that, we take a look at Spark's stream processing, machine learning, and graph processing libraries. The last chapter combines all the skills you learned from the preceding chapters to develop a real-world Spark application.
By the end of this book, you will have all the knowledge you need to develop efficient large-scale applications using Apache Spark.
Learn about Spark's infrastructure with this practical tutorial. With the help of real-world use cases covering the main features of Spark, it offers an easy introduction to the framework.
Copyright © 2016 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, nor its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: September 2016
Production reference: 1260916
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.
ISBN 978-1-78588-500-6
www.packtpub.com
Author
Rajanarayanan Thottuvaikkatumana
Copy Editor
Safis Editing
Reviewer
Kornel Skałkowski
Project Coordinator
Devanshi Doshi
Acquisition Editor
Tushar Gupta
Proofreader
Safis Editing
Content Development Editor
Samantha Gonsalves
Indexer
Rekha Nair
Technical Editor
Jayesh Sonawane
Graphics
Jason Monteiro
Production Coordinator
Aparna Bhagat
Rajanarayanan Thottuvaikkatumana, Raj, is a seasoned technologist with more than 23 years of software development experience at various multinational companies. He has lived and worked in India, Singapore, and the USA, and is presently based out of the UK. His experience includes architecting, designing, and developing software applications. He has worked on various technologies including major databases, application development platforms, web technologies, and big data technologies. Since 2000, he has been working mainly in Java-related technologies, and does heavy-duty server-side programming in Java and Scala. He has worked on very highly concurrent, highly distributed, and high-transaction-volume systems. Currently, he is building a next-generation Hadoop YARN-based data processing platform and an application suite built with Spark using Scala.
Raj holds one master's degree in Mathematics, one master's degree in Computer Information Systems and has many certifications in ITIL and cloud computing to his credit. Raj is the author of Cassandra Design Patterns - Second Edition, published by Packt.
When not working on the assignments his day job demands, Raj is an avid listener of classical music and watches a lot of tennis.
Kornel Skałkowski has a solid academic and industrial background. For more than five years, he worked as an assistant at AGH University of Science and Technology in Krakow. In 2015, he obtained his Ph.D. in the subject of machine learning-based adaptation of SOA systems. He has cooperated with several companies on various projects concerning intelligent systems, machine learning and big data. Currently, he works as a big data developer for SAP SE.
He is a co-author of 19 papers concerning software engineering, SOA systems and machine learning. He also works as a reviewer for the American Journal of Software Engineering and Applications. He has participated in numerous European and national scientific projects. His research interests include machine learning, big data and software engineering.
He is the author of the book Data Lake Development for Big Data.
I would like to kindly thank my family, my relatives, and my friends for their endless patience and support while I was reviewing this book. I would also like to express my special gratitude to my girlfriend, Ania, for her understanding about the time together that we missed.
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.
https://www.packtpub.com/mapt
Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.
Why subscribe?
Dedicating this book to the countless volunteers who worked tirelessly to build high production-quality open source software products. Without them I wouldn't have written this book.
The data processing framework named Spark was first built to prove that, by reusing data sets across a number of iterations, it provided value where Hadoop MapReduce jobs performed poorly. The research paper Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center talks about the philosophy behind the design of Spark. A very simplistic reference implementation built by University of California, Berkeley researchers to test Mesos grew far beyond that to become a full-blown data processing framework, and later one of the most active Apache projects. It is designed from the ground up to do distributed data processing on clusters managed by Hadoop or Mesos, as well as in standalone mode. Spark is a JVM-based data processing framework and hence works on most operating systems that support JVM-based applications. Spark is widely installed on UNIX and Mac OS X platforms, and Windows adoption is increasing.
Spark provides a unified programming model using the programming languages Scala, Java, Python and R. In other words, irrespective of the language used to program Spark applications, the API remains almost the same in all the languages. In this way, organizations can adopt Spark and develop applications in their programming language of choice. This also enables fast porting of Spark applications from one language to another without much effort, if there is a need. Most of Spark is developed using Scala and because of that the Spark programming model inherently supports functional programming principles. The most basic Spark data abstraction is the resilient distributed data set (RDD), based on which all the other libraries are built. The RDD-based Spark programming model is the lowest level where developers can build data processing applications.
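For a flavor of that lowest-level model, the following is a minimal sketch using the Python API, assuming a local Spark 2.x installation with PySpark available; the application name and the sample numbers are purely illustrative:

from pyspark.sql import SparkSession

# Create a SparkSession (the Spark 2 entry point) and get the SparkContext from it.
spark = SparkSession.builder.appName("RDDBasics").master("local[*]").getOrCreate()
sc = spark.sparkContext

# Build an RDD from a local collection and apply transformations lazily;
# nothing is computed until an action such as collect() is invoked.
numbers = sc.parallelize([1, 2, 3, 4, 5])
evenSquares = numbers.map(lambda n: n * n).filter(lambda n: n % 2 == 0)
print(evenSquares.collect())  # [4, 16]

spark.stop()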
Spark has grown fast to cater to the needs of more data processing use cases. As such forward-looking steps were taken on the product road map, a requirement emerged to make the programming more high level for business users. The Spark SQL library on top of Spark Core, with its DataFrame abstraction, was built to cater to the needs of the huge population of developers who are very conversant with the ubiquitous SQL.
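The following is a brief sketch of that higher-level abstraction, again in Python and with made-up sample data and column names, showing the same aggregation expressed through the DataFrame API and through plain SQL:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DataFrameBasics").master("local[*]").getOrCreate()

# A small DataFrame created from an in-memory collection; column names are illustrative.
sales = spark.createDataFrame(
    [("books", 120.0), ("music", 45.5), ("books", 80.0)],
    ["category", "amount"])

# The aggregation expressed through the DataFrame API ...
sales.groupBy("category").sum("amount").show()

# ... and the same aggregation expressed in SQL against a temporary view.
sales.createOrReplaceTempView("sales")
spark.sql("SELECT category, SUM(amount) AS total FROM sales GROUP BY category").show()

spark.stop()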
Data scientists use R for their computation needs. The biggest limitation of R is that all the data to be processed must fit into the main memory of the computer on which the R program is running. The R API for Spark introduced data scientists to the world of distributed data processing using their familiar data frame abstraction. In other words, using the R API for Spark, the processing of data can be done in parallel on Hadoop or Mesos, growing far beyond the limitation of the resident memory of the host computer.
In the present era of large-scale applications that collect data, the velocity of the data that is ingested is very high. Many application use cases mandate real-time processing of the data that is streamed. The Spark Streaming library, built on top of Spark Core, does exactly that.
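As an indication of how that looks in code, here is a minimal DStream sketch in Python; it assumes text being served on localhost port 9999 (for example, with nc -lk 9999), and the host, port, and batch interval are only illustrative:

from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext

# At least two local threads are needed: one to receive the stream, one to process it.
spark = SparkSession.builder.appName("StreamingBasics").master("local[2]").getOrCreate()

# Micro-batches of 5 seconds; each batch of lines is processed as it arrives.
ssc = StreamingContext(spark.sparkContext, 5)
lines = ssc.socketTextStream("localhost", 9999)

# Count the words seen in every batch and print the counts.
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()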
The data at rest or the data that is streamed is fed to machine learning algorithms to train data models and use them to provide answers to business questions. All the machine learning frameworks created before Spark had many limitations in terms of the memory of the processing computer, the inability to do parallel processing, repeated read-write cycles, and so on. Spark doesn't have any of these limitations, and hence the Spark MLlib machine learning library, built on top of Spark Core and Spark DataFrames, turned out to be a best-of-breed machine learning library that glues together data processing pipelines and machine learning activities.
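The following is a small sketch, in Python, of such a pipeline gluing feature preparation and model training together; the feature values, column names, and parameters are made up for illustration:

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("MLlibBasics").master("local[*]").getOrCreate()

# A tiny training DataFrame with two numeric features and a binary label.
training = spark.createDataFrame(
    [(1.0, 2.0, 0.0), (2.0, 1.0, 0.0), (8.0, 9.0, 1.0), (9.0, 8.0, 1.0)],
    ["f1", "f2", "label"])

# Assemble the feature columns into a vector and train a logistic regression model;
# the Pipeline chains the data preparation and the machine learning stages.
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
lr = LogisticRegression(maxIter=10)
model = Pipeline(stages=[assembler, lr]).fit(training)

model.transform(training).select("features", "label", "prediction").show()
spark.stop()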
A graph is a very useful data structure used heavily in some special use cases. The algorithms used to process the data in a graph data structure are computationally intensive. Before Spark, many graph processing frameworks came along, and some of them were really fast at processing, but pre-processing the data needed to produce the graph data structure turned out to be a big bottleneck in most of these graph processing applications. The Spark GraphX library, built on top of Spark, filled this gap by making data processing and graph processing possible as chained activities.
In the past, many data processing frameworks existed, and many of them were proprietary, forcing organizations into the trap of vendor lock-in. Spark provided a very viable alternative for a wide variety of data processing needs with no licensing cost; at the same time, it was backed by many leading companies providing professional production support.
Chapter 1, Spark Fundamentals, discusses the fundamentals of Spark as a framework, with its APIs and the libraries that come with it, along with the whole data processing ecosystem Spark interacts with.
Chapter 2, Spark Programming Model, discusses the uniform programming model, based on the tenets of functional programming methodology, that is used in Spark, and covers the fundamentals of resilient distributed data sets (RDD), Spark transformations, and Spark actions.
Chapter 3, Spark SQL, discusses Spark SQL, which is one of the most powerful Spark libraries, used to manipulate data using the ubiquitous SQL constructs in conjunction with the Spark DataFrame API, and how it works with Spark programs. This chapter also discusses how Spark SQL is used to access data from various data sources, enabling the unification of diverse data sources for data processing.
Chapter 4, Spark Programming with R, discusses SparkR or R on Spark, which is the R API for Spark; this enables R users to make use of the data processing capabilities of Spark using their familiar data frame abstraction. It gives a very good foundation for R users to get acquainted with the Spark data processing ecosystem.
Chapter 5, Spark Data Analysis with Python, discusses the use of Spark to do data processing and Python to do data analysis, using a wide variety of charting and plotting libraries available for Python. This chapter discusses combining these two related activities together as a Spark application with Python as the programming language of choice.
Chapter 6, Spark Stream Processing, discusses Spark Streaming, which is one of the most powerful Spark libraries to capture and process data that is ingested as a stream. Kafka as the distributed message broker and a Spark Streaming application as the consumer are also discussed.
Chapter 7, Spark Machine Learning, discusses Spark MLlib, which is one of the most powerful Spark libraries, used to develop machine learning applications; it is covered at an introductory level.
Chapter 8, Spark Graph Processing, discusses Spark GraphX, which is one of the most powerful Spark libraries to process graph data structures, and comes with lots of algorithms to process data in graphs. This chapter covers the basics of GraphX and some use cases implemented using the algorithms provided by GraphX.
Chapter 9, Designing Spark Applications, discusses the design and development of a Spark data processing application, covering various features of Spark that were covered in the previous chapters of this book.
Spark 2.0.0 or above is to be installed on at least a standalone machine to run the code samples and do further activities to learn more about the subject. For Chapter 6, Spark Stream Processing, Kafka needs to be installed and configured as a message broker with its command line producer producing messages and the application developed using Spark as a consumer of those messages.
If you are an application developer, data scientist, or big data solutions architect who is interested in combining the data processing power of Spark with R, and consolidating data processing, stream processing, machine learning, and graph processing into one unified and highly interoperable framework with a uniform API using Scala or Python, this book is for you.
In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.
Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "It is a good idea to customize this property spark.driver.memory to have a higher value."
A block of code is set as follows:
Python 3.5.0 (v3.5.0:374f501f4567, Sep 12 2015, 11:00:19)
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Any command-line input or output is written as follows:
$ python
Python 3.5.0 (v3.5.0:374f501f4567, Sep 12 2015, 11:00:19)
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>
New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: "The shortcuts in this book are based on the Mac OS X 10.5+ scheme."
Warnings or important notes appear in a box like this.
Tips and tricks appear like this.
Feedback from our readers is always welcome. Let us know what you think about this book-what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of. To send us general feedback, simply e-mail [email protected], and mention the book's title in the subject of your message. If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.
Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.
You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
You can download the code files by following these steps:
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Apache-Spark-2-for-Beginners. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from http://www.packtpub.com/sites/default/files/downloads/ApacheSpark2forBeginners_ColorImages.pdf.
Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books-maybe a mistake in the text or the code-we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.
To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.
Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.
Please contact us at [email protected] with a link to the suspected pirated material.
We appreciate your help in protecting our authors and our ability to bring you valuable content.
If you have a problem with any aspect of this book, you can contact us at [email protected], and we will do our best to address the problem.
Data is one of the most important assets of any organization. The scale at which data is being collected and used in organizations is growing beyond imagination. The speed at which data is being ingested, the variety of the data types in use, and the amount of data that is being processed and stored are breaking all-time records every moment. It is very common these days, even in small-scale organizations, that data is growing from gigabytes to terabytes to petabytes. For the same reason, the processing needs are also growing, demanding the capability to process data at rest as well as data on the move.
Take any organization; its success depends on the decisions made by its leaders, and making sound decisions needs the backing of good data and of the information generated by processing that data. This poses a big challenge: how to process the data in a timely and cost-effective manner so that the right decisions can be made. Data processing techniques have evolved since the early days of computers. Countless data processing products and frameworks came into the market and disappeared over the years. Most of these data processing products and frameworks were not general purpose in nature. Most organizations relied on their own bespoke applications for their data processing needs, in a siloed way, or in conjunction with specific products.
Large-scale Internet applications, popularly known as Internet of Things (IoT) applications, heralded the common need for open frameworks to process huge amounts of data of various types, ingested at great speed. Large-scale websites, media streaming applications, and the huge batch processing needs of organizations made the need even more relevant. The open source community has also grown considerably along with the growth of the Internet, delivering production-quality software supported by reputed software companies. A huge number of companies started using open source software and deploying it in their production environments.
From a technological perspective, the data processing needs were facing huge challenges. The amount of data started overflowing from single machines to clusters of huge numbers of machines. The processing power of a single CPU plateaued, and modern computers started combining many of them to get more processing power; these came to be known as multi-core computers. Applications were not designed and developed to make use of all the processors in a multi-core computer, and so they wasted a lot of the processing power available in a typical modern computer.
Throughout this book, the terms node, host, and machine refer to a computer that is running in a standalone mode or in a cluster.
In this context, what are the qualities an ideal data processing framework should possess?
There are two open source data processing frameworks worth mentioning that satisfy all these requirements: the first is Apache Hadoop and the second is Apache Spark.
We will cover the following topics in this chapter:
Apache Hadoop is an open source software framework designed from the ground up to do distributed data storage on a cluster of computers, and to do distributed processing of the data that is spread across that cluster. This framework comes with a distributed filesystem for data storage, namely Hadoop Distributed File System (HDFS), and a data processing framework, namely MapReduce. The creation of HDFS was inspired by the Google research paper The Google File System, and MapReduce is based on the Google research paper MapReduce: Simplified Data Processing on Large Clusters.
Hadoop was adopted by organizations in a really big way, with huge Hadoop clusters implemented for data processing. It saw tremendous growth from Hadoop MapReduce version 1 (MRv1) to Hadoop MapReduce version 2 (MRv2). From a pure data processing perspective, MRv1 consisted of HDFS and MapReduce as the core components. Many applications, generally called SQL-on-Hadoop applications, such as Hive and Pig, were stacked on top of the MapReduce framework. Even though these types of applications are separate Apache projects, it is very common to see many of them providing great value as a suite.
The Yet Another Resource Negotiator (YARN) project came to the fore to let computing frameworks other than the MapReduce type run on the Hadoop ecosystem. With the introduction of YARN, sitting on top of HDFS and below MapReduce from a component architecture layering perspective, users could write their own applications that run on YARN and HDFS to make use of the distributed data storage and data processing capabilities of the Hadoop ecosystem. In other words, the newly overhauled MapReduce version 2 (MRv2) became one of the application frameworks sitting on top of HDFS and YARN.
Figure 1 gives a brief idea about these components and how they are stacked together:
Figure 1
MapReduce is a generic data processing model. The data processing goes through two steps, namely a map step and a reduce step. In the first step, the input data is divided into a number of smaller parts so that each one of them can be processed independently. Once the map step is completed, its output is consolidated and the final result is generated in the reduce step. In a typical word count example, the creation of key-value pairs with each word as the key and the value 1 is the map step. The sorting of these pairs on the key and the summing of the values of pairs with the same key fall into an intermediate combine step. Producing the pairs containing unique words and their occurrence counts is the reduce step.
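To make the two steps concrete, the following is a plain Python sketch of the word count example; it only mirrors the roles of the map, combine, and reduce steps and is not Hadoop code, and the sample lines are made up:

from collections import defaultdict

def map_step(lines):
    # Map step: emit a (word, 1) pair for every word in the input.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_step(pairs):
    # Combine/reduce step: group the pairs by key and sum the values
    # to produce the occurrence count of every unique word.
    counts = defaultdict(int)
    for word, one in pairs:
        counts[word] += one
    return dict(counts)

lines = ["spark makes data processing simple", "data processing at scale"]
print(reduce_step(map_step(lines)))
# {'spark': 1, 'makes': 1, 'data': 2, 'processing': 2, 'simple': 1, 'at': 1, 'scale': 1}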
From an application programming perspective, the basic ingredients for an over-simplified MapReduce application are as follows:
The MapReduce job is submitted for running in Hadoop and once the job is completed, the output can be taken from the output location specified.
This two-step process of dividing a MapReduce data processing job into map and reduce tasks was highly effective and turned out to be a perfect fit for many batch data processing use cases. There are a lot of Input/Output (I/O) operations with the disk happening under the hood during the whole process. Even in the intermediate steps of a MapReduce job, when the internal data structures fill up with data or the tasks complete beyond a certain percentage, writing to the disk happens. Because of this, the subsequent steps in MapReduce jobs have to read from the disk.
Then the other big challenge comes when multiple MapReduce jobs have to be completed in a chained fashion. In other words, a big data processing task may be accomplished by two MapReduce jobs in such a way that the output of the first MapReduce job is the input of the second. In this situation, whatever the size of the output of the first MapReduce job, it has to be written to the disk before the second MapReduce job can use it as its input. So in this simple case, there is a definite and unnecessary write operation.
In many batch data processing use cases, these I/O operations are not a big issue. As long as the results are reliable, latency is tolerated in many batch data processing use cases. But the biggest challenge comes when doing real-time data processing. The huge amount of I/O involved in MapReduce jobs makes them unsuitable for real-time data processing with the lowest possible latency.
