Advanced analytics on your Big Data with the latest Apache Spark 2.x
If you are a developer with some experience with Spark and want to strengthen your knowledge of how to get around in the world of Spark, then this book is ideal for you. Basic knowledge of Linux, Hadoop and Spark is assumed. Reasonable knowledge of Scala is expected.
Apache Spark is an in-memory cluster-based parallel processing system that provides a wide range of functionalities such as graph processing, machine learning, stream processing, and SQL. This book aims to take your knowledge of Spark to the next level by teaching you how to expand Spark's functionality and implement your data flows and machine/deep learning programs on top of the platform.
The book commences with an overview of the Spark ecosystem. It will introduce you to Project Tungsten and Catalyst, two of the major advancements of Apache Spark 2.x.
You will understand how memory management and binary processing, cache-aware computation, and code generation are used to speed things up dramatically. The book extends to show how to incorporate H2O, SystemML, and Deeplearning4j for machine learning, and Jupyter Notebooks and Kubernetes/Docker for cloud-based Spark. During the course of the book, you will learn about the latest enhancements to Apache Spark 2.x, such as interactive querying of live data and unifying DataFrames and Datasets.
You will also learn about the updates on the APIs and how DataFrames and Datasets affect SQL, machine learning, graph processing, and streaming. You will learn to use Spark as a big data operating system, understand how to implement advanced analytics on the new APIs, and explore how easy it is to use Spark in day-to-day tasks.
This book is an extensive guide to Apache Spark modules and tools and shows how Spark's functionality can be extended for real-time processing and storage with worked examples.
Page count: 338
Year of publication: 2017
Copyright © 2017 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews. Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, nor its dealers and distributors, will be held liable for any damages caused or alleged to be caused directly or indirectly by this book. Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: September 2015
Second Edition: July 2017
Production reference: 1190717
ISBN 978-1-78646-274-9
www.packtpub.com
Author: Romeo Kienzler
Copy Editor: Tasneem Fatehi
Reviewer: Md. Rezaul Karim
Project Coordinator: Manthan Patel
Commissioning Editor: Amey Varangaonkar
Proofreader: Safis Editing
Acquisition Editor: Malaika Monteiro
Indexer: Tejal Daruwale Soni
Content Development Editor: Tejas Limkar
Graphics: Tania Dutta
Technical Editor: Dinesh Chaudhary
Production Coordinator: Deepika Naik
Romeo Kienzler works as the chief data scientist in the IBM Watson IoT worldwide team, helping clients to apply advanced machine learning at scale on their IoT sensor data. He holds a Master's degree in computer science from the Swiss Federal Institute of Technology, Zurich, with a specialization in information systems, bioinformatics, and applied statistics. His current research focus is on scalable machine learning on Apache Spark. He is a contributor to various open source projects and works as an associate professor for artificial intelligence at Swiss University of Applied Sciences, Berne. He is a member of the IBM Technical Expert Council and the IBM Academy of Technology, IBM's leading brains trust.
Md. Rezaul Karim is a research scientist at Fraunhofer Institute for Applied Information Technology FIT, Germany. He is also a PhD candidate at the RWTH Aachen University, Aachen, Germany. He holds a BSc and an MSc degree in computer science. Before joining the Fraunhofer-FIT, he worked as a researcher at Insight Centre for Data Analytics, Ireland. Prior to that, he worked as a lead engineer with Samsung Electronics' distributed R&D Institutes in Korea, India, Vietnam, Turkey, and Bangladesh. Previously, he worked as a research assistant in the Database Lab at Kyung Hee University, Korea. He also worked as an R&D engineer with BMTech21 Worldwide, Korea. Prior to that, he worked as a software engineer with i2SoftTechnology, Dhaka, Bangladesh.
He has more than 8 years' experience in the area of research and development with a solid knowledge of algorithms and data structures in C/C++, Java, Scala, R, and Python, focusing on big data technologies (such as Spark, Kafka, DC/OS, Docker, Mesos, Zeppelin, Hadoop, and MapReduce) and deep learning technologies such as TensorFlow, DeepLearning4j, and H2O Sparkling Water. His research interests include machine learning, deep learning, semantic web/linked data, big data, and bioinformatics. He is the author of the following books with Packt Publishing:
Large-Scale Machine Learning with Spark
Deep Learning with TensorFlow
Scala and Spark for Big Data Analytics
For support files and downloads related to your book, please visit www.PacktPub.com.
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com, and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.
https://www.packtpub.com/mapt
Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.
Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser
Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial process. To help us improve, please leave us an honest review on this book's Amazon page at https://www.amazon.com/dp/1786462745.
If you'd like to join our team of regular reviewers, you can e-mail us at [email protected]. We award our regular reviewers with free eBooks and videos in exchange for their valuable feedback. Help us be relentless in improving our products!
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Downloading the color images of this book
Errata
Piracy
Questions
A First Taste and What’s New in Apache Spark V2
Spark machine learning
Spark Streaming
Spark SQL
Spark graph processing
Extended ecosystem
What's new in Apache Spark V2?
Cluster design
Cluster management
Local
Standalone
Apache YARN
Apache Mesos
Cloud-based deployments
Performance
The cluster structure
Hadoop Distributed File System
Data locality
Memory
Coding
Cloud
Summary
Apache Spark SQL
The SparkSession--your gateway to structured data processing
Importing and saving data
Processing the text files
Processing JSON files
Processing the Parquet files
Understanding the DataSource API
Implicit schema discovery
Predicate push-down on smart data sources
DataFrames
Using SQL
Defining schemas manually
Using SQL subqueries
Applying SQL table joins
Using Datasets
The Dataset API in action
User-defined functions
RDDs versus DataFrames versus Datasets
Summary
The Catalyst Optimizer
Understanding the workings of the Catalyst Optimizer
Managing temporary views with the catalog API
The SQL abstract syntax tree
How to go from Unresolved Logical Execution Plan to Resolved Logical Execution Plan
Internal class and object representations of LEPs
How to optimize the Resolved Logical Execution Plan
Physical Execution Plan generation and selection
Code generation
Practical examples
Using the explain method to obtain the PEP
How smart data sources work internally
Summary
Project Tungsten
Memory management beyond the Java Virtual Machine Garbage Collector
Understanding the UnsafeRow object
The null bit set region
The fixed length values region
The variable length values region
Understanding the BytesToBytesMap
A practical example on memory usage and performance
Cache-friendly layout of data in memory
Cache eviction strategies and pre-fetching
Code generation
Understanding columnar storage
Understanding whole stage code generation
A practical example on whole stage code generation performance
Operator fusing versus the volcano iterator model
Summary
Apache Spark Streaming
Overview
Errors and recovery
Checkpointing
Streaming sources
TCP stream
File streams
Flume
Kafka
Summary
Structured Streaming
The concept of continuous applications
True unification - same code, same engine
Windowing
How streaming engines use windowing
How Apache Spark improves windowing
Increased performance with good old friends
How transparent fault tolerance and exactly-once delivery guarantee is achieved
Replayable sources can replay streams from a given offset
Idempotent sinks prevent data duplication
State versioning guarantees consistent results after reruns
Example - connection to an MQTT message broker
Controlling continuous applications
More on stream life cycle management
Summary
Apache Spark MLlib
Architecture
The development environment
Classification with Naive Bayes
Theory on Classification
Naive Bayes in practice
Clustering with K-Means
Theory on Clustering
K-Means in practice
Artificial neural networks
ANN in practice
Summary
Apache SparkML
What does the new API look like?
The concept of pipelines
Transformers
String indexer
OneHotEncoder
VectorAssembler
Pipelines
Estimators
RandomForestClassifier
Model evaluation
CrossValidation and hyperparameter tuning
CrossValidation
Hyperparameter tuning
Winning a Kaggle competition with Apache SparkML
Data preparation
Feature engineering
Testing the feature engineering pipeline
Training the machine learning model
Model evaluation
CrossValidation and hyperparameter tuning
Using the evaluator to assess the quality of the cross-validated and tuned model
Summary
Apache SystemML
Why do we need just another library?
Why on Apache Spark?
The history of Apache SystemML
A cost-based optimizer for machine learning algorithms
An example - alternating least squares
Apache SystemML architecture
Language parsing
High-level operators are generated
How low-level operators are optimized
Performance measurements
Apache SystemML in action
Summary
Deep Learning on Apache Spark with DeepLearning4j and H2O
H2O
Overview
The build environment
Architecture
Sourcing the data
Data quality
Performance tuning
Deep Learning
Example code – income
The example code – MNIST
H2O Flow
Deeplearning4j
ND4J - high performance linear algebra for the JVM
Deeplearning4j
Example: an IoT real-time anomaly detector
Mastering chaos: the Lorenz attractor model
Deploying the test data generator
Deploying the Node-RED IoT Starter Boilerplate to the IBM Cloud
Deploying the test data generator flow
Testing the test data generator
Installing the Deeplearning4j example within Eclipse
Running the examples in Eclipse
Running the examples in Apache Spark
Summary
Apache Spark GraphX
Overview
Graph analytics/processing with GraphX
The raw data
Creating a graph
Example 1 – counting
Example 2 – filtering
Example 3 – PageRank
Example 4 – triangle counting
Example 5 – connected components
Summary
Apache Spark GraphFrames
Architecture
Graph-relational translation
Materialized views
Join elimination
Join reordering
Examples
Example 1 – counting
Example 2 – filtering
Example 3 – PageRank
Example 4 – triangle counting
Example 5 – connected components
Summary
Apache Spark with Jupyter Notebooks on IBM DataScience Experience
Why notebooks are the new standard
Learning by example
The IEEE PHM 2012 data challenge bearing dataset
ETL with Scala
Interactive, exploratory analysis using Python and Pixiedust
Real data science work with SparkR
Summary
Apache Spark on Kubernetes
Bare metal, virtual machines, and containers
Containerization
Namespaces
Control groups
Linux containers
Understanding the core concepts of Docker
Understanding Kubernetes
Using Kubernetes for provisioning containerized Spark applications
Example--Apache Spark on Kubernetes
Prerequisites
Deploying the Apache Spark master
Deploying the Apache Spark workers
Deploying the Zeppelin notebooks
Summary
Apache Spark is an in-memory, cluster-based, parallel processing system that provides a wide range of functionality such as graph processing, machine learning, stream processing, and SQL. This book aims to take your knowledge of Spark to the next level by teaching you how to expand Spark's functionality. The book opens with an overview of the Spark ecosystem and introduces you to Project Catalyst and Tungsten. You will understand how memory management and binary processing, cache-aware computation, and code generation are used to speed things up dramatically. The book goes on to show how to incorporate H2O and Deeplearning4j for machine learning, and Jupyter Notebooks, Zeppelin, Docker, and Kubernetes for cloud-based Spark. During the course of the book, you will also learn about the latest enhancements in Apache Spark 2.2, such as using the DataFrame and Dataset APIs exclusively, building advanced, fully automated machine learning pipelines with SparkML, and performing graph analysis using the new GraphFrames API.
Chapter 1, A First Taste and What's New in Apache Spark V2, provides an overview of Apache Spark, the functionality that is available within its modules, and how it can be extended. It covers the tools available in the Apache Spark ecosystem outside the standard Apache Spark modules for processing and storage. It also provides tips on performance tuning.
Chapter 2, Apache Spark SQL, creates a schema in Spark SQL, shows how data can be queried efficiently using the relational API on DataFrames and Datasets, and explores SQL.
Chapter 3, The Catalyst Optimizer, explains what a cost-based optimizer in database systems is and why it is necessary. You will master the features and limitations of the Catalyst Optimizer in Apache Spark.
Chapter 4, Project Tungsten, explains why Project Tungsten is essential for Apache Spark and also goes on to explain how Memory Management, Cache-aware Computation, and Code Generation are used to speed things up dramatically.
Chapter 5, Apache Spark Streaming, talks about continuous applications using Apache Spark streaming. You will learn how to incrementally process data and create actionable insights.
Chapter 6, Structured Streaming, talks about Structured Streaming – a new way of defining continuous applications using the DataFrame and Dataset APIs.
Chapter 7, Apache Spark MLlib, introduces you to classical MLlib, the de facto standard for machine learning when using Apache Spark.
Chapter 8, Apache SparkML, introduces you to the DataFrame-based machine learning library of Apache Spark: the new first-class citizen when it comes to high performance and massively parallel machine learning.
Chapter 9, Apache SystemML, introduces you to Apache SystemML, another machine learning library capable of running on top of Apache Spark and incorporating advanced features such as a cost-based optimizer, hybrid execution plans, and low-level operator re-writes.
Chapter 10, Deep Learning on Apache Spark with DeepLearning4j and H2O, explains that deep learning is currently outperforming one traditional machine learning discipline after the other. We have three open source, first-class deep learning libraries running on top of Apache Spark: H2O, DeepLearning4j, and Apache SystemML. Let's understand what deep learning is and how to use it on top of Apache Spark using these libraries.
Chapter 11, Apache Spark GraphX, talks about graph processing with Scala using GraphX. You will learn some basic and also advanced graph algorithms and how to use GraphX to execute them.
Chapter 12, Apache Spark GraphFrames, discusses graph processing with Scala using GraphFrames. You will learn some basic and also advanced graph algorithms and also how GraphFrames differ from GraphX in execution.
Chapter 13, Apache Spark with Jupyter Notebooks on IBM DataScience Experience, introduces a Platform as a Service offering from IBM, which is completely based on an open source stack and on open standards. The main advantage is that you have no vendor lock-in. Everything you learn here can be installed and used in other clouds, in a local datacenter, or on your local laptop or PC.
Chapter 14, Apache Spark on Kubernetes, explains that Platform as a Service cloud providers completely manage the operations part of an Apache Spark cluster for you. This is an advantage, but sometimes you have to access individual cluster nodes for debugging and tweaking, and you don't want to deal with the complexity that maintaining a real cluster on bare-metal or virtual systems entails. Here, Kubernetes might be the best solution. Therefore, in this chapter, we explain what Kubernetes is and how it can be used to set up an Apache Spark cluster.
You will need the following to work with the examples in this book:
A laptop or PC with at least 6 GB main memory running Windows, macOS, or Linux
VirtualBox 5.1.22 or above
Hortonworks HDP Sandbox V2.6 or above
Eclipse Neon or above
Maven
Eclipse Maven Plugin
Eclipse Scala Plugin
Eclipse Git Plugin
This book is an extensive guide to Apache Spark from the programmer's and data scientist's perspective. It covers Apache Spark in depth, but also supplies practical working examples for different domains. Operational aspects are explained in sections on performance tuning and cloud deployments. All the chapters have working examples, which can be replicated easily.
In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.
Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "The next lines of code read the link and assign it to the BeautifulSoup function."
A block of code is set as follows:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
Any command-line input or output is written as follows:
[hadoop@hc2nn ~]# sudo su -
[root@hc2nn ~]# cd /tmp
New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: "In order to download new modules, we will go to Files | Settings | Project Name | Project Interpreter."
Feedback from our readers is always welcome. Let us know what you think about this book--what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.
To send us general feedback, simply e-mail [email protected], and mention the book's title in the subject of your message.
If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.
Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.
You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
You can download the code files by following these steps:
Log in or register to our website using your e-mail address and password.
Hover the mouse pointer on the SUPPORT tab at the top.
Click on Code Downloads & Errata.
Enter the name of the book in the Search box.
Select the book for which you're looking to download the code files.
Choose from the drop-down menu where you purchased this book from.
Click on Code Download.
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
WinRAR / 7-Zip for Windows
Zipeg / iZip / UnRarX for Mac
7-Zip / PeaZip for Linux
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Mastering-Apache-Spark-2x. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from https://www.packtpub.com/sites/default/files/downloads/MasteringApacheSpark2x_ColorImages.pdf.
Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books--maybe a mistake in the text or the code--we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.
To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.
Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.
Please contact us at [email protected] with a link to the suspected pirated material.
We appreciate your help in protecting our authors and our ability to bring you valuable content.
If you have a problem with any aspect of this book, you can contact us at [email protected], and we will do our best to address the problem.
Apache Spark is a distributed and highly scalable in-memory data analytics system, providing you with the ability to develop applications in Java, Scala, and Python, as well as languages such as R. It has one of the highest contribution/involvement rates among the Apache top-level projects at this time. Apache systems, such as Mahout, now use it as a processing engine instead of MapReduce. It is also possible to use a Hive context to have the Spark applications process data directly to and from Apache Hive.
Initially, Apache Spark provided four main submodules--SQL, MLlib, GraphX, and Streaming. They will all be explained in their own chapters, but a simple overview would be useful here. The modules are interoperable, so data can be passed between them. For instance, streamed data can be passed to SQL and a temporary table can be created. Since version 1.6.0, MLlib has a sibling called SparkML with a different API, which we will cover in later chapters.
The following figure explains how this book will address Apache Spark and its modules:
Machine learning is the real reason for Apache Spark because, at the end of the day, you don't want to just ship and transform data from A to B (a process called ETL: Extract, Transform, Load). You want to run advanced data analysis algorithms on top of your data, and you want to run these algorithms at scale. This is where Apache Spark kicks in.
Apache Spark, at its core, provides the runtime for massive parallel data processing, and different parallel machine learning libraries run on top of it. This is because there is an abundance of machine learning algorithms for popular programming languages like R and Python, but they are not scalable. As soon as you load more data than fits into the available main memory of the system, they crash.
Apache Spark, in contrast, can make use of multiple computer nodes to form a cluster, and even on a single node it can spill data transparently to disk, thereby avoiding the main memory bottleneck. Two interesting machine learning libraries are shipped with Apache Spark, but in this work we'll also cover third-party machine learning libraries.
The Spark MLlib module, Classical MLlib, offers a growing but incomplete list of machine learning algorithms. Since the introduction of the DataFrame-based machine learning API called SparkML, the destiny of MLlib is clear. It is only kept for backward compatibility reasons.
This is indeed a very wise decision, as we will discover in the next two chapters that structured data processing and the related optimization frameworks are currently disrupting the whole Apache Spark ecosystem. In SparkML, we have a machine learning library in place that can take advantage of these improvements out of the box, using it as an underlying layer.
Deep learning on Apache Spark uses H2O, Deeplearning4j, and Apache SystemML, which are other examples of very interesting third-party machine learning libraries that are not shipped with the Apache Spark distribution.
While H2O is somewhat complementary to MLlib, Deeplearning4j focuses solely on deep learning algorithms. Both use Apache Spark as a means for the parallelization of data processing. You might wonder why we want to tackle different machine learning libraries.
However, it is nice that there is so much choice and you are not locked into a single library when using Apache Spark. Open source means openness, and this is just one example of how we all benefit from this approach, in contrast to single-vendor, single-product lock-in. Although Apache Spark recently integrated GraphX, another Apache Spark library, into its distribution, we don't expect this to happen with third-party libraries too soon. Therefore, it is most likely that Apache Spark as a central data processing platform and additional third-party libraries will co-exist, with Apache Spark being the big data operating system and the third-party libraries the software you install and run on top of it.
Stream processing is another big and popular topic for Apache Spark. It involves the processing of data in Spark as streams and covers topics such as input and output operations, transformations, persistence, and checkpointing, among others.
Apache Spark Streaming will cover the area of stream processing, and we will also see practical examples of different types of stream processing. The chapter discusses batch and window stream configuration and provides a practical example of checkpointing. It also covers different examples of stream processing, including Kafka and Flume.
There are many ways in which stream data can be used. Other Spark module functionality (for example, SQL, MLlib, and GraphX) can be used to process the stream. You can use Spark Streaming with systems such as MQTT or ZeroMQ. You can even create custom receivers for your own user-defined data sources.
In Spark version 1.3, DataFrames were introduced into Apache Spark so that Spark data can be processed in tabular form and tabular functions (such as select, filter, and groupBy) can be used to process data. The Spark SQL module integrates with the Parquet and JSON formats to allow data to be stored in formats that better represent it. This also offers more options to integrate with external systems.
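As a minimal sketch of this tabular API (the data and column names are invented for illustration, and an existing SparkSession called spark is assumed), the following Scala snippet applies select, filter, and groupBy to a small in-memory DataFrame:

import spark.implicits._

// Create a small, hypothetical DataFrame in memory
val df = Seq(("anne", 34), ("bob", 52), ("carol", 29)).toDF("name", "age")

// Tabular processing: select columns, filter rows, group, and count
df.select("name", "age")
  .filter($"age" > 30)
  .groupBy("age")
  .count()
  .show()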
Apache Spark can also be integrated with Hive, the Hadoop big data database. Hive context-based Spark applications can be used to manipulate Hive-based table data. This brings Spark's fast in-memory distributed processing to Hive's big data storage capabilities, effectively letting Hive use Spark as a processing engine.
Additionally, there is an abundance of additional connectors to access NoSQL databases outside the Hadoop ecosystem directly from Apache Spark. In Chapter 2, Apache Spark SQL, we will see how the Cloudant connector can be used to access a remote Apache CouchDB NoSQL database and issue SQL statements against JSON-based NoSQL document collections.
Graph processing is another very important topic when it comes to data analysis. In fact, a majority of problems can be expressed as a graph.
A graph is basically a network of items and their relationships to each other. Items are called nodes and relationships are called edges. Relationships can be directed or undirected. Relationships, as well as items, can have properties. So a map, for example, can be represented as a graph as well. Each city is a node and the streets between the cities are edges. The distance between the cities can be assigned as properties on the edge.
The Apache Spark GraphX module allows Apache Spark to offer fast big data in-memory graph processing. This allows you to run graph algorithms at scale.
One of the most famous algorithms, for example, is the traveling salesman problem. Consider the graph representation of the map mentioned earlier. A salesman has to visit all cities of a region but wants to minimize the distance that he has to travel. As the distances between all the nodes are stored on the edges, a graph algorithm can actually tell you the optimal route. GraphX is able to create, manipulate, and analyze graphs using a variety of built-in algorithms.
It introduces two new data types to support graph processing in Spark--VertexRDD and EdgeRDD--to represent graph nodes and edges. It also introduces graph processing algorithms, such as PageRank and triangle processing. Many of these functions will be examined in Chapter 11, Apache Spark GraphX and Chapter 12, Apache Spark GraphFrames.
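As a small sketch (the city names and distances are invented, and an existing SparkContext called sc is assumed), the following GraphX code builds a graph and runs the built-in PageRank algorithm:

import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

// Nodes are cities; edge properties hold the distance between them in km
val cities: RDD[(VertexId, String)] =
  sc.parallelize(Seq((1L, "Berne"), (2L, "Zurich"), (3L, "Basel")))
val roads: RDD[Edge[Int]] =
  sc.parallelize(Seq(Edge(1L, 2L, 120), Edge(2L, 3L, 85), Edge(3L, 1L, 100)))

val graph = Graph(cities, roads)

// Run PageRank until it converges to the given tolerance
val ranks = graph.pageRank(0.0001).vertices
ranks.join(cities).collect().foreach(println)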
When examining big data processing systems, we think it is important to look at not just the system itself, but also how it can be extended and how it integrates with external systems so that greater levels of functionality can be offered. In a book of this size, we cannot cover every option, but by introducing a topic, we can hopefully stimulate the reader's interest so that they can investigate further.
We have used the H2O, SystemML, and Deeplearning4j machine learning libraries to extend Apache Spark's MLlib machine learning module. We have shown that deep learning and highly performant, cost-based optimized machine learning can be introduced to Apache Spark. However, we have just scratched the surface of these frameworks' functionality.
Since Apache Spark V2, many things have changed. This doesn't mean that the API has been broken. On the contrary, most V1.6 Apache Spark applications will run on Apache Spark V2 with no or only minor changes, but under the hood, there have been a lot of changes.
The first and most interesting thing to mention is the newest functionality of the Catalyst Optimizer, which we will cover in detail in Chapter 3, The Catalyst Optimizer. Catalyst creates a Logical Execution Plan (LEP) from a SQL query and optimizes this LEP to create multiple Physical Execution Plans (PEPs). Based on statistics, Catalyst chooses the best PEP to execute. This is very similar to cost-based optimizers in Relational Database Management Systems (RDBMSs). Catalyst makes heavy use of Project Tungsten, a component that we will cover in Chapter 4, Project Tungsten.
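You can observe this plan generation from any DataFrame via the explain method. In the following sketch (the Parquet path is a placeholder, and an existing SparkSession called spark is assumed), explain(true) prints the parsed and analyzed logical plans, the optimized logical plan, and the selected physical plan:

// Catalyst prints all stages from LEP to the chosen PEP
val people = spark.read.parquet("/tmp/people.parquet")
people.filter("age > 30").select("name").explain(true)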
Although the Java Virtual Machine (JVM) is a masterpiece on its own, it is a general-purpose byte code execution engine. Therefore, there is a lot of JVM object management and garbage collection (GC) overhead. So, for example, to store a 4-byte string, 48 bytes on the JVM are needed. The GC optimizes based on object lifetime estimation, but Apache Spark often knows this better than the JVM. Therefore, Tungsten disables the JVM GC for a subset of privately managed data structures to make them L1/L2/L3 cache-friendly.
In addition, code generation removes the boxing of primitive types and the cost of polymorphic function dispatching. Finally, a new first-class citizen called Dataset unified the RDD and DataFrame APIs. Datasets are statically typed and avoid runtime type errors; therefore, Datasets can be used only with Java and Scala. This means that Python and R users still have to stick to DataFrames, which are kept in Apache Spark V2 for backward compatibility reasons.
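As a brief sketch of this static typing (the case class and data are invented, and an existing SparkSession called spark is assumed), note that a typo in the lambda below would fail at compile time rather than at runtime:

// A case class gives the Dataset its static type
case class Person(name: String, age: Int)

import spark.implicits._
val ds = spark.createDataset(Seq(Person("anne", 34), Person("bob", 52)))

// p.age is statically known to be an Int; a misspelled field would not compile
val adults = ds.filter(p => p.age > 18)
adults.show()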
As we have already mentioned, Apache Spark is a distributed, in-memory, parallel processing system, which needs an associated storage system. So, when you build a big data cluster, you will probably use a distributed storage system such as Hadoop, as well as tools to move data such as Sqoop, Flume, and Kafka.
We want to introduce the idea of edge nodes in a big data cluster. These nodes in the cluster will be client-facing, and on them reside the client-facing components such as the Hadoop NameNode or perhaps the Spark master. The majority of the big data cluster might be behind a firewall. The edge nodes then reduce the complexity caused by the firewall, as they are the only points of contact accessible from outside. The following figure shows a simplified big data cluster:
It shows five simplified cluster nodes with executor JVMs, one per CPU core, and the Spark Driver JVM sitting outside the cluster. In addition, you see the disk directly attached to the nodes. This is called the JBOD (just a bunch of disks) approach. Very large files are partitioned over the disks and a virtual filesystem such as HDFS makes these chunks available as one large virtual file. This is, of course, stylized and simplified, but you get the idea.
The following simplified component model shows the driver JVM sitting outside the cluster. It talks to the Cluster Manager in order to obtain permission to schedule tasks on the worker nodes because the Cluster Manager keeps track of resource allocation of all processes running on the cluster.
As we will see later, there is a variety of different cluster managers, some of them also capable of managing other Hadoop workloads or even non-Hadoop applications in parallel to the Spark Executors. Note that the Executor and Driver have bidirectional communication all the time, so network-wise, they should also be sitting close together:
Generally, firewalls, while adding security to the cluster, also increase the complexity. Ports between system components need to be opened up so that they can talk to each other. For instance, Zookeeper is used by many components for configuration. Apache Kafka, the publish/subscribe messaging system, uses Zookeeper to configure its topics, groups, consumers, and producers. So, client ports to Zookeeper, potentially across the firewall, need to be open.
Finally, the allocation of systems to cluster nodes needs to be considered. For instance, if Apache Spark uses Flume or Kafka, then in-memory channels will be used. The size of these channels, and the memory they use due to the data flow, needs to be considered. Apache Spark should not be competing with other Apache components for memory usage. Depending upon your data flows and memory usage, it might be necessary to have Spark, Hadoop, Zookeeper, Flume, and other tools on distinct cluster nodes. Alternatively, resource managers such as YARN, Mesos, or Docker can be used to tackle this problem. In standard Hadoop environments, YARN is most likely.
Generally, the edge nodes that act as cluster NameNode servers or Spark master servers will need greater resources than the cluster processing nodes within the firewall. When many Hadoop ecosystem components are deployed on the cluster, all of them will need extra memory on the master server. You should monitor edge nodes for resource usage and adjust in terms of resources and/or application location as necessary. YARN, for instance, takes care of this.
This section has briefly set the scene for the big data cluster in terms of Apache Spark, Hadoop, and other tools. However, how might the Apache Spark cluster itself, within the big data cluster, be configured? For instance, it is possible to choose between many types of Spark cluster manager. The next section will examine this and describe each type.
The Spark context, as you will see in many of the examples in this book, can be defined via a Spark configuration object and Spark URL. The Spark context connects to the Spark cluster manager, which then allocates resources across the worker nodes for the application. The cluster manager allocates executors across the cluster worker nodes. It copies the application JAR file to the workers and finally allocates tasks.
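As a sketch in the Spark 2.x style (the application name and master URL are placeholders), the SparkSession builder wraps this configuration and exposes the underlying Spark context:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("my-spark-app")
  .master("spark://master-host:7077")  // one of the cluster manager URLs below
  .getOrCreate()

val sc = spark.sparkContext  // the underlying Spark context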
The following subsections describe the possible Apache Spark cluster manager options available at this time.
By specifying a local Spark configuration URL, it is possible to have the application run locally. By specifying local[n], it is possible to have Spark use n threads to run the application locally. This is a useful development and test option because you can also test some parallelization scenarios while keeping all log files on a single machine.
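A minimal sketch of such a local configuration (the application name is a placeholder) might look as follows:

import org.apache.spark.{SparkConf, SparkContext}

// local uses one thread, local[4] uses four, local[*] one per logical core
val conf = new SparkConf()
  .setAppName("local-test")
  .setMaster("local[4]")

val sc = new SparkContext(conf)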
Standalone mode uses a basic cluster manager that is supplied with Apache Spark. The spark master URL will be as follows:
spark://<hostname>:7077
Here, <hostname> is the name of the host on which the Spark master is running. We have specified 7077 as the port, which is the default value, but this is configurable. This simple cluster manager currently supports only FIFO (first-in, first-out) scheduling. You can contrive to allow concurrent application scheduling by setting the resource configuration options for each application; for instance, using spark.cores.max to share cores between applications.
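As a sketch (the host name is a placeholder), capping an application at four cores so that other applications can be scheduled concurrently could look like this:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("standalone-app")
  .setMaster("spark://master-host:7077")
  .set("spark.cores.max", "4")  // leave the remaining cores for other applications

val sc = new SparkContext(conf)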
At a larger scale, when integrating with Hadoop YARN, the Apache Spark cluster manager can be YARN, and the application can run in one of two modes. If the Spark master value is set to yarn-cluster, then the application can be submitted to the cluster and the submitting client terminated; the cluster will take care of allocating resources and running tasks. However, if the application is submitted in yarn-client mode, then the application stays alive during the life cycle of processing and requests resources from YARN.
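In Spark 2.x, these two modes are usually expressed on the spark-submit command line via --master yarn plus a --deploy-mode of cluster or client; the class name and JAR in this sketch are placeholders:

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.MyApp \
  myapp.jar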
Apache Mesos is an open source system for resource sharing across a cluster. It allows multiple frameworks to share a cluster by managing and scheduling resources. It is a cluster manager that provides isolation using Linux containers, allowing multiple systems such as Hadoop, Spark, Kafka, Storm, and more to share a cluster safely. It is highly scalable to thousands of nodes. It is a master/slave-based system and is fault tolerant, using Zookeeper for configuration management.
For a single master node Mesos cluster, the Spark master URL will be in this form:
mesos://<hostname>:5050
Here, <hostname> is the hostname of the Mesos master server; the port is defined as 5050, which is the default Mesos master port (this is configurable). If there are multiple Mesos master servers in a large-scale high availability Mesos cluster, then the Spark master URL would look as follows:
mesos://zk://<hostname>:2181
So, the election of the Mesos master server will be controlled by Zookeeper. The <hostname> will be the name of a host in the Zookeeper quorum. Also, the port number, 2181, is the default master port for Zookeeper.
There are three different abstraction levels of cloud systems--Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS). We will see how to use and install Apache Spark on all of these.
The new way to do IaaS is Docker and Kubernetes as opposed to virtual machines, basically providing a way to automatically set up an Apache Spark cluster within minutes. This will be covered in Chapter 14, Apache Spark on Kubernetes. The advantage of Kubernetes is that it can be used among multiple different cloud providers, as it is an open standard and also based on open source.
You can even use Kubernetes in a local data center and transparently and dynamically move workloads between local, dedicated, and public cloud data centers. PaaS, in contrast, takes away from you the burden of installing and operating an Apache Spark cluster, because this is provided as a service.
There is an ongoing discussion about whether Docker is IaaS or PaaS but, in our opinion, it is just a form of a lightweight preinstalled virtual machine. We will cover more on PaaS in Chapter 13, Apache Spark with Jupyter Notebooks on IBM DataScience Experience. This is particularly interesting because the offering is completely based on open source technologies, which enables you to replicate the system on any other data center.
One of the open source components we'll introduce is Jupyter Notebooks, a modern way to do data science in a cloud-based collaborative environment. In addition to Jupyter, there is also Apache Zeppelin, which we'll cover briefly in Chapter 14, Apache Spark on Kubernetes.
Before moving on to the rest of the chapters covering functional areas of Apache Spark and extensions, we will examine the area of performance. What issues and areas need to be considered? What might impact Spark application performance, starting at the cluster level and finishing with actual Scala code? We don't want to just repeat what the Spark website says, so take a look at this URL: http://spark.apache.org/docs/<version>/tuning.html.
Here, <version> relates to the version of Spark that you are using; that is, either the latest or something like 1.6.1 for a specific version. So, having looked at this page, we will briefly mention some of the topic areas. We will list some general points in this section without implying an order of importance.
The size and structure of your big data cluster is going to affect performance. If you have a cloud-based cluster, your I/O and latency will suffer in comparison to an unshared hardware cluster. You will be sharing the underlying hardware with multiple customers, and the cluster hardware may be remote. There are some exceptions to this. The IBM cloud, for instance, offers dedicated bare-metal high-performance cluster nodes with an InfiniBand network connection, which can be rented on an hourly basis.
Additionally, the positioning of cluster components on servers may cause resource contention. For instance, think carefully about locating Hadoop NameNodes, Spark servers, Zookeeper, Flume, and Kafka servers in large clusters. With high workloads, you might consider segregating servers to individual systems. You might also consider using an Apache system such as Mesos that provides better distribution and assignment of resources to the individual processes.
Consider potential parallelism as well. The greater the number of workers in your Spark cluster for large Datasets, the greater the opportunity for parallelism. One rule of thumb is one worker per hyper-thread or virtual core.
You might consider using an alternative to HDFS, depending upon your cluster requirements. For instance, IBM has GPFS (General Parallel File System) for improved performance.
The reason why GPFS might be a better choice is that, coming from the high-performance computing background, this filesystem has full read-write capability, whereas HDFS is designed as a write-once, read-many filesystem. It offers an improvement in performance over HDFS because it runs at the kernel level, as opposed to HDFS, which runs in a Java Virtual Machine (JVM) that in turn runs as an operating system process. It also integrates with Hadoop and the Spark cluster tools. IBM runs setups with several hundred petabytes using GPFS.
Another commercial alternative is the MapR file system that, besides performance improvements, supports mirroring, snapshots, and high availability.
Ceph is an open source, distributed, fault-tolerant, and self-healing filesystem for commodity hard drives and an alternative to HDFS. It runs in the Linux kernel as well and addresses many of the performance issues that HDFS has. Other promising candidates in this space are Alluxio (formerly Tachyon), Quantcast, GlusterFS, and Lustre.
Finally, Cassandra is not a filesystem but a NoSQL key-value store that is tightly integrated with Apache Spark and is therefore a valid and powerful alternative to HDFS--or even to any other distributed filesystem--especially as it supports predicate push-down using Apache SparkSQL and the Catalyst Optimizer, which we will cover in the following chapters.
The key to good data processing performance is the avoidance of network transfers. This was very true a couple of years ago; it is less relevant for tasks with high CPU demands and low I/O, but for data processing algorithms with low CPU and high I/O demands, it still holds.
Another way to achieve data locality is by using Apache SparkSQL. Depending on the connector implementation, SparkSQL can make use of the data processing capabilities of the source engine. So, for example, when using MongoDB in conjunction with SparkSQL, parts of the SQL statement are preprocessed by MongoDB before data is sent upstream to Apache Spark.
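As a sketch of the same idea with the built-in Parquet data source (the path and column name are hypothetical, and an existing SparkSession called spark is assumed), the physical plan printed by explain() lists the predicate under PushedFilters, showing that it is evaluated by the data source rather than by Spark:

import org.apache.spark.sql.functions.col

val sales = spark.read.parquet("/data/sales.parquet")
sales.filter(col("amount") > 1000).explain()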
In order to avoid OOM (Out of Memory) errors for the tasks on your Apache Spark cluster, please consider the following questions when tuning:
Consider the level of physical memory available on your Spark worker nodes. Can it be increased? Check on the memory consumption of operating system processes during high workloads in order to get an idea of free memory. Make sure that the workers have enough memory.
Consider data partitioning. Can you increase the number of partitions? As a rule of thumb, you should have at least as many partitions as you have available CPU cores on the cluster. Use the repartition function on the RDD API.
Can you modify the storage fraction and the memory used by the JVM for storage and caching of RDDs? Workers are competing for memory against data storage. Use the Storage page on the Apache Spark user interface to see if this fraction is set to an optimal value, then update the following properties (a configuration sketch follows this list):
spark.memory.fraction
spark.memory.storageFraction
spark.memory.offHeap.enabled=true
spark.memory.offHeap.size
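A configuration sketch for these properties (the values are illustrative starting points, not recommendations, and the application name is a placeholder) could look as follows:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("memory-tuning")
  .config("spark.memory.fraction", "0.7")          // heap share for execution and storage
  .config("spark.memory.storageFraction", "0.4")   // share of the above reserved for storage
  .config("spark.memory.offHeap.enabled", "true")
  .config("spark.memory.offHeap.size", "2g")       // off-heap pool managed by Tungsten
  .getOrCreate()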
In addition, the following two things can be done in order to improve performance:
Consider using Parquet as a storage format, which is much more storage-efficient than CSV or JSON
Consider using the DataFrame/Dataset API instead of the RDD API, as it might result in more efficient execution (more about this in the next three chapters); a short sketch follows this list
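A short sketch combining both suggestions (the paths are hypothetical, and an existing SparkSession called spark is assumed): convert a CSV file into Parquet once, then keep processing through the DataFrame API so that Catalyst and Tungsten can optimize the execution:

val csv = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/data/input.csv")

// One-off conversion to the columnar, compressed Parquet format
csv.write.parquet("/data/input.parquet")

// Subsequent reads benefit from columnar storage and predicate push-down
val events = spark.read.parquet("/data/input.parquet")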
