A practical guide for solving complex data processing challenges by applying the best optimization techniques in Apache Spark.
Book Description
Apache Spark is a flexible framework that allows processing of batch and real-time data. Its unified engine has made it quite popular for big data use cases. This book will help you to get started with Apache Spark 2.0 and write big data applications for a variety of use cases.
It will also introduce you to Apache Spark, one of the most popular big data processing frameworks. Although this book is intended to help you get started with Apache Spark, it also focuses on explaining the core concepts.
This practical guide provides a quick start to the Spark 2.0 architecture and its components. It teaches you how to set up Spark on your local machine. As we move ahead, you will be introduced to resilient distributed datasets (RDDs) and DataFrame APIs, and their corresponding transformations and actions. Then, we move on to the life cycle of a Spark application and learn about the techniques used to debug slow-running applications. You will also go through Spark's built-in modules for SQL, streaming, machine learning, and graph analysis.
Finally, the book will lay out the best practices and optimization techniques that are key for writing efficient Spark applications. By the end of this book, you will have a sound fundamental understanding of the Apache Spark framework and you will be able to write and optimize Spark applications.
Who this book is for
If you are a big data enthusiast and love processing huge amounts of data, this book is for you. If you are a data engineer looking for the best optimization techniques for your Spark applications, then you will find this book helpful. This book will also help data scientists who want to implement their machine learning algorithms in Spark. You need a basic understanding of at least one programming language, such as Scala, Python, or Java.
Copyright © 2019 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Commissioning Editor: Amey Varangaonkar
Acquisition Editor: Siddharth Mandal
Content Development Editor: Smit Carvalho
Technical Editor: Aishwarya More
Copy Editor: Safis Editing
Project Coordinator: Pragati Shukla
Proofreader: Safis Editing
Indexer: Pratik Shirodkar
Graphics: Alishon Mendonsa
Production Coordinator: Deepika Naik
First published: January 2019
Production reference: 1310119
Published by Packt Publishing Ltd.
Livery Place, 35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78934-910-8
www.packtpub.com
Mapt is an online digital library that gives you full access to over 5,000 books and videos, as well as industry-leading tools to help you plan your personal development and advance your career. For more information, please visit our website.
Spend less time learning and more time coding with practical eBooks and videos from over 4,000 industry professionals
Improve your learning with Skill Plans built especially for you
Get a free eBook or video every month
Mapt is fully searchable
Copy and paste, print, and bookmark content
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.packt.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.
At www.packt.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.
Shrey Mehrotra has over 8 years of IT experience and, for the past 6 years, has been designing the architecture of cloud and big-data solutions for the finance, media, and governance sectors. Having worked on research and development with big-data labs and been part of Risk Technologies, he has gained insights into Hadoop, with a focus on Spark, HBase, and Hive. His technical strengths also include Elasticsearch, Kafka, Java, YARN, Sqoop, and Flume. He likes spending time performing research and development on different big-data technologies. He is the coauthor of the books Learning YARN and Hive Cookbook, a certified Hadoop developer, and he has also written various technical papers.
Akash Grade is a data engineer living in New Delhi, India. Akash graduated with a BSc in computer science from the University of Delhi in 2011, and later earned an MSc in software engineering from BITS Pilani. He spends most of his time designing highly scalable data pipelines using big data solutions such as Apache Spark, Hive, and Kafka. Akash is also a Databricks-certified Spark developer. He has been working on Apache Spark for the last five years, and enjoys writing applications in Python, Go, and SQL.
Nisith Kumar Nanda is a passionate big data consultant who loves to find solutions to complex data problems. He has around 10 years of IT experience working on multiple technologies with various clients globally. His core expertise lies in working with open source big data technologies such as Apache Spark, Kafka, Cassandra, and HBase to build critical, next-generation real-time and batch applications. He is very proficient in various programming languages, such as Java, Scala, and Python. He is passionate about AI, machine learning, and NLP.
If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.
Title Page
Copyright and Credits
Apache Spark Quick Start Guide
About Packt
Why subscribe?
Packt.com
Contributors
About the authors
About the reviewer
Packt is searching for authors like you
Preface
Who this book is for
What this book covers
To get the most out of this book
Download the example code files
Download the color images
Conventions used
Get in touch
Reviews
Introduction to Apache Spark
What is Spark?
Spark architecture overview
Spark language APIs
Scala
Java
Python
R
SQL
Spark components
Spark Core
Spark SQL
Spark Streaming
Spark machine learning
Spark graph processing
Cluster manager
Standalone scheduler
YARN
Mesos
Kubernetes
Making the most of Hadoop and Spark
Summary
Apache Spark Installation
AWS Elastic Compute Cloud (EC2)
Creating a free account on AWS
Connecting to your Linux instance
Configuring Spark
Prerequisites
Installing Java
Installing Scala
Installing Python
Installing Spark
Using Spark components
Different modes of execution
Spark sandbox
Summary
Spark RDD
What is an RDD?
Resilient metadata
Programming using RDDs
Transformations and actions
Transformation
Narrow transformations
map()
flatMap()
filter()
union()
mapPartitions()
Wide transformations
distinct()
sortBy()
intersection()
subtract()
cartesian()
Action
collect()
count()
take()
top()
takeOrdered()
first()
countByValue()
reduce()
saveAsTextFile()
foreach()
Types of RDDs
Pair RDDs
groupByKey()
reduceByKey()
sortByKey()
join()
Caching and checkpointing
Caching
Checkpointing 
Understanding partitions 
repartition() versus coalesce()
partitionBy()
Drawbacks of using RDDs
Summary
Spark DataFrame and Dataset
DataFrames
Creating DataFrames
Data sources
DataFrame operations and associated functions
Running SQL on DataFrames
Temporary views on DataFrames
Global temporary views on DataFrames
Datasets
Encoders
Internal row
Creating custom encoders
Summary
Spark Architecture and Application Execution Flow
A sample application
DAG constructor
Stage
Tasks
Task scheduler
FIFO
FAIR
Application execution modes
Local mode
Client mode
Cluster mode
Application monitoring
Spark UI
Application logs
External monitoring solution
Summary
Spark SQL
Spark SQL
Spark metastore
Using the Hive metastore in Spark SQL
Hive configuration with Spark
SQL language manual
Database
Table and view
Load data
Creating UDFs
SQL database using JDBC
Summary
Spark Streaming, Machine Learning, and Graph Analysis
Spark Streaming
Use cases
Data sources
Stream processing
Microbatch
DStreams
Streaming architecture
Streaming example
Machine learning
MLlib
ML
Graph processing
GraphX
mapVertices
mapEdges
subgraph
GraphFrames
degrees
subgraphs
Graph algorithms
PageRank
Summary
Spark Optimizations
Cluster-level optimizations
Memory
Disk
CPU cores
Project Tungsten
Application optimizations
Language choice
Structured versus unstructured APIs
File format choice
RDD optimizations
Choosing the right transformations
Serializing and compressing 
Broadcast variables
DataFrame and dataset optimizations
Catalyst optimizer
Storage 
Parallelism 
Join performance
Code generation 
Speculative execution
Summary
Other Books You May Enjoy
Leave a review - let other readers know what you think
Apache Spark is a flexible in-memory framework that allows the processing of both batch and real-time data in a distributed way. Its unified engine has made it quite popular for big data use cases.
This book will help you to quickly get started with Apache Spark 2.x and help you write efficient big data applications for a variety of use cases. You will get to grips with the low-level details as well as the core concepts of Apache Spark, and the way they can be used to solve big data problems. You will be introduced to the RDD and DataFrame APIs, and their corresponding transformations and actions.
This book will help you learn Spark's components for machine learning, stream processing, and graph analysis. At the end of the book, you'll learn different optimization techniques for writing efficient Spark code.
If you are a big data enthusiast and love processing huge amounts of data, this book is for you. If you are a data engineer looking for the best optimization techniques for your Spark applications, then you will find this book helpful. This book will also help data scientists who want to implement their machine learning algorithms in Spark. You need to have a basic understanding of programming languages such as Scala, Python, or Java.
Chapter 1, Introduction to Apache Spark, provides an introduction to Spark 2.0. It provides a brief description of different Spark components, including Spark Core, Spark SQL, Spark Streaming, machine learning, and graph processing. It also discusses the advantages of Spark compared to other similar frameworks.
Chapter 2, Apache Spark Installation, provides a step-by-step guide to installing Spark on an AWS EC2 instance from scratch. It also helps you install all the prerequisites, such as Python, Java, and Scala.
Chapter 3, Spark RDD, explains the Resilient Distributed Dataset (RDD) API, which is the heart of Apache Spark. It also discusses various transformations and actions that can be applied to an RDD.
Chapter 4, Spark DataFrame and Dataset, covers Spark's structured APIs: DataFrame and Dataset. This chapter also covers various operations that can be performed on a DataFrame or Dataset.
Chapter 5, Spark Architecture and Application Execution Flow, explains the interaction between the different services involved in Spark application execution. It explains the role of worker nodes, executors, and drivers in application execution in both client and cluster modes. It also explains how Spark creates a Directed Acyclic Graph (DAG) that consists of stages and tasks.
Chapter 6, Spark SQL, discusses how Spark gracefully supports all SQL operations by providing a Spark-SQL interface and various DataFrame APIs. It also covers the seamless integration of Spark with the Hive metastore.
Chapter 7, Spark Streaming, Machine Learning, and Graph Analysis, explores the different Spark APIs for working with real-time data streams, machine learning, and graphs. It also explains how to choose among these features based on use-case requirements.
Chapter 8, Spark Optimizations, covers different optimization techniques to improve the performance of your Spark applications. It explains how you can use resources such as executors and memory in order to better parallelize your tasks.
Use a machine with a recent version of Linux or macOS. It will be useful to know the basic syntax of Scala, Python, and Java. Install Python's NumPy package in order to work with Spark's machine learning packages.
You can download the example code files for this book from your account at www.packt.com. If you purchased this book elsewhere, you can visit www.packt.com/support and register to have the files emailed directly to you.
You can download the code files by following these steps:
1. Log in or register at www.packt.com.
2. Select the SUPPORT tab.
3. Click on Code Downloads and Errata.
4. Enter the name of the book in the Search box and follow the onscreen instructions.
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
WinRAR/7-Zip for Windows
Zipeg/iZip/UnRarX for Mac
7-Zip/PeaZip for Linux
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Apache-Spark-Quick-Start-Guide. In case there's an update to the code, it will be updated on the existing GitHub repository.
We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: https://www.packtpub.com/sites/default/files/downloads/9781789349108_ColorImages.pdf.
There are a number of text conventions used throughout this book.
CodeInText: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "Mount the downloaded WebStorm-10*.dmg disk image file as another disk in your system."
Any command-line input or output is written as follows:
$ mkdir css
$ cd css
Bold: Indicates a new term, an important word, or words that you see onscreen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: "Select System info from the Administration panel."
Feedback from our readers is always welcome:
General feedback: If you have questions about any aspect of this book, mention the book title in the subject of your message and email us at [email protected].
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packt.com/submit-errata, select your book, click on the Errata Submission Form link, and enter the details.
Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.
Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!
For more information about Packt, please visit packt.com.
Apache Spark is an open source framework for processing large datasets stored in heterogeneous data stores in an efficient and fast way. Sophisticated analytical algorithms can be easily executed on these large datasets. Spark can execute a distributed program up to 100 times faster than MapReduce. As Spark is one of the fastest-growing projects in the open source community, it provides a large number of libraries to its users.
We shall cover the following topics in this chapter:
A brief introduction to Spark
Spark architecture and the different languages that can be used for coding Spark applications
Spark components and how these components can be used together to solve a variety of use cases
A comparison between Spark and Hadoop
Apache Spark is a distributed computing framework that makes big data processing easy, fast, and scalable. You must be wondering what makes Spark so popular in the industry, and how it really differs from the existing tools available for big data processing. The reason is that it provides a unified stack for processing all different kinds of big data, be it batch, streaming, machine learning, or graph data.
Spark was developed at UC Berkeley's AMPLab in 2009, open-sourced in 2010, and later donated to the Apache Software Foundation in 2013. The framework is mainly written in Scala and Java.
Spark provides an interface with many different distributed and non-distributed data stores, such as Hadoop Distributed File System (HDFS), Cassandra, OpenStack Swift, Amazon S3, and Kudu. It also provides a wide variety of language APIs to perform analytics on the data stored in these data stores. These APIs include Scala, Java, Python, and R.
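To illustrate, here is a minimal PySpark sketch that reads the same kind of data from two different stores through a single API. The host name, bucket, and file paths are hypothetical placeholders, and the S3 read additionally assumes that the Hadoop S3A connector (hadoop-aws) is available on the classpath:
from pyspark.sql import SparkSession

# SparkSession is the unified entry point introduced in Spark 2.0.
spark = SparkSession.builder.appName("data-store-demo").getOrCreate()
sc = spark.sparkContext

# The same textFile() call works across stores; only the URI scheme changes.
# Both paths below are hypothetical placeholders.
hdfs_lines = sc.textFile("hdfs://namenode:8020/logs/events.log")
s3_lines = sc.textFile("s3a://my-bucket/logs/events.log")

print(hdfs_lines.count(), s3_lines.count())
spark.stop()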
The basic entity of Spark is the Resilient Distributed Dataset (RDD), which is a read-only, partitioned collection of data. An RDD can be created from data stored in different data stores or from an existing RDD. We shall discuss this in more detail in Chapter 3, Spark RDD.
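As a quick preview of what Chapter 3 covers in depth, here is a minimal sketch, assuming a working PySpark installation, that creates an RDD from an in-memory collection and then derives a new RDD from the existing one:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-preview").getOrCreate()
sc = spark.sparkContext

numbers = sc.parallelize([1, 2, 3, 4, 5])  # an RDD created from a local collection
squares = numbers.map(lambda x: x * x)     # a new RDD derived from the existing one
print(squares.collect())                   # [1, 4, 9, 16, 25]

spark.stop()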
