A practical guide for solving complex data processing challenges by applying the best optimization techniques in Apache Spark.
Book Description
Apache Spark is a flexible framework that allows processing of batch and real-time data. Its unified engine has made it quite popular for big data use cases. This book will help you to get started with Apache Spark 2.0 and write big data applications for a variety of use cases.
It will also introduce you to Apache Spark, one of the most popular big data processing frameworks. Although this book is intended to help you get started with Apache Spark, it also focuses on explaining the core concepts.
This practical guide provides a quick start to the Spark 2.0 architecture and its components. It teaches you how to set up Spark on your local machine. As we move ahead, you will be introduced to resilient distributed datasets (RDDs) and DataFrame APIs, and their corresponding transformations and actions. Then, we move on to the life cycle of a Spark application and learn about the techniques used to debug slow-running applications. You will also go through Spark's built-in modules for SQL, streaming, machine learning, and graph analysis.
Finally, the book will lay out the best practices and optimization techniques that are key for writing efficient Spark applications. By the end of this book, you will have a sound fundamental understanding of the Apache Spark framework and you will be able to write and optimize Spark applications.
Who this book is for
If you are a big data enthusiast and love processing huge amounts of data, this book is for you. If you are a data engineer looking for the best optimization techniques for your Spark applications, then you will find this book helpful. This book will also help data scientists who want to implement their machine learning algorithms in Spark. You need a basic understanding of at least one programming language, such as Scala, Python, or Java.
Copyright © 2019 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Commissioning Editor: Amey Varangaonkar
Acquisition Editor: Siddharth Mandal
Content Development Editor: Smit Carvalho
Technical Editor: Aishwarya More
Copy Editor: Safis Editing
Project Coordinator: Pragati Shukla
Proofreader: Safis Editing
Indexer: Pratik Shirodkar
Graphics: Alishon Mendonsa
Production Coordinator: Deepika Naik
First published: January 2019
Production reference: 1310119
Published by Packt Publishing Ltd.
Livery Place, 35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78934-910-8
www.packtpub.com
Mapt is an online digital library that gives you full access to over 5,000 books and videos, as well as industry-leading tools to help you plan your personal development and advance your career. For more information, please visit our website.
Spend less time learning and more time coding with practical eBooks and videos from over 4,000 industry professionals
Improve your learning with Skill Plans built especially for you
Get a free eBook or video every month
Mapt is fully searchable
Copy and paste, print, and bookmark content
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.packt.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.
At www.packt.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.
Shrey Mehrotra has over 8 years of IT experience and, for the past 6 years, has been designing the architecture of cloud and big-data solutions for the finance, media, and governance sectors. Having worked on research and development with big-data labs and been part of Risk Technologies, he has gained insights into Hadoop, with a focus on Spark, HBase, and Hive. His technical strengths also include Elasticsearch, Kafka, Java, YARN, Sqoop, and Flume. He likes spending time performing research and development on different big-data technologies. He is the coauthor of the books Learning YARN and Hive Cookbook, a certified Hadoop developer, and he has also written various technical papers.
Akash Grade is a data engineer living in New Delhi, India. Akash graduated with a BSc in computer science from the University of Delhi in 2011, and later earned an MSc in software engineering from BITS Pilani. He spends most of his time designing highly scalable data pipelines using big data solutions such as Apache Spark, Hive, and Kafka. Akash is also a Databricks-certified Spark developer. He has been working on Apache Spark for the last five years, and enjoys writing applications in Python, Go, and SQL.
Nisith Kumar Nanda is a passionate big data consultant who loves to find solutions to complex data problems. He has around 10 years of IT experience working on multiple technologies with various clients globally. His core expertise lies in working with open source big data technologies such as Apache Spark, Kafka, Cassandra, and HBase to build critical, next-generation real-time and batch applications. He is very proficient in various programming languages, such as Java, Scala, and Python. He is passionate about AI, machine learning, and NLP.
If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.
Title Page
Copyright and Credits
Apache Spark Quick Start Guide
About Packt
Why subscribe?
Packt.com
Contributors
About the authors
About the reviewer
Packt is searching for authors like you
Preface
Who this book is for
What this book covers
To get the most out of this book
Download the example code files
Download the color images
Conventions used
Get in touch
Reviews
Introduction to Apache Spark
What is Spark?
Spark architecture overview
Spark language APIs
Scala
Java
Python
R
SQL
Spark components
Spark Core
Spark SQL
Spark Streaming
Spark machine learning
Spark graph processing
Cluster manager
Standalone scheduler
YARN
Mesos
Kubernetes
Making the most of Hadoop and Spark
Summary
Apache Spark Installation
AWS Elastic Compute Cloud (EC2)
Creating a free account on AWS
Connecting to your Linux instance
Configuring Spark
Prerequisites
Installing Java
Installing Scala
Installing Python
Installing Spark
Using Spark components
Different modes of execution
Spark sandbox
Summary
Spark RDD
What is an RDD?
Resilient metadata
Programming using RDDs
Transformations and actions
Transformation
Narrow transformations
map()
flatMap()
filter()
union()
mapPartitions()
Wide transformations
distinct()
sortBy()
intersection()
subtract()
cartesian()
Action
collect()
count()
take()
top()
takeOrdered()
first()
countByValue()
reduce()
saveAsTextFile()
foreach()
Types of RDDs
Pair RDDs
groupByKey()
reduceByKey()
sortByKey()
join()
Caching and checkpointing
Caching
Checkpointing 
Understanding partitions 
repartition() versus coalesce()
partitionBy()
Drawbacks of using RDDs
Summary
Spark DataFrame and Dataset
DataFrames
Creating DataFrames
Data sources
DataFrame operations and associated functions
Running SQL on DataFrames
Temporary views on DataFrames
Global temporary views on DataFrames
Datasets
Encoders
Internal row
Creating custom encoders
Summary
Spark Architecture and Application Execution Flow
A sample application
DAG constructor
Stage
Tasks
Task scheduler
FIFO
FAIR
Application execution modes
Local mode
Client mode
Cluster mode
Application monitoring
Spark UI
Application logs
External monitoring solution
Summary
Spark SQL
Spark SQL
Spark metastore
Using the Hive metastore in Spark SQL
Hive configuration with Spark
SQL language manual
Database
Table and view
Load data
Creating UDFs
SQL database using JDBC
Summary
Spark Streaming, Machine Learning, and Graph Analysis
Spark Streaming
Use cases
Data sources
Stream processing
Microbatch
DStreams
Streaming architecture
Streaming example
Machine learning
MLlib
ML
Graph processing
GraphX
mapVertices
mapEdges
subgraph
GraphFrames
degrees
subgraphs
Graph algorithms
PageRank
Summary
Spark Optimizations
Cluster-level optimizations
Memory
Disk
CPU cores
Project Tungsten
Application optimizations
Language choice
Structured versus unstructured APIs
File format choice
RDD optimizations
Choosing the right transformations
Serializing and compressing 
Broadcast variables
DataFrame and dataset optimizations
Catalyst optimizer
Storage 
Parallelism 
Join performance
Code generation 
Speculative execution
Summary
Other Books You May Enjoy
Leave a review - let other readers know what you think
Apache Spark is a flexible in-memory framework that allows the processing of both batch and real-time data in a distributed way. Its unified engine has made it quite popular for big data use cases.
This book will help you to quickly get started with Apache Spark 2.x and help you write efficient big data applications for a variety of use cases. You will get to grips with the low-level details as well as the core concepts of Apache Spark, and the way they can be used to solve big data problems. You will be introduced to the RDD and DataFrame APIs, and their corresponding transformations and actions.
This book will help you learn Spark's components for machine learning, stream processing, and graph analysis. At the end of the book, you'll learn different optimization techniques for writing efficient Spark code.
If you are a big data enthusiast and love processing huge amounts of data, this book is for you. If you are a data engineer looking for the best optimization techniques for your Spark applications, then you will find this book helpful. This book will also help data scientists who want to implement their machine learning algorithms in Spark. You need to have a basic understanding of programming languages such as Scala, Python, or Java.
Chapter 1, Introduction to Apache Spark, provides an introduction to Spark 2.0. It provides a brief description of different Spark components, including Spark Core, Spark SQL, Spark Streaming, machine learning, and graph processing. It also discusses the advantages of Spark compared to other similar frameworks.
Chapter 2, Apache Spark Installation, provides a step-by-step guide to installing Spark on an AWS EC2 instance from scratch. It also helps you install all the prerequisites, such as Python, Java, and Scala.
Chapter 3, Spark RDD, explains the Resilient Distributed Dataset (RDD) API, which is the heart of Apache Spark. It also discusses various transformations and actions that can be applied to an RDD.
Chapter 4, Spark DataFrame and Dataset, covers Spark's structured APIs: DataFrame and Dataset. This chapter also covers various operations that can be performed on a DataFrame or Dataset.
Chapter 5, Spark Architecture and Application Execution Flow, explains the interaction between the different services involved in Spark application execution. It explains the role of worker nodes, executors, and drivers in application execution in both client and cluster modes. It also explains how Spark creates a Directed Acyclic Graph (DAG) that consists of stages and tasks.
Chapter 6, Spark SQL, discusses how Spark gracefully supports all SQL operations by providing a Spark-SQL interface and various DataFrame APIs. It also covers the seamless integration of Spark with the Hive metastore.
Chapter 7, Spark Streaming, Machine Learning, and Graph Analysis, explores the different Spark APIs for working with real-time data streams, machine learning, and graphs. It also explains how to choose among these features based on use-case requirements.
Chapter 8, Spark Optimizations, covers different optimization techniques to improve the performance of your Spark applications. It explains how you can use resources such as executors and memory in order to better parallelize your tasks.
Use a machine with a recent version of Linux or macOS. It will be useful to know the basic syntax of Scala, Python, and Java. Install Python's NumPy package in order to work with Spark's machine learning packages.
You can download the example code files for this book from your account at www.packt.com. If you purchased this book elsewhere, you can visit www.packt.com/support and register to have the files emailed directly to you.
You can download the code files by following these steps:
1. Log in or register at www.packt.com.
2. Select the SUPPORT tab.
3. Click on Code Downloads and Errata.
4. Enter the name of the book in the Search box and follow the onscreen instructions.
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
WinRAR/7-Zip for Windows
Zipeg/iZip/UnRarX for Mac
7-Zip/PeaZip for Linux
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Apache-Spark-Quick-Start-Guide. In case there's an update to the code, it will be updated on the existing GitHub repository.
We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: https://www.packtpub.com/sites/default/files/downloads/9781789349108_ColorImages.pdf.
There are a number of text conventions used throughout this book.
CodeInText: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "Mount the downloaded WebStorm-10*.dmg disk image file as another disk in your system."
Any command-line input or output is written as follows:
$ mkdir css
$ cd css
Bold: Indicates a new term, an important word, or words that you see onscreen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: "Select System info from the Administration panel."
Feedback from our readers is always welcome:
General feedback: If you have questions about any aspect of this book, mention the book title in the subject of your message and email us at [email protected].
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packt.com/submit-errata, select your book, click on the Errata Submission Form link, and enter the details.
Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.
Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!
For more information about Packt, please visit packt.com.
Apache Spark is an open source framework for processing large datasets stored in heterogeneous data stores in an efficient and fast way. Sophisticated analytical algorithms can be easily executed on these large datasets. Spark can execute a distributed program up to 100 times faster than MapReduce. As Spark is one of the fastest-growing projects in the open source community, it provides a large number of libraries to its users.
We shall cover the following topics in this chapter:
A brief introduction to Spark
Spark architecture and the different languages that can be used for coding Spark applications
Spark components and how these components can be used together to solve a variety of use cases
A comparison between Spark and Hadoop
Apache Spark is a distributed computing framework that makes big data processing easy, fast, and scalable. You must be wondering what makes Spark so popular in the industry, and how it really differs from the existing tools available for big data processing. The reason is that it provides a unified stack for processing all different kinds of big data, be it batch, streaming, machine learning, or graph data.
Spark was developed at UC Berkeley's AMPLab in 2009, open-sourced in 2010, and later donated to the Apache Software Foundation in 2013. The framework is mainly written in Scala and Java.
Spark provides an interface with many different distributed and non-distributed data stores, such as Hadoop Distributed File System (HDFS), Cassandra, OpenStack Swift, Amazon S3, and Kudu. It also provides a wide variety of language APIs to perform analytics on the data stored in these data stores. These APIs include Scala, Java, Python, and R.
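To illustrate, here is a minimal PySpark sketch that reads the same kind of data from two different stores through a single API. The host name, bucket, and file paths are hypothetical placeholders, and the S3 read additionally assumes that the Hadoop S3A connector (hadoop-aws) is available on the classpath:
from pyspark.sql import SparkSession

# SparkSession is the unified entry point introduced in Spark 2.0.
spark = SparkSession.builder.appName("data-store-demo").getOrCreate()
sc = spark.sparkContext

# The same textFile() call works across stores; only the URI scheme changes.
# Both paths below are hypothetical placeholders.
hdfs_lines = sc.textFile("hdfs://namenode:8020/logs/events.log")
s3_lines = sc.textFile("s3a://my-bucket/logs/events.log")

print(hdfs_lines.count(), s3_lines.count())
spark.stop()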
The basic entity of Spark is the Resilient Distributed Dataset (RDD), which is a read-only, partitioned collection of data. An RDD can be created from data stored in different data stores or from an existing RDD. We shall discuss this in more detail in Chapter 3, Spark RDD.
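As a quick preview of what Chapter 3 covers in depth, here is a minimal sketch, assuming a working PySpark installation, that creates an RDD from an in-memory collection and then derives a new RDD from the existing one:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-preview").getOrCreate()
sc = spark.sparkContext

numbers = sc.parallelize([1, 2, 3, 4, 5])  # an RDD created from a local collection
squares = numbers.map(lambda x: x * x)     # a new RDD derived from the existing one
print(squares.collect())                   # [1, 4, 9, 16, 25]

spark.stop()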
