Copyright © 2015 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, nor its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: July 2015
Production reference: 1160715
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78398-706-1
www.packtpub.com
Cover image by: InfoObjects design team
Author
Rishi Yadav
Reviewers
Thomas W. Dinsmore
Cheng Lian
Amir Sedighi
Commissioning Editor
Kunal Parikh
Acquisition Editors
Shaon Basu
Neha Nagwekar
Content Development Editor
Ritika Singh
Technical Editor
Ankita Thakur
Copy Editors
Ameesha Smith-Green
Swati Priya
Project Coordinator
Milton Dsouza
Proofreader
Safis Editing
Indexer
Mariammal Chettiyar
Graphics
Sheetal Aute
Production Coordinator
Nilesh R. Mohite
Cover Work
Nilesh R. Mohite
Rishi Yadav has 17 years of experience in designing and developing enterprise applications. He is an open source software expert and advises American companies on big data trends. Rishi was honored as one of Silicon Valley's 40 under 40 in 2014. He finished his bachelor's degree at the prestigious Indian Institute of Technology (IIT) Delhi in 1998.
About 10 years ago, Rishi started InfoObjects, a company that helps data-driven businesses gain new insights into data.
InfoObjects combines the power of open source and big data to solve business challenges for its clients and has a special focus on Apache Spark. The company has been on the Inc. 5000 list of the fastest-growing companies for four years in a row. InfoObjects was also named the #1 best place to work in the Bay Area in 2014 and 2015.
Rishi is an open source contributor and active blogger.
My special thanks go to my better half, Anjali, for putting up with the long, arduous hours that were added to my already swamped schedule; our 8-year-old son, Vedant, who tracked my progress on a daily basis; InfoObjects' CTO and my business partner, Sudhir Jangir, for leading the big data effort in the company; Helma Zargarian, Yogesh Chandani, Animesh Chauhan, and Katie Nelson for running operations skillfully so that I could focus on this book; and our internal review team, especially Arivoli Tirouvingadame, Lalit Shravage, and Sanjay Shroff, for helping with the review. I could not have written this book without your support. I would also like to thank Marcel Izumi for putting together amazing graphics.
Thomas W. Dinsmore is an independent consultant, offering product advisory services to analytic software vendors. To this role, he brings 30 years of experience, delivering analytics solutions to enterprises around the world. He uniquely combines hands-on analytics experience with the ability to lead analytic projects and interpret results.
Thomas has previously held roles with SAS, IBM, The Boston Consulting Group, PricewaterhouseCoopers, and Oliver Wyman.
Thomas coauthored Modern Analytics Methodologies and Advanced Analytics Methodologies, published in 2014 by Pearson FT Press, and is under contract for a forthcoming book on business analytics from Apress. He publishes The Big Analytics Blog at www.thomaswdinsmore.com.
I would like to thank the entire editorial and production team at Packt Publishing, who work tirelessly to bring quality books to the public.
Cheng Lian is a Chinese software engineer and Apache Spark committer from Databricks. His major technical interests include big data analytics, distributed systems, and functional programming languages.
Cheng is also the translator of the Chinese edition of Erlang and OTP in Action and Concurrent Programming in Erlang (Part I).
I would like to thank Yi Tian from AsiaInfo for helping me review some parts of Chapter 6, Getting Started with Machine Learning Using MLlib.
Amir Sedighi is an experienced software engineer, a keen learner, and a creative problem solver. His experience spans a wide range of software development areas, including cross-platform development, big data processing and data streaming, information retrieval, and machine learning. He is a big data lecturer and expert, working in Iran. He holds a bachelor's and master's degree in software engineering. Amir is currently the CEO of Rayanesh Dadegan Ekbatan, the company he cofounded in 2013 after several years of designing and implementing distributed big data and data streaming solutions for private sector companies.
I would like to thank the entire team at Packt Publishing, who work hard to bring awesomeness to their books and to their readers' professional lives.
For support files and downloads related to your book, please visit www.PacktPub.com.
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at <[email protected]> for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.
https://www2.packtpub.com/books/subscription/packtlib
Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can search, access, and read Packt's entire library of books.
If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view nine entirely free books. Simply use your login credentials for immediate access.
The success of Hadoop as a big data platform raised user expectations, both in terms of solving different analytics challenges and reducing latency. Various tools evolved over time, but when Apache Spark arrived, it provided a single runtime to address all of these challenges, eliminating the need to combine multiple tools, each with its own challenges and learning curve. By using memory for storage in addition to computation, Apache Spark removes the need to store intermediate data on disk and increases processing speed by up to 100 times. Its libraries address various analytics needs, such as machine learning and real-time streaming, within the same runtime.
This book covers the installation and configuration of Apache Spark and building solutions using Spark Core, Spark SQL, Spark Streaming, MLlib, and GraphX libraries.
For more information on this book's recipes, please visit infoobjects.com/spark-cookbook.
Chapter 1, Getting Started with Apache Spark, explains how to install Spark on various environments and cluster managers.
Chapter 2, Developing Applications with Spark, talks about developing Spark applications on different IDEs and using different build tools.
Chapter 3, External Data Sources, covers how to read and write to various data sources.
Chapter 4, Spark SQL, takes you through the Spark SQL module that helps you to access the Spark functionality using the SQL interface.
Chapter 5, Spark Streaming, explores the Spark Streaming library to analyze data from real-time data sources, such as Kafka.
Chapter 6, Getting Started with Machine Learning Using MLlib, covers an introduction to machine learning and basic artifacts such as vectors and matrices.
Chapter 7, Supervised Learning with MLlib – Regression, walks through supervised learning when the outcome variable is continuous.
Chapter 8, Supervised Learning with MLlib – Classification, discusses supervised learning when the outcome variable is discrete.
Chapter 9, Unsupervised Learning with MLlib, covers unsupervised learning algorithms such as k-means.
Chapter 10, Recommender Systems, introduces building recommender systems using various techniques, such as ALS.
Chapter 11, Graph Processing Using GraphX, talks about various graph processing algorithms using GraphX.
Chapter 12, Optimizations and Performance Tuning, covers various optimizations on Apache Spark and performance tuning techniques.
You need the InfoObjects Big Data Sandbox software to proceed with the examples in this book. This software can be downloaded from http://www.infoobjects.com.
If you are a data engineer, an application developer, or a data scientist who would like to leverage the power of Apache Spark to get better insights from big data, then this is the book for you.
In this book, you will find several headings that appear frequently (Getting ready, How to do it, How it works, There's more, and See also).
To give clear instructions on how to complete a recipe, we use these sections as follows:
This section tells you what to expect in the recipe, and describes how to set up any software or any preliminary settings required for the recipe.
This section contains the steps required to follow the recipe.
This section usually consists of a detailed explanation of what happened in the previous section.
This section consists of additional information about the recipe in order to make you more knowledgeable about it.
This section provides helpful links to other useful information about the recipe.
Feedback from our readers is always welcome. Let us know what you think about this book: what you liked or disliked. Reader feedback is important to us as it helps us develop titles that you will really get the most out of.
To send us general feedback, simply e-mail <[email protected]>, and mention the book's title in the subject of your message.
If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.
Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.
We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from: https://www.packtpub.com/sites/default/files/downloads/7061OS_ColorImages.pdf.
Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.
To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.
Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.
Please contact us at <[email protected]> with a link to the suspected pirated material.
We appreciate your help in protecting our authors and our ability to bring you valuable content.
If you have a problem with any aspect of this book, you can contact us at <[email protected]>, and we will do our best to address the problem.
In this chapter, we will set up Spark and configure it. This chapter is divided into the following recipes:
Apache Spark is a general-purpose cluster computing system to process big data workloads. What sets Spark apart from its predecessors, such as MapReduce, is its speed, ease-of-use, and sophisticated analytics.
Apache Spark was originally developed at AMPLab, UC Berkeley, in 2009. It was made open source in 2010 under the BSD license and switched to the Apache 2.0 license in 2013. Toward the latter part of 2013, the creators of Spark founded Databricks to focus on Spark's development and future releases.
Talking about speed, Spark can achieve sub-second latency on big data workloads. To achieve such low latency, Spark makes use of memory for storage. In MapReduce, memory is primarily used for the actual computation; Spark uses memory both to compute and to store objects.
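As a quick illustration of this in-memory model, here is a minimal Scala sketch (not one of the book's recipes) that caches an RDD so that repeated actions are served from memory instead of re-reading the source; the master URL and input path are placeholders:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object InMemoryExample {
  def main(args: Array[String]) {
    // Local master and input path are placeholders for illustration
    val conf = new SparkConf().setAppName("InMemoryExample").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val lines = sc.textFile("hdfs://localhost:9000/user/hduser/words")
    // Keep the RDD in memory so that repeated actions avoid re-reading from disk
    lines.persist(StorageLevel.MEMORY_ONLY)
    println(lines.count()) // first action reads the file and caches the partitions
    println(lines.count()) // second action is served from memory
  }
}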
Spark also provides a unified runtime connecting to various big data storage sources, such as HDFS, Cassandra, HBase, and S3. It also provides a rich set of higher-level libraries for different big data compute tasks, such as machine learning, SQL processing, graph processing, and real-time streaming. These libraries make development faster and can be combined in an arbitrary fashion.
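To give a flavor of how these libraries share one runtime, the following sketch (again, illustrative only and assuming the Spark 1.x APIs used in this book) builds data with the core RDD API and queries it with Spark SQL on the same SparkContext; the data and names are made up:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object UnifiedRuntimeExample {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("UnifiedRuntimeExample").setMaster("local[*]")
    val sc = new SparkContext(conf)
    // Spark SQL runs on the same SparkContext as the core RDD API
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._
    // Build data with the core API, then query it with SQL
    val people = sc.parallelize(Seq(("alice", 30), ("bob", 25))).toDF("name", "age")
    people.registerTempTable("people")
    sqlContext.sql("SELECT name FROM people WHERE age > 28").show()
  }
}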
Though Spark is written in Scala, and this book only focuses on recipes in Scala, Spark also supports Java and Python.
Spark is an open source community project, and everyone uses the pure open source Apache distributions for deployments, unlike Hadoop, which has multiple distributions available with vendor enhancements.
The following figure shows the Spark ecosystem:
The Spark runtime runs on top of a variety of cluster managers, including YARN (Hadoop's compute framework), Mesos, and Spark's own cluster manager, called standalone mode. Tachyon is a memory-centric distributed file system that enables reliable file sharing at memory speed across cluster frameworks; in short, it is an off-heap, in-memory storage layer that helps share data across jobs and users. Mesos is a cluster manager that is evolving into a data center operating system. YARN, Hadoop's compute framework, has robust resource management features that Spark can seamlessly use.
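As a rough sketch of how an application targets one of these cluster managers (the hosts and ports below are placeholders, and the YARN and Mesos URLs follow Spark 1.x conventions), the master is set on the SparkConf:

import org.apache.spark.SparkConf

val standalone = new SparkConf().setMaster("spark://master-host:7077") // Spark standalone
val onMesos    = new SparkConf().setMaster("mesos://mesos-host:5050")  // Mesos
val onYarn     = new SparkConf().setMaster("yarn-client")              // YARN, client mode (Spark 1.x)
val localMode  = new SparkConf().setMaster("local[4]")                 // local run with 4 threads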