Apache Spark for Data Science Cookbook

Padma Priya Chitturi

Description

Over 90 insightful recipes for lightning-fast analytics with Apache Spark

About This Book

  • Use Apache Spark for data processing with these hands-on recipes
  • Implement end-to-end, large-scale data analysis better than ever before
  • Work with powerful libraries such as MLlib, SciPy, NumPy, and Pandas to gain insights from your data

Who This Book Is For

This book is for novice- and intermediate-level data science professionals and data analysts who want to solve data science problems with a distributed computing framework. Basic experience with data science implementation tasks is expected. Data science professionals looking to skill up and gain an edge in the field will find this book helpful.

What You Will Learn

  • Explore the topics of data mining, text mining, Natural Language Processing, information retrieval, and machine learning.
  • Solve real-world analytical problems with large data sets.
  • Address data science challenges with analytical tools on a distributed system like Spark (apt for iterative algorithms), which offers in-memory processing and more flexibility for data analysis at scale.
  • Get hands-on experience with algorithms such as classification, regression, and recommendation on real datasets using the Spark MLlib package.
  • Learn about numerical and scientific computing using NumPy and SciPy on Spark.
  • Use Predictive Model Markup Language (PMML) in Spark for statistical data mining models.

In Detail

Spark has emerged as the most promising big data analytics engine for data science professionals. The true power and value of Apache Spark lie in its ability to execute data science tasks with speed and accuracy. Spark's selling point is that it combines ETL, batch analytics, real-time stream analysis, machine learning, graph processing, and visualization. It lets you tackle the complexities that come with raw unstructured datasets with ease.

This guide will get you comfortable and confident performing data science tasks with Spark. You will learn about implementations including distributed deep learning, numerical computing, and scalable machine learning. You will be shown effective solutions to tricky data science problems using Spark's MLlib together with Python libraries such as Pandas, NumPy, and SciPy. These simple and efficient recipes will show you how to implement algorithms and optimize your work.

Style and approach

This book contains a comprehensive range of recipes designed to help you learn the fundamentals and tackle the difficulties of data science. It outlines practical steps for producing powerful insights into big data through a recipe-based approach.




Table of Contents

Apache Spark for Data Science Cookbook
Credits
About the Author
About the Reviewer
www.PacktPub.com
Why subscribe?
Customer Feedback
Preface
What this book covers
What you need for this book
Who this book is for
Sections
Getting ready
How to do it…
How it works…
There's more…
See also
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. Big Data Analytics with Spark
Introduction
Initializing SparkContext
Getting ready
How to do it…
How it works…
There's more…
See also
Working with Spark's Python and Scala shells
How to do it…
How it works…
There's more…
See also
Building standalone applications
Getting ready
How to do it…
How it works…
There's more…
See also
Working with the Spark programming model
How to do it…
How it works…
There's more…
See also
Working with pair RDDs
Getting ready
How to do it…
How it works…
There's more…
See also
Persisting RDDs
Getting ready
How to do it…
How it works…
There's more…
See also
Loading and saving data
Getting ready
How to do it…
How it works…
There's more…
See also
Creating broadcast variables and accumulators
Getting ready
How to do it…
How it works…
There's more…
See also
Submitting applications to a cluster
Getting ready
How to do it…
How it works…
There's more…
See also
Working with DataFrames
Getting ready
How to do it…
How it works…
There's more…
See also
Working with Spark Streaming
Getting ready
How to do it…
How it works…
There's more…
See also
2. Tricky Statistics with Spark
Introduction
Working with Pandas
Variable identification
Getting ready
How to do it…
How it works…
There's more…
See also
Sampling data
Getting ready
How to do it…
How it works…
There's more…
See also
Summary and descriptive statistics
Getting ready
How to do it…
How it works…
There's more…
See also
Generating frequency tables
Getting ready
How to do it…
How it works…
There's more…
See also
Installing Pandas on Linux
Getting ready
How to do it…
How it works…
There's more…
See also
Installing Pandas from source
Getting ready
How to do it…
How it works…
There's more…
See also
Using IPython with PySpark
Getting ready
How to do it…
How it works…
There's more…
See also
Creating Pandas DataFrames over Spark
Getting ready
How to do it…
How it works…
There's more…
See also
Splitting, slicing, sorting, filtering, and grouping DataFrames over Spark
Getting ready
How to do it…
How it works…
There's more…
See also
Implementing co-variance and correlation using Pandas
Getting ready
How to do it…
How it works…
There's more…
See also
Concatenating and merging operations over DataFrames
Getting ready
How to do it…
How it works…
There's more…
See also
Complex operations over DataFrames
Getting ready
How to do it…
How it works…
There's more…
See also
Sparkling Pandas
Getting ready
How to do it…
How it works…
There's more…
See also
3. Data Analysis with Spark
Introduction
Univariate analysis
Getting ready
How to do it…
How it works…
There's more…
See also
Bivariate analysis
Getting ready
How to do it…
How it works…
There's more…
See also
Missing value treatment
Getting ready
How to do it…
How it works…
There's more…
See also
Outlier detection
Getting ready
How to do it…
How it works…
There's more…
See also
Use case - analyzing the MovieLens dataset
Getting ready
How to do it…
How it works…
There's more…
See also
Use case - analyzing the Uber dataset
Getting ready
How to do it…
How it works…
There's more…
See also
4. Clustering, Classification, and Regression
Introduction
Supervised learning
Unsupervised learning
Applying regression analysis for sales data
Variable identification
Getting ready
How to do it…
How it works…
There's more…
See also
Data exploration
Getting ready
How to do it…
How it works…
There's more…
See also
Feature engineering
Getting ready
How to do it…
How it works…
There's more…
See also
Applying linear regression
Getting ready
How to do it…
How it works…
There's more…
See also
Applying logistic regression on bank marketing data
Variable identification
Getting ready
How to do it…
How it works…
There's more…
See also
Data exploration
Getting ready
How to do it…
How it works…
There's more…
See also
Feature engineering
Getting ready
How to do it…
How it works…
There's more…
See also
Applying logistic regression
Getting ready
How to do it…
How it works…
There's more…
See also
Real-time intrusion detection using streaming k-means
Variable identification
Getting ready
How to do it…
How it works…
There's more…
See also
Simulating real-time data
Getting ready
How to do it…
How it works…
There's more…
See also
Applying streaming k-means
Getting ready
How to do it…
How it works…
There's more…
See also
5. Working with Spark MLlib
Introduction
Working with Spark ML pipelines
Implementing Naive Bayes' classification
Getting ready
How to do it…
How it works…
There's more…
See also
Implementing decision trees
Getting ready
How to do it…
How it works…
There's more…
See also
Building a recommendation system
Getting ready
How to do it…
How it works…
There's more…
See also
Implementing logistic regression using Spark ML pipelines
Getting ready
How to do it…
How it works…
There's more…
See also
6. NLP with Spark
Introduction
Installing NLTK on Linux
Getting ready
How to do it…
How it works…
There's more…
See also
Installing Anaconda on Linux
Getting ready
How to do it…
How it works…
There's more…
See also
Anaconda for cluster management
Getting ready
How to do it…
How it works…
There's more…
See also
POS tagging with PySpark on an Anaconda cluster
Getting ready
How to do it…
How it works…
There's more…
See also
NER with IPython over Spark
Getting ready
How to do it…
How it works…
There's more…
See also
Implementing OpenNLP - chunker over Spark
Getting ready
How to do it…
How it works…
There's more…
See also
Implementing OpenNLP - sentence detector over Spark
Getting ready
How to do it…
How it works…
There's more…
See also
Implementing Stanford NLP - lemmatization over Spark
Getting ready
How to do it…
How it works…
There's more…
See also
Implementing sentiment analysis using Stanford NLP over Spark
Getting ready
How to do it…
How it works…
There's more…
See also
7. Working with Sparkling Water - H2O
Introduction
Features
Working with H2O on Spark
Getting ready
How to do it…
How it works…
There's more…
See also
Implementing k-means using H2O over Spark
Getting ready
How to do it…
How it works…
There's more…
See also
Implementing spam detection with Sparkling Water
Getting ready
How to do it…
How it works…
There's more…
See also
Deep learning with airlines and weather data
Getting ready
How to do it…
How it works…
There's more…
See also
Implementing a crime detection application
Getting ready
How to do it…
How it works…
There's more…
See also
Running SVM with H2O over Spark
Getting ready
How to do it…
How it works…
There's more…
See also
8. Data Visualization with Spark
Introduction
Visualization using Zeppelin
Getting ready
How to do it…
Installing Zeppelin
Customizing Zeppelin's server and websocket port
Visualizing data on HDFS - parameterizing inputs
Running custom functions
Adding external dependencies to Zeppelin
Pointing to an external Spark Cluster
How to do it…
How it works…
There's more…
See also
Creating scatter plots with Bokeh-Scala
Getting ready
How to do it…
How it works…
There's more…
See also
Creating a time series MultiPlot with Bokeh-Scala
Getting ready
How to do it…
How it works…
There's more…
See also
Creating plots with the lightning visualization server
Getting ready
How to do it…
How it works…
There's more…
See also
Visualizing machine learning models with Databricks notebook
Getting ready
How to do it…
How it works…
There's more…
See also
9. Deep Learning on Spark
Introduction
Installing CaffeOnSpark
Getting ready
How to do it…
How it works…
There's more…
See also
Working with CaffeOnSpark
Getting ready
How to do it…
How it works…
There's more…
See also
Running a feed-forward neural network with DeepLearning4j over Spark
Getting ready
How to do it…
How it works…
There's more…
See also
Running an RBM with DeepLearning4j over Spark
Getting ready
How to do it…
How it works…
There's more…
See also
Running a CNN for learning MNIST with DeepLearning4j over Spark
Getting ready
How to do it…
How it works…
There's more…
See also
Installing TensorFlow
Getting ready
How to do it…
How it works…
There's more…
See also
Working with Spark TensorFlow
Getting ready
How to do it…
How it works…
There's more…
See also
10. Working with SparkR
Introduction
Installing R
Getting ready
How to do it…
How it works…
There's more…
See also
Interactive analysis with the SparkR shell
Getting ready
How to do it…
How it works…
There's more…
See also
Creating a SparkR standalone application from RStudio
Getting ready
How to do it…
How it works…
There's more…
See also
Creating SparkR DataFrames
Getting ready
How to do it…
How it works…
There's more…
See also
SparkR DataFrame operations
Getting ready
How to do it…
How it works…
There's more…
See also
Applying user-defined functions in SparkR
Getting ready
How to do it…
How it works…
There's more…
See also
Running SQL queries from SparkR and caching DataFrames
Getting ready
How to do it…
How it works…
There's more…
See also
Machine learning with SparkR
Getting ready
How to do it…
How it works…
There's more…
See also

Apache Spark for Data Science Cookbook

Apache Spark for Data Science Cookbook

Copyright © 2016 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: December 2016

Production reference: 1161216

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham 

B3 2PB, UK.

ISBN 978-1-78588-010-0

www.packtpub.com

Credits

Author

Padma Priya Chitturi

Copy Editor

Safis Editing

Reviewer

Roberto Corizzo

Project Coordinator

Shweta H Birwatkar 

Commissioning Editor

Akram Hussain

Proofreader

Safis Editing

Acquisition Editors

Vinay Argekar

Manish Nainani

Indexer

Mariammal Chettiyar

Content Development Editor

Sumeet Sawant

Graphics

Disha Haria

Technical Editor

Deepti Tuscano

Production Coordinator

Arvindkumar Gupta

About the Author

Padma Priya Chitturi is Analytics Lead at Fractal Analytics Pvt Ltd and has over five years of experience in big data processing. Currently, she is part of capability development at Fractal and is responsible for developing solutions to analytical problems across multiple business domains at large scale. Prior to this, she worked at Amadeus Software Labs on an airline product, a real-time processing platform serving one million user requests per second. She has worked on realizing large-scale deep networks (Jeffrey Dean's work at Google Brain) for image classification on the big data platform Spark. She works closely with big data technologies such as Spark, Storm, Cassandra, and Hadoop. She was an open source contributor to Apache Storm.

First, I would like to thank the Packt Publishing team for giving me the opportunity to take part in this exciting journey. I would also like to express my special thanks and gratitude to my family, friends, and colleagues, who have been very supportive and helped me finish this project on time.

About the Reviewer

Roberto Corizzo is a PhD student at the Department of Computer Science, University of Bari, Italy. His research interests include big data analytics, data mining, and predictive modeling techniques for sensor networks. He was a technical reviewer for Packt's Learning Hadoop 2 and Learning Python Web Penetration Testing video courses.

www.PacktPub.com

For support files and downloads related to your book, please visit www.PacktPub.com.

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

https://www.packtpub.com/mapt

Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.

Why subscribe?

  • Fully searchable across every book published by Packt
  • Copy and paste, print, and bookmark content
  • On demand and accessible via a web browser

Customer Feedback

Thank you for purchasing this Packt book. We take our commitment to improving our content and products to meet your needs seriously – that's why your feedback is so valuable. Whatever your feelings about your purchase, please consider leaving a review on this book's Amazon page. Not only will this help us; more importantly, it will also help others in the community make an informed decision about the resources they invest in to learn. You can also review for us on a regular basis by joining our reviewers' club. If you're interested in joining, or would like to learn more about the benefits we offer, please contact us: [email protected].

Preface

In recent years, the volume of data being collected, stored, and analyzed has exploded, particularly in relation to activity on the Web and mobile devices, as well as data from the physical world collected via sensor networks. While large-scale data storage, processing, analysis, and modeling were previously the domain of the largest institutions, such as Google, Yahoo!, Facebook, and Twitter, increasingly many organizations are faced with the challenge of handling massive amounts of data.

With the advent of big data, extracting knowledge from large, heterogeneous, and noisy datasets requires not only powerful computing resources, but also the programming abstractions to use them effectively. The abstractions that emerged in the last decade blend ideas from parallel databases, distributed systems, and programming languages to create a new class of scalable data analytics platforms that form the foundation for data science at realistic scales.

The objective of this book is to give readers a flavor of the challenges in data science and to address them with a variety of analytical tools on a distributed system such as Spark (well suited to iterative algorithms), which offers in-memory processing and more flexibility for data analysis at scale. This book introduces readers to the fundamentals of Spark and helps them learn the concepts through code examples. It also briefly covers data mining, text mining, NLP, machine learning, and so on. Readers learn how to solve real-world analytical problems with large datasets, through a practical approach and working code for analytical tools that leverage the features of Spark.

What this book covers

Chapter 1, Big Data Analytics with Spark, introduces how Scala, Python, and R can be used for data analysis. It details the Spark programming model and its API, shows how to install and set up a development environment for the Spark framework, and explains how to run jobs in distributed mode. It also covers working with DataFrames and the Spark Streaming computation model.

Chapter 2, Tricky Statistics with Spark, shows how to apply various statistical measures, such as sampling data, generating frequency tables, and computing summary and descriptive statistics, on large datasets using Spark and Pandas.
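As a taste of the recipes in this chapter, here is a minimal sketch of building a frequency table with Spark and handing the aggregated result to Pandas. The input path is hypothetical, and the sketch assumes PySpark and Pandas are installed:

    from pyspark import SparkContext
    import pandas as pd

    sc = SparkContext(appName="FrequencyTableSketch")

    # Hypothetical input: one categorical value per line (e.g. a country code).
    lines = sc.textFile("hdfs:///data/categories.txt")

    # Classic map/reduce frequency count, computed in parallel across the cluster.
    freq = lines.map(lambda value: (value.strip(), 1)) \
                .reduceByKey(lambda a, b: a + b)

    # The aggregated table is small, so collect it and analyze it locally in Pandas.
    table = pd.DataFrame(freq.collect(), columns=["value", "count"])
    print(table.sort_values("count", ascending=False).head())

The heavy lifting (the per-value counting) happens on the cluster; only the compact summary crosses over to Pandas on the driver.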

Chapter 3, Data Analysis with Spark, details how to apply common data exploration and preparation techniques, such as univariate analysis, bivariate analysis, missing-value treatment, outlier detection, and variable transformation, using Spark.
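For instance, a simple three-sigma outlier check, of the kind explored in this chapter, might be sketched as follows (the input path is hypothetical):

    from pyspark import SparkContext

    sc = SparkContext(appName="OutlierSketch")

    # Hypothetical numeric column loaded as an RDD of floats.
    values = sc.textFile("hdfs:///data/measurements.txt").map(float)

    # StatCounter gives count, mean, stdev, and more in a single pass.
    stats = values.stats()
    lo = stats.mean() - 3 * stats.stdev()
    hi = stats.mean() + 3 * stats.stdev()

    # Flag points more than three standard deviations from the mean.
    outliers = values.filter(lambda v: v < lo or v > hi)
    print("Potential outliers:", outliers.take(10))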

Chapter 4, Clustering, Classification, and Regression, deals with creating models for regression, classification, and clustering, and shows how to apply standard performance-evaluation methodologies to the machine learning models built.

Chapter 5, Working with Spark MLlib, provides an overview of Spark MLlib and ML pipelines and presents examples of implementing Naive Bayes classification, decision trees, and recommendation systems.
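To illustrate the style of these recipes, here is a minimal sketch of training a Naive Bayes classifier with MLlib's RDD-based API; the tiny inline dataset is invented purely for illustration:

    from pyspark import SparkContext
    from pyspark.mllib.classification import NaiveBayes
    from pyspark.mllib.linalg import Vectors
    from pyspark.mllib.regression import LabeledPoint

    sc = SparkContext(appName="NaiveBayesSketch")

    # Toy labeled training set; real recipes load labeled data from storage.
    data = sc.parallelize([
        LabeledPoint(0.0, Vectors.dense([1.0, 0.0])),
        LabeledPoint(0.0, Vectors.dense([2.0, 0.0])),
        LabeledPoint(1.0, Vectors.dense([0.0, 1.0])),
        LabeledPoint(1.0, Vectors.dense([0.0, 2.0])),
    ])

    # lambda_ is the Laplace smoothing parameter.
    model = NaiveBayes.train(data, lambda_=1.0)
    print(model.predict(Vectors.dense([0.0, 1.5])))  # expected label: 1.0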

Chapter 6, NLP with Spark, shows how to install NLTK and Anaconda, and how to apply NLP tasks such as POS tagging, named entity recognition, chunking, sentence detection, and lemmatization using OpenNLP and Stanford NLP over Spark.
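As a flavor of what such a recipe looks like, here is a minimal sketch of POS tagging with NLTK inside a PySpark job; the sentences are invented, and the sketch assumes NLTK and its tagger data are available on every worker:

    from pyspark import SparkContext

    sc = SparkContext(appName="POSTaggingSketch")

    # Two toy sentences; a real recipe would read a corpus from HDFS.
    sentences = sc.parallelize([
        "Spark distributes NLP workloads across a cluster.",
        "NLTK tags each token with its part of speech.",
    ])

    def tag(sentence):
        # Assumes nltk plus its 'punkt' and 'averaged_perceptron_tagger'
        # data are installed on every worker node (e.g. via Anaconda).
        import nltk
        return nltk.pos_tag(nltk.word_tokenize(sentence))

    print(sentences.map(tag).collect())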

Chapter 7, Working with Sparkling Water - H2O, details how to integrate H2O with Spark, shows how to apply algorithms such as k-means, deep learning, and SVM, and walks through developing spam detection and crime detection applications with Sparkling Water.

Chapter 8, Data Visualization with Spark, shows the integration of widely used visualization tools such as Zeppelin, the Lightning visualization server, and the actively developed Scala bindings for Bokeh (Bokeh-Scala) for visualizing large datasets.

Chapter 9, Deep Learning on Spark, shows how to implement deep learning algorithms such as RBMs, CNNs for learning MNIST, and feed-forward neural networks with DeepLearning4j and TensorFlow on Spark.

Chapter 10, Working with SparkR, provides examples of creating distributed DataFrames in R, covers the various operations that can be applied in SparkR, and details how to apply user-defined functions, SQL queries, and machine learning in SparkR.

What you need for this book

Throughout this book, we assume that you have some basic experience with programming in Scala, Java, or Python and have some basic knowledge of machine learning, statistics, and data analysis.

Who this book is for

This book is intended for entry-level to intermediate data scientists, data analysts, engineers, and practitioners who want to get acquainted with solving numerous data science problems using a distributed computing framework such as Spark. Readers are expected to have knowledge of statistics and data science tools such as R and Pandas, along with an understanding of distributed systems (some exposure to Hadoop).

Sections

In this book, you will find several headings that appear frequently (Getting ready, How to do it, How it works, There's more, and See also).

To give clear instructions on how to complete a recipe, we use these sections as follows:

Getting ready

This section tells you what to expect in the recipe, and describes how to set up any software or any preliminary settings required for the recipe.

How to do it…

This section contains the steps required to follow the recipe.

How it works…

This section usually consists of a detailed explanation of what happened in the previous section.

There's more…

This section consists of additional information about the recipe in order to make the reader more knowledgeable about the recipe.

See also

This section provides helpful links to other useful information for the recipe.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book - what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.

To send us general feedback, simply e-mail [email protected], and mention the book's title in the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

You can download the code files by following these steps:

1. Log in or register to our website using your e-mail address and password.
2. Hover the mouse pointer on the SUPPORT tab at the top.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box.
5. Select the book for which you're looking to download the code files.
6. Choose from the drop-down menu where you purchased this book from.
7. Click on Code Download.

You can also download the code files by clicking on the Code Files button on the book's webpage at the Packt Publishing website. This page can be accessed by entering the book's name in the Search box. Please note that you need to be logged in to your Packt account.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

  • WinRAR / 7-Zip for Windows
  • Zipeg / iZip / UnRarX for Mac
  • 7-Zip / PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/ChitturiPadma/SparkforDataScienceCookbook. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books - maybe a mistake in the text or the code - we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.

To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at [email protected] with a link to the suspected pirated material.

We appreciate your help in protecting our authors and our ability to bring you valuable content.

Questions

If you have a problem with any aspect of this book, you can contact us at [email protected], and we will do our best to address the problem.

Chapter 1. Big Data Analytics with Spark

In this chapter, we will cover the components of Spark. You will learn about them through the following recipes:

  • Initializing SparkContext
  • Working with Spark's Python and Scala shells
  • Building standalone applications
  • Working with the Spark programming model
  • Working with pair RDDs
  • Persisting RDDs
  • Loading and saving data
  • Creating broadcast variables and accumulators
  • Submitting applications to a cluster
  • Working with DataFrames
  • Working with Spark Streaming

Introduction

Apache Spark is a general-purpose distributed computing engine for large-scale data processing. It started as an open source initiative at UC Berkeley's AMPLab and was donated to the Apache Software Foundation, where it is now a top-level project. Apache Spark offers a data abstraction called Resilient Distributed Datasets (RDDs) for analyzing data in parallel on top of a cluster of resources. The Apache Spark framework is an alternative to Hadoop MapReduce: it can be up to 100x faster than MapReduce and offers rich APIs for iterative and expressive data processing. The project is written in Scala and offers client APIs in Scala, Java, Python, and R.
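To make the RDD abstraction concrete before diving into the recipes, here is a minimal sketch of initializing a SparkContext and running a parallel computation in PySpark; the application name and the local master URL are illustrative choices, not requirements:

    from pyspark import SparkConf, SparkContext

    # Configure and create the entry point to Spark. "local[*]" runs on all
    # local cores; on a real cluster this would be a spark:// or YARN master.
    conf = SparkConf().setAppName("RDDIntro").setMaster("local[*]")
    sc = SparkContext(conf=conf)

    # Distribute a local collection as an RDD and process it in parallel.
    numbers = sc.parallelize(range(1, 101))
    squares = numbers.map(lambda x: x * x)       # transformation (lazy)
    total = squares.reduce(lambda a, b: a + b)   # action (triggers execution)

    print("Sum of squares 1..100:", total)
    sc.stop()

Transformations such as map build up a lineage of operations lazily; nothing executes on the cluster until an action such as reduce or collect is invoked.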