Unlock the complexities of machine learning algorithms in Spark to generate useful data insights through this data analysis tutorial
Are you a developer with a background in machine learning and statistics who is feeling limited by the current slow and “small data” machine learning tools? Then this is the book for you! In this book, you will create scalable machine learning applications to power a modern data-driven business using Spark. We assume that you already know the machine learning concepts and algorithms and have Spark up and running (whether on a cluster or locally) and have a basic knowledge of the various libraries contained in Spark.
The purpose of machine learning is to build systems that learn from data. Being able to understand trends and patterns in complex data is critical to success; it is one of the key strategies for unlocking growth in today's challenging marketplace. With the meteoric rise of machine learning, developers are now keen to find out how they can make their Spark applications smarter.
This book shows you how to transform data into actionable knowledge. It commences by defining the machine learning primitives provided by the MLlib and H2O libraries. You will learn how to use binary classification to detect the Higgs-Boson particle in the huge amounts of data produced by the CERN particle collider, and how to classify daily health activities using ensemble methods for multi-class classification.
Next, you will solve a typical regression problem involving flight delay predictions and write sophisticated Spark pipelines. You will analyze Twitter data with the help of the doc2vec algorithm and K-means clustering. Finally, you will build different pattern mining models using MLlib, perform complex manipulation of DataFrames using Spark and Spark SQL, and deploy your application in a Spark Streaming environment.
This book takes a practical approach to help you get to grips with using Spark for analytics and to implement machine learning algorithms. We'll teach you about advanced applications of machine learning through illustrative examples. These examples will equip you to harness the potential of machine learning, through Spark, in a variety of enterprise-grade systems.
BIRMINGHAM - MUMBAI
Copyright © 2017 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, nor its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: August 2017
Production reference: 1290817
ISBN 978-1-78528-345-1
www.packtpub.com
Authors
Alex Tellez
Max Pumperla
Michal Malohlava
Copy Editor
Muktikant Garimella
Reviewer
Dipanjan Deb
Project Coordinator
Ulhas Kambali
Commissioning Editor
Veena Pagare
Proofreader
Safis Editing
Acquisition Editor
Larissa Pinto
Indexer
Rekha Nair
Content Development Editor
Nikhil Borkar
Graphics
Jason Monteiro
Technical Editor
Diwakar Shukla
Production Coordinator
Melwyn Dsa
Alex Tellez is a life-long data hacker/enthusiast with a passion for data science and its application to business problems. He has a wealth of experience working across multiple industries, including banking, health care, online dating, human resources, and online gaming. Alex has also given multiple talks at various AI/machine learning conferences, in addition to lectures at universities about neural networks. When he’s not neck-deep in a textbook, Alex enjoys spending time with family, riding bikes, and utilizing machine learning to feed his French wine curiosity!
Max Pumperla is a data scientist and engineer specializing in deep learning and its applications. He currently works as a deep learning engineer at Skymind and is a co-founder of aetros.com. Max is the author and maintainer of several Python packages, including elephas, a distributed deep learning library using Spark. His open source footprint includes contributions to many popular machine learning libraries, such as keras, deeplearning4j, and hyperopt. He holds a PhD in algebraic geometry from the University of Hamburg.
Michal Malohlava, creator of Sparkling Water, is a geek and developer; a Java, Linux, and programming languages enthusiast who has been developing software for over 10 years. He obtained his PhD from Charles University in Prague in 2012 and completed his postdoctoral studies at Purdue University.
During his studies, he was interested in the construction of not only distributed but also embedded and real-time, component-based systems, using model-driven methods and domain-specific languages. He participated in the design and development of various systems, including SOFA and Fractal component systems and the jPapabench control system.
Now, his main interest is big data computation. He participates in the development of the H2O platform for advanced big data math and computation, and its embedding into Spark engine, published as a project called Sparkling Water.
Dipanjan Deb is an experienced analytics professional with over 17 years of cumulative experience in machine/statistical learning, data mining, and predictive analytics across the finance, healthcare, automotive, CPG, energy, and human resource domains. He is highly proficient in developing cutting-edge analytic solutions using open source and commercial software to integrate multiple systems in order to provide massively parallelized and large-scale optimization.
For support files and downloads related to your book, please visit www.PacktPub.com. Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and, as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details. At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.
https://www.packtpub.com/mapt
Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.
Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser
Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial process. To help us improve, please leave us an honest review on this book's Amazon page at https://www.amazon.com/dp/1785283456.
If you'd like to join our team of regular reviewers, you can e-mail us at [email protected]. We award our regular reviewers with free eBooks and videos in exchange for their valuable feedback. Help us be relentless in improving our products!
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Downloading the color images of this book
Errata
Piracy
Questions
Introduction to Large-Scale Machine Learning and Spark
Data science
The sexiest role of the 21st century – data scientist?
A day in the life of a data scientist
Working with big data
The machine learning algorithm using a distributed environment
Splitting of data into multiple machines
From Hadoop MapReduce to Spark
What is Databricks?
Inside the box
Introducing H2O.ai
Design of Sparkling Water
What's the difference between H2O and Spark's MLlib?
Data munging
Data science - an iterative process
Summary
Detecting Dark Matter - The Higgs-Boson Particle
Type I versus type II error
Finding the Higgs-Boson particle
The LHC and data creation
The theory behind the Higgs-Boson
Measuring for the Higgs-Boson
The dataset
Spark start and data load
Labeled point vector
Data caching
Creating a training and testing set
What about cross-validation?
Our first model – decision tree
Gini versus Entropy
Next model – tree ensembles
Random forest model
Grid search
Gradient boosting machine
Last model - H2O deep learning
Build a 3-layer DNN
Adding more layers
Building models and inspecting results
Summary
Ensemble Methods for Multi-Class Classification
Data
Modeling goal
Challenges
Machine learning workflow
Starting Spark shell
Exploring data
Missing data
Summary of missing value analysis
Data unification
Missing values
Categorical values
Final transformation
Modelling data with Random Forest
Building a classification model using Spark RandomForest
Classification model evaluation
Spark model metrics
Building a classification model using H2O RandomForest
Summary
Predicting Movie Reviews Using NLP and Spark Streaming
NLP - a brief primer
The dataset
Dataset preparation
Feature extraction
Feature extraction method - bag-of-words model
Text tokenization
Declaring our stopwords list
Stemming and lemmatization
Featurization - feature hashing
Term Frequency - Inverse Document Frequency (TF-IDF) weighting scheme
Let's do some (model) training!
Spark decision tree model
Spark Naive Bayes model
Spark random forest model
Spark GBM model
Super-learner model
Super learner
Composing all transformations together
Using the super-learner model
Summary
Word2vec for Prediction and Clustering
Motivation of word vectors
Word2vec explained
What is a word vector?
The CBOW model
The skip-gram model
Fun with word vectors
Cosine similarity
Doc2vec explained
The distributed-memory model
The distributed bag-of-words model
Applying word2vec and exploring our data with vectors
Creating document vectors
Supervised learning task
Summary
Extracting Patterns from Clickstream Data
Frequent pattern mining
Pattern mining terminology
Frequent pattern mining problem
The association rule mining problem
The sequential pattern mining problem
Pattern mining with Spark MLlib
Frequent pattern mining with FP-growth
Association rule mining
Sequential pattern mining with prefix span
Pattern mining on MSNBC clickstream data
Deploying a pattern mining application
The Spark Streaming module
Summary
Graph Analytics with GraphX
Basic graph theory
Graphs
Directed and undirected graphs
Order and degree
Directed acyclic graphs
Connected components
Trees
Multigraphs
Property graphs
GraphX distributed graph processing engine
Graph representation in GraphX
Graph properties and operations
Building and loading graphs
Visualizing graphs with Gephi
Gephi
Creating GEXF files from GraphX graphs
Advanced graph processing
Aggregating messages
Pregel
GraphFrames
Graph algorithms and applications
Clustering
Vertex importance
GraphX in context
Summary
Lending Club Loan Prediction
Motivation
Goal
Data
Data dictionary
Preparation of the environment
Data load
Exploration – data analysis
Basic clean up
Useless columns
String columns
Loan progress columns
Categorical columns
Text columns
Missing data
Prediction targets
Loan status model
Base model
The emp_title column transformation
The desc column transformation
Interest rate model
Using models for scoring
Model deployment
Stream creation
Stream transformation
Stream output
Summary
Big data – that was our motivation to explore the world of machine learning with Spark a couple of years ago. We wanted to build machine learning applications that would leverage models trained on large amounts of data, but the beginning was not easy. Spark was still evolving; it did not contain a powerful machine learning library, and we were still trying to figure out what it means to build a machine learning application.
But, step by step, we started to explore different corners of the Spark ecosystem and followed Spark's evolution. For us, the crucial part was a powerful machine learning library that would provide the kind of features the R or Python libraries did. This was an easy task for us, since we are actively involved in the development of H2O's machine learning library and its branch called Sparkling Water, which enables the use of the H2O library from Spark applications. However, model training is just the tip of the machine learning iceberg. We still had to explore how to connect Sparkling Water to Spark RDDs, DataFrames, and DataSets, how to connect Spark to different data sources and read data, and how to export models and reuse them in different applications.
During our journey, Spark evolved as well. Originally a pure Scala project, it started to expose Python and, later, R interfaces. It also took its API on a long journey from low-level RDDs to the high-level DataSet, exposing a SQL-like interface. Furthermore, Spark introduced the concept of machine learning pipelines, adopted from Python's scikit-learn library. All these improvements made Spark a great tool for data transformation and data processing.
Based on this experience, we decided to share our knowledge with the rest of the world via this book. Its intention is simple: to demonstrate different aspects of building Spark machine learning applications through examples, and to show how to use not only the latest Spark features but also low-level Spark interfaces. Along the way, we also figured out many tricks and shortcuts connected not only to Spark but also to the process of developing machine learning applications and organizing source code. All of them are shared in this book to help keep readers from making the mistakes we made.
The book adopts Scala as the main implementation language for our examples. It was a hard decision between Python and Scala, but in the end Scala won. There were two main reasons to use Scala: it provides the most mature Spark interface, and most applications deployed in production use Scala, mostly because of its performance benefits on the JVM. Moreover, all the source code shown in this book is also available online.
We hope you enjoy our book and that it helps you navigate the Spark world and the development of machine learning applications.
Chapter 1, Introduction to Large-Scale Machine Learning, invites readers into the land of machine learning and big data, introduces historical paradigms, and describes contemporary tools, including Apache Spark and H2O.
Chapter 2, Detecting Dark Matter: The Higgs-Boson Particle, focuses on the training and evaluation of binomial models.
Chapter 3, Ensemble Methods for Multi-Class Classification, checks into a gym and tries to predict human activities based on data collected from body sensors.
Chapter 4, Predicting Movie Reviews Using NLP, introduces the problem of natural language processing with Spark and demonstrates its power on the sentiment analysis of movie reviews.
Chapter 5, Online Learning with Word2Vec, goes into detail about contemporary NLP techniques.
Chapter 6, Extracting Patterns from Clickstream Data, introduces the basics of frequent pattern mining and three algorithms available in Spark MLlib, before deploying one of these algorithms in a Spark Streaming application.
Chapter 7, Graph Analytics with GraphX, familiarizes the reader with the basic concepts of graphs and graph analytics, explains the core functionality of Spark GraphX, and introduces graph algorithms such as PageRank.
Chapter 8, Lending Club Loan Prediction, combines all the tricks introduced in the previous chapters into end-to-end examples, including data processing, model search and training, and model deployment as a Spark Streaming application.
Code samples provided in this book use Apache Spark 2.1 and its Scala API. Furthermore, we utilize the Sparkling Water package to access the H2O machine learning library. In each chapter, we show how to start Spark using spark-shell, and also how to download the data necessary to run the code.
In summary, the basic requirements to run the code provided in this book include:
Java 8
Spark 2.1
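As a minimal illustration of the launch step mentioned above (the paths and the Sparkling Water version number are placeholders, not taken from the book), a Spark 2.1 shell with the Sparkling Water package can be started roughly as follows:

bash> export SPARK_HOME="/path/to/spark-2.1"
bash> $SPARK_HOME/bin/spark-shell --master "local[*]" --packages "ai.h2o:sparkling-water-package_2.11:2.1.12"

Each chapter states the exact command and data downloads needed for its examples.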
Are you a developer with a background in machine learning and statistics who is feeling limited by the current slow and small data machine learning tools? Then this is the book for you! In this book, you will create scalable machine learning applications to power a modern data-driven business using Spark. We assume that you already know about machine learning concepts and algorithms and have Spark up and running (whether on a cluster or locally), as well as having basic knowledge of the various libraries contained in Spark.
Feedback from our readers is always welcome. Let us know what you think about this book, what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of. To send us general feedback, simply email [email protected], and mention the book's title in the subject of your message. If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.
Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.
You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files emailed directly to you. You can download the code files by following these steps:
1. Log in or register to our website using your email address and password.
2. Hover the mouse pointer on the SUPPORT tab at the top.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box.
5. Select the book for which you're looking to download the code files.
6. Choose from the drop-down menu where you purchased this book from.
7. Click on Code Download.
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
WinRAR / 7-Zip for Windows
Zipeg / iZip / UnRarX for Mac
7-Zip / PeaZip for Linux
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Mastering-Machine-Learning-with-Spark-2.x. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from https://www.packtpub.com/sites/default/files/downloads/MasteringMachineLearningwithSpark2.x_ColorImages.pdf.
Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books, maybe a mistake in the text or the code, we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to the list of existing errata under the Errata section of that title. To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.
Piracy of copyrighted material on the internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the internet, please provide us with the location address or website name immediately so that we can pursue a remedy. Please contact us at [email protected] with a link to the suspected pirated material. We appreciate your help in protecting our authors and our ability to bring you valuable content.
If you have a problem with any aspect of this book, you can contact us at [email protected], and we will do our best to address the problem.
--Peter Sondergaard, Gartner Research
By 2018, it is estimated that companies will spend $114 billion on big data-related projects, an increase of roughly 300%, compared to 2013 (https://www.capgemini-consulting.com/resource-file-access/resource/pdf/big_data_pov_03-02-15.pdf). Much of this increase in expenditure is due to how much data is being created and how we are better able to store such data by leveraging distributed filesystems such as Hadoop.
However, collecting the data is only half the battle; the other half involves data extraction, transformation, and loading into a computation system, which leverages the power of modern computers to apply various mathematical methods in order to learn more about the data and its patterns and to extract useful information for making relevant decisions. The entire data workflow has been boosted in the last few years not only by increasing computation power and easily accessible, scalable cloud services (for example, Amazon AWS, Microsoft Azure, and Heroku), but also by a number of tools and libraries that help to easily manage, control, and scale infrastructure and build applications. Such growth in computation power also makes it possible to process larger amounts of data and to apply algorithms that were impossible to apply earlier. Finally, various computation-expensive statistical and machine learning algorithms have started to help extract nuggets of information from data.
One of the first well-adopted big data technologies was Hadoop, which allows for MapReduce computation by saving intermediate results on disk. However, it still lacks proper big data tools for information extraction. Nevertheless, Hadoop was just the beginning. With the growing size of machine memory, new in-memory computation frameworks appeared, and they also started to provide basic support for conducting data analysis and modeling, for example, SystemML or Spark ML for Spark and FlinkML for Flink. These frameworks represent only the tip of the iceberg; there is a lot more in the big data ecosystem, and it is constantly evolving, since the volume of data is constantly growing, demanding new big data algorithms and processing methods. For example, the Internet of Things (IoT) represents a new domain that produces huge amounts of streaming data from various sources (for example, home security systems, Alexa Echo, or vital-sign sensors) and brings not only enormous potential to mine useful information from data, but also demands for new kinds of data processing and modeling methods.
Nevertheless, in this chapter, we will start from the beginning and explain the following topics:
Basic working tasks of data scientists
Aspect of big data computation in distributed environment
The big data ecosystem
Spark and its machine learning support
Finding a uniform definition of data science, however, is akin to tasting wine and comparing flavor profiles among friends—everyone has their own definition and no one description is more accurate than the other. At its core, however, data science is the art of asking intelligent questions about data and receiving intelligent answers that matter to key stakeholders. Unfortunately, the opposite also holds true—ask lousy questions of the data and get lousy answers! Therefore, careful formulation of the question is the key for extracting valuable insights from your data. For this reason, companies are now hiring data scientists to help formulate and ask these questions.
At first, it's easy to paint a stereotypical picture of what a typical data scientist looks like: t-shirt, sweatpants, thick-rimmed glasses, and debugging a chunk of code in IntelliJ... you get the idea. Aesthetics aside, what are some of the traits of a data scientist? One of our favorite posters describing this role is shown here in the following diagram:
Math, statistics, and general knowledge of computer science are a given, but one pitfall that we see among practitioners has to do with understanding the business problem, which goes back to asking intelligent questions of the data. It cannot be emphasized enough: asking more intelligent questions of the data is a function of the data scientist's understanding of the business problem and the limitations of the data; without this fundamental understanding, even the most intelligent algorithm would be unable to come to solid conclusions based on a wobbly foundation.
This will probably come as a shock to some of you: being a data scientist is more than reading academic papers, researching new tools, and model building until the wee hours of the morning, fueled on espresso; in fact, this is only a small percentage of the time that a data scientist gets to truly play (the espresso part, however, is 100% true for everyone)! Most of the day, however, is spent in meetings, gaining a better understanding of the business problem(s), crunching the data to learn its limitations (take heart, this book will expose you to a ton of different feature engineering and feature extraction tasks), and learning how best to present the findings to non data-sciencey people. This is where the true sausage-making process takes place, and the best data scientists are the ones who relish this process because they are gaining more understanding of the requirements and benchmarks for success. In fact, we could literally write a whole new book describing this process from top to tail!
So, what (and who) is involved in asking questions about data? Sometimes, it is a process of saving data into a relational database and running SQL queries to find insights into the data: "for the millions of users that bought this particular product, what are the top 3 OTHER products they also bought?" Other times, the question is more complex, such as, "Given the review of a movie, is this a positive or negative review?" This book is mainly focused on complex questions like the latter. Answering these types of questions is where businesses really get the most impact from their big data projects, and it is also where we see a proliferation of emerging technologies that look to make this Q and A system easier, with more functionality.
Some of the most popular, open source frameworks that look to help answer data questions include R, Python, Julia, and Octave, all of which perform reasonably well with small (X < 100 GB) datasets. At this point, it's worth stopping and pointing out a clear distinction between big versus small data. Our general rule of thumb in the office goes as follows:
If you can open your dataset using Excel, you are working with small data.
What happens when the dataset in question is so vast that it cannot fit into the memory of a single computer and must be distributed across a number of nodes in a large computing cluster? Can't we just rewrite some R code, for example, and extend it to account for more than a single-node computation? If only things were that simple! There are many reasons why the scaling of algorithms to more machines is difficult. Imagine a simple example of a file containing a list of names:
B
D
X
A
D
A
We would like to compute the number of occurrences of individual words in the file. If the file fits into a single machine, you can easily compute the number of occurrences by using a combination of the Unix tools, sort and uniq:
bash> sort file | uniq -c
The output is as shown ahead:
2 A
1 B
2 D
1 X
However, if the file is huge and distributed over multiple machines, it is necessary to adopt a slightly different computation strategy: for example, compute the number of occurrences of individual words for every part of the file that fits into memory and then merge the partial results together. Hence, even simple tasks, such as counting the occurrences of names, can become more complicated in a distributed environment.
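For comparison, the same counting task expressed with Spark's RDD API distributes naturally across a cluster. The following is a minimal spark-shell sketch (the file name is a placeholder, and sc is the SparkContext the shell provides):

// Count the occurrences of each name in a (possibly distributed) text file.
val counts = sc.textFile("names.txt")       // "names.txt" is a placeholder path
  .map(name => (name.trim, 1))              // emit a (name, 1) pair for every line
  .reduceByKey(_ + _)                       // merge partial counts from each partition
counts.collect().foreach(println)           // for example: (A,2), (B,1), ...

The reduceByKey step performs exactly the merge-partial-results strategy described above, but Spark takes care of distributing and combining the work.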
Machine learning algorithms combine simple tasks into complex patterns that become even more complicated in a distributed environment. Let's take a simple decision tree algorithm (reference), for example. This particular algorithm creates a binary tree that tries to fit the training data and minimize prediction errors. However, in order to do this, it has to decide which branch of the tree to send every data point to (don't worry, we'll cover the mechanics of how this algorithm works, along with some very useful parameters, later in the book). Let's demonstrate it with a simple example:
Consider the situation depicted in the preceding figure: a two-dimensional board with many points colored in two colors, red and blue. The goal of the decision tree is to learn and generalize the shape of the data and help decide the color of a new point. In our example, we can easily see that the points almost follow a chessboard pattern. However, the algorithm has to figure out the structure by itself. It starts by finding the best position of a vertical or horizontal line which would separate the red points from the blue points.
The decision found is stored in the tree root, and the steps are applied recursively to both partitions. The algorithm ends when a partition contains only a single point:
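As a small taste of what is to come (Chapter 2 trains tree models for real), a decision tree can be fitted to labeled two-dimensional points with Spark MLlib roughly as follows. The tiny dataset below is a made-up stand-in for the chessboard example, not data from the book:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.DecisionTree

// Hand-made points: label 1.0 stands for red, 0.0 for blue (coordinates are invented).
val points = sc.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(0.1, 0.2)),
  LabeledPoint(0.0, Vectors.dense(0.9, 0.3)),
  LabeledPoint(1.0, Vectors.dense(0.4, 0.8)),
  LabeledPoint(0.0, Vectors.dense(0.7, 0.9))))

// Train a binary decision tree; "gini" is one of the impurity measures discussed in Chapter 2.
val model = DecisionTree.trainClassifier(points, numClasses = 2,
  categoricalFeaturesInfo = Map[Int, Int](), impurity = "gini",
  maxDepth = 5, maxBins = 32)

println(model.toDebugString)   // prints the learned splits, one line per tree node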
With a growing amount of data, single-machine tools were not able to satisfy industry needs, thereby creating space for new data processing methods and tools, especially Hadoop MapReduce, which is based on an idea originally described in the Google paper, MapReduce: Simplified Data Processing on Large Clusters (https://research.google.com/archive/mapreduce.html). On the other hand, it is a generic framework without any explicit support or libraries to create machine learning workflows. Another limitation of classical MapReduce is that it performs many disk I/O operations during the computation instead of benefiting from machine memory.
As you have seen, there are several existing machine learning tools and distributed platforms, but none of them is an exact match for performing machine learning tasks on large data in a distributed environment. All these claims open the door for Apache Spark.
Enter the room, Apache Spark!
Created in 2010 at the UC Berkeley AMP Lab (Algorithms, Machines, People), the Apache Spark project was built with an eye for speed, ease of use, and advanced analytics. One key difference between Spark and other distributed frameworks such as Hadoop is that datasets can be cached in memory, which lends itself nicely to machine learning, given its iterative nature (more on this later!) and how data scientists are constantly accessing the same data many times over.
Spark can be run in a variety of ways, such as the following:
Local mode: This entails a single Java Virtual Machine (JVM) executed on a single host.
Standalone Spark cluster: This entails multiple JVMs on multiple hosts.
Via a resource manager such as YARN/Mesos: The application deployment is driven by a resource manager, which controls the allocation of nodes, application distribution, and deployment.
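For example, the same spark-shell command can target each of these modes just by changing the --master argument; the host name below is a placeholder, not one used in the book:

bash> spark-shell --master "local[*]"             # local mode: one JVM on this host
bash> spark-shell --master spark://master:7077    # standalone cluster
bash> spark-shell --master yarn                   # via the YARN resource manager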
If you know about the Spark project, then chances are high that you have also heard of a company called Databricks. However, you might not know how Databricks and the Spark project are related to one another. In short, Databricks was founded by the creators of the Apache Spark project and accounts for over 75% of the code base for the Spark project. Aside from being a huge force behind the Spark project with respect to development, Databricks also offers various certifications in Spark for developers, administrators, trainers, and analysts alike. However, Databricks is not the only main contributor to the code base; companies such as IBM, Cloudera, and Microsoft also actively participate in Apache Spark development.
As a side note, Databricks also organizes the Spark Summit (in both Europe and the US), which is the premier Spark conference and a great place to learn about the latest developments in the project and how others are using Spark within their ecosystem.
Throughout this book, we will give recommended links that we read daily that offer great insights and also important changes with respect to the new versions of Spark. One of the best resources here is the Databricks blog, which is constantly being updated with great content. Be sure to regularly check this out at https://databricks.com/blog.
Also, here is a link to see the past Spark Summit talks, which you may find helpful: http://slideshare.net/databricks.
So, you have downloaded the latest version of Spark (depending on how you plan on launching Spark) and you have run the standard Hello, World! example... what now?!
Spark comes equipped with five libraries, which can be used separately--or in unison--depending on the task we are trying to solve. Note that in this book, we plan on using a variety of different libraries, all within the same application so that you will have the maximum exposure to the Spark platform and better understand the benefits (and limitations) of each library. These five libraries are as follows:
Core: This is the Spark core infrastructure, providing primitives to represent and store data, called Resilient Distributed Datasets (RDDs), and to manipulate data with tasks and jobs.
SQL: This library provides a user-friendly API over core RDDs by introducing DataFrames and SQL to manipulate the stored data.
MLlib (Machine Learning Library): This is Spark's very own machine learning library of algorithms, developed in-house, that can be used within your Spark applications.
GraphX: This is used for graphs and graph calculations; we will explore this particular library in depth in a later chapter.
Streaming: This library allows real-time streaming of data from various sources, such as Kafka, Twitter, Flume, and TCP sockets, to name a few. Many of the applications we will build in this book will leverage the MLlib and Streaming libraries.
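To give a feel for how the Core and SQL libraries cooperate, here is a small spark-shell sketch that turns an in-memory RDD into a DataFrame and queries it with SQL; the table and column names are invented for the illustration:

// spark is the SparkSession available in spark-shell; its implicits enable toDF().
import spark.implicits._

// Build a DataFrame from an RDD of (name, age) pairs.
val people = sc.parallelize(Seq(("Alice", 34), ("Bob", 29))).toDF("name", "age")

// Register it as a temporary view and query it with SQL.
people.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()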
The Spark platform can also be extended by third-party packages. There are many of them, for example, support for reading CSV or Avro files, integration with Redshift, and Sparkling Water, which encapsulates the H2O machine learning library.
H2O is an open source, machine learning platform that plays extremely well with Spark; in fact, it was one of the first third-party packages deemed "Certified on Spark".
Sparkling Water (H2O + Spark) is H2O's integration of their platform within the Spark project, which combines the machine learning capabilities of H2O with all the functionality of Spark. This means that users can run H2O algorithms on Spark RDD/DataFrame for both exploration and deployment purposes. This is made possible because H2O and Spark share the same JVM, which allows for seamless transitions between the two platforms. H2O stores data in the H2O frame, which is a columnar-compressed representation of your dataset that can be created from Spark RDD and/or DataFrame. Throughout much of this book, we will be referencing algorithms from Spark's MLlib library and H2O's platform, showing how to use both the libraries to get the best results possible for a given task.
The following is a summary of the features Sparkling Water comes equipped with:
Use of H2O algorithms within a Spark workflow
Transformations between Spark and H2O data structures
Use of Spark RDD and/or DataFrame as inputs to H2O algorithms
Use of H2O frames as inputs into MLlib algorithms (will come in handy when we do feature engineering later)
Transparent execution of Sparkling Water applications on top of Spark (for example, we can run a Sparkling Water application within a Spark stream)
The H2O user interface to explore Spark data
Sparkling Water is designed to be executed as a regular Spark application. Consequently, it is launched inside a Spark executor created after submitting the application. At this point, H2O starts services, including a distributed key-value (K/V) store and memory manager, and orchestrates them into a cloud. The topology of the created cloud follows the topology of the underlying Spark cluster.
As stated previously, Sparkling Water enables transformation between different types of RDDs/DataFrames and H2O's frame, and vice versa. When converting from a hex frame to an RDD, a wrapper is created around the hex frame to provide an RDD-like API. In this case, data is not duplicated but served directly from the underlying hex frame. Converting from an RDD/DataFrame to a H2O frame requires data duplication because it transforms data from Spark into H2O-specific storage. However, data stored in an H2O frame is heavily compressed and does not need to be preserved as an RDD anymore:
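In code, this round trip amounts to a few calls on the Sparkling Water H2OContext. The sketch below is only indicative; the exact method signatures can vary between Sparkling Water releases, and df stands for a DataFrame you already have:

import org.apache.spark.h2o._

// Start (or attach to) the H2O services running inside the Spark executors.
val h2oContext = H2OContext.getOrCreate(spark)

// DataFrame -> H2O frame: data is duplicated into H2O's compressed columnar storage.
val h2oFrame = h2oContext.asH2OFrame(df)

// H2O frame -> DataFrame: no duplication, a wrapper serves data from the underlying frame.
val backAsDF = h2oContext.asDataFrame(h2oFrame)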
As stated previously, MLlib is a library of popular machine learning algorithms built using Spark. Not surprisingly, H2O and MLlib share many of the same algorithms but differ in both their implementation and functionality. One very handy feature of H2O is that it allows users to visualize their data and perform feature engineering tasks, which we will cover in depth in later chapters. Data visualization is accomplished through a web-based GUI that lets users seamlessly switch between a code shell and a notebook-friendly environment. The following is an example of the H2O notebook - called Flow - that you will become familiar with soon:
One other nice addition is that H2O allows data scientists to grid search the many hyperparameters that ship with their algorithms. Grid search is a way of optimizing the hyperparameters of an algorithm to make model configuration easier. It is often difficult to know which hyperparameters to change and how to change them; grid search allows us to explore many hyperparameters simultaneously, measure the output, and select the best models based on our quality requirements. The H2O grid search can be combined with model cross-validation and various stopping criteria, resulting in advanced strategies such as picking 1,000 random parameter combinations from a huge hyperparameter space and finding the best model that can be trained under two minutes with an AUC greater than 0.7.
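H2O's own grid search API appears in later chapters. For readers coming from Spark's pipeline API, the closest analogue there is ParamGridBuilder combined with CrossValidator; the following is a rough sketch on a hypothetical logistic regression, with trainingDF standing in for a DataFrame of "label" and "features" columns:

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val lr = new LogisticRegression()

// Enumerate the hyperparameter combinations to explore.
val paramGrid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.01, 0.1, 1.0))
  .addGrid(lr.elasticNetParam, Array(0.0, 0.5, 1.0))
  .build()

// Cross-validate every combination and keep the model with the best AUC.
val cv = new CrossValidator()
  .setEstimator(lr)
  .setEvaluator(new BinaryClassificationEvaluator().setMetricName("areaUnderROC"))
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)

val bestModel = cv.fit(trainingDF)   // trainingDF is a placeholder DataFrame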
Raw data for problems often comes from multiple sources with different and often incompatible formats. The beauty of the Spark programming model is its ability to define data operations that process the incoming data and transform it into a regular form that can be used for further feature engineering and model building. This process is commonly referred to as data munging and is where much of the battle is won with respect to data science projects. We keep this section intentionally brief because the best way to showcase the power--and necessity!--of data munging is by example. So, take heart; we have plenty of practice to go through in this book, which emphasizes this essential process.
Often, the process flow of many big data projects is iterative, which means a lot of back and forth: testing new ideas, adding new features, tweaking various hyperparameters, and so on, with a fail-fast attitude. The end result of these projects is usually a model that can answer a question being posed. Notice that we didn't say accurately answer a question being posed! One pitfall of many data scientists these days is their inability to generalize a model to new data, meaning that they have overfit their data so that the model provides poor results when given new data. Accuracy is extremely task-dependent and is usually dictated by the business needs, with some sensitivity analysis being done to weigh the cost-benefits of the model outcomes. However, there are a few standard accuracy measures that we will go over throughout this book so that you can compare various models and see how changes to a model impact the result.
In this chapter, we wanted to give you a brief glimpse into the life of a data scientist, what this entails, and some of the challenges that data scientists consistently face. In light of these challenges, we feel that the Apache Spark project is ideally positioned to help tackle these topics, which range from data ingestion and feature extraction/creation to model building and deployment. We intentionally kept this chapter short and light on verbiage because we feel working through examples and different use cases is a better use of time as opposed to speaking abstractly and at length about a given data science topic. Throughout the rest of this book, we will focus solely on this process while giving best-practice tips and recommended reading along the way for users who wish to learn more. Remember that before embarking on your next data science project, be sure to clearly define the problem beforehand, so you can ask an intelligent question of your data and (hopefully) get an intelligent answer!
One awesome website for all things data science is KDnuggets (http://www.kdnuggets.com). Here's a great article on the language all data scientists must learn in order to be successful (http://www.kdnuggets.com/2015/09/one-language-data-scientist-must-master.html).
True or false? Positive or negative? Pass or no pass? User clicks on the ad versus not clicking the ad? If you've ever asked/encountered these questions before then you are already familiar with the concept of binary classification.
At its core, binary classification - also referred to as binomial classification - attempts to categorize a set of elements into two distinct groups using a classification rule, which in our case can be a machine learning algorithm. This chapter shows how to deal with it in the context of Spark and big data. We are going to explain and demonstrate:
Spark MLlib models for binary classification including decision trees, random forest, and the gradient boosted machine
Binary classification support in H2O
Searching for the best model in a hyperspace of parameters
Evaluation metrics for binomial models
Binary classifiers have an intuitive interpretation since they are trying to separate data points into two groups. This sounds simple, but we need to have some notion of measuring the quality of this separation. Furthermore, one important characteristic of a binary classification problem is that, often, the proportion of one group of labels versus the other can be disproportionate. That means the dataset may be imbalanced with respect to one label, which necessitates careful interpretation by the data scientist.
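As a preview of the evaluation metrics covered later in this chapter, Spark MLlib ships a BinaryClassificationMetrics helper. A minimal sketch, assuming scoreAndLabels is an RDD[(Double, Double)] of (predicted score, true label) pairs produced by some classifier on a test set:

import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

// scoreAndLabels is a placeholder RDD of (score, label) pairs, not data from the book.
val metrics = new BinaryClassificationMetrics(scoreAndLabels)

println(s"Area under ROC = ${metrics.areaUnderROC()}")
println(s"Area under PR  = ${metrics.areaUnderPR()}")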
