Develop a range of cutting-edge machine learning projects with Apache Spark using this actionable guide
If you are a data scientist, a data analyst, or an R and SPSS user with a good understanding of machine learning concepts, algorithms, and techniques, then this is the book for you. Some basic understanding of Spark and its core elements and application is required.
There's a reason why Apache Spark has become one of the most popular tools in Machine Learning – its ability to handle huge datasets at an impressive speed means you can be much more responsive to the data at your disposal. This book shows you Spark at its very best, demonstrating how to connect it with R and unlock maximum value not only from the tool but also from your data.
Packed with a range of project "blueprints" that demonstrate some of the most interesting challenges that Spark can help you tackle, you'll find out how to use Spark notebooks and access, clean, and join different datasets before putting your knowledge into practice with some real-world projects, in which you will see how Spark Machine Learning can help you with everything from fraud detection to analyzing customer attrition. You'll also find out how to build a recommendation engine using Spark's parallel computing powers.
This book offers a step-by-step approach to setting up Apache Spark and using other analytical tools with it to process big data and build machine learning projects. The initial chapters focus more on the theory of machine learning with Spark, while each of the later chapters focuses on building a standalone project with Spark.
Copyright © 2016 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author nor Packt Publishing, nor its dealers and distributors, will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: May 2016
Production reference: 1250516
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78588-039-1
www.packtpub.com
Author
Alex Liu
Reviewer
Hao Ren
Commissioning Editor
Dipika Gaonkar
Acquisition Editor
Meeta Rajani
Content Development Editor
Anish Sukumaran
Technical Editors
Dhiraj Chandanshive
Siddhesh Patil
Copy Editor
Shruti Iyer
Project Coordinator
Izzat Contractor
Proofreader
Safis Editing
Indexer
Mariammal Chettiyar
Graphics
Disha Haria
Production Coordinator
Nilesh R. Mohite
Cover Work
Nilesh R. Mohite
Alex Liu is an expert in research methods and data science. He is currently one of IBM's leading experts in big data analytics and a lead data scientist, serving big corporations, developing big data analytics IPs, and speaking at industry conferences such as Strata, Insights, SMAC, and BigDataCamp. In the past, Alex served as chief or lead data scientist for a few companies, including Yapstone, RS, and TRG. Before this, he was a lead consultant and director at RMA, where he provided data analytics consultation and training to many well-known organizations, including the United Nations, IndyMac, AOL, Ingram Micro, GEM, Farmers Insurance, Scripps Networks, Sears, and USAID. At the same time, Dr. Liu taught advanced research methods to PhD candidates at the University of Southern California and the University of California, Irvine. Before this, he worked as a managing director for CATE/GEC and as a research fellow at the Asia/Pacific Research Center at Stanford University. Alex has a PhD in quantitative sociology and a master's degree in statistical computing from Stanford University.
I would like to thank IBM for providing a great open and innovative environment to learn and practice Big Data analytics. I would especially like to thank my managers, Kim Siegel and Kevin Zachary, for their support and encouragement, without which it would not have been possible to complete this book.
I also would like to thank my beautiful wife, Lauria, and two beautiful daughters, Kate and Khloe, for their patience and support, which enabled me to work effectively. Finally, I would like to thank the Packt staff, especially Anish Sukumaran and Meeta Rajani, for making the writing and editing process smooth and joyful.
Hao Ren is a data engineer working in Paris for leboncoin (https://www.leboncoin.fr/), a classified advertising website that is the fifth most visited site in France. Three years of work experience with functional programming in Scala, machine learning, and distributed systems define his career. Hao's main specialty is machine learning with Apache Spark, building systems such as crawler detection and recommender systems. He has also reviewed a more detailed and advanced book by Packt Publishing, Machine Learning with Spark, which is worth a read as well.
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at <[email protected]> for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.
https://www2.packtpub.com/books/subscription/packtlib
Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can search, access, and read Packt's entire library of books.
As data scientists and machine learning professionals, our job is to build models for detecting fraud, predicting customer churn, or, more broadly, turning data into insights; for this, we sometimes need to process huge amounts of data and handle complicated computations. Therefore, we are always excited by new computing tools, such as Spark, and we spend a lot of time learning about them. Plenty of learning materials are available for these tools, but most are written from a computing perspective, often by computer scientists.
We, the data scientists and machine learning professionals who use Spark, are more concerned with how these new systems can help us build models with greater predictive accuracy, and with how they can make data processing and coding easier for us. This is the main reason this book has been developed, and why it has been written by a data scientist.
At the same time, we data scientists and machine learning professionals have already developed our own frameworks and processes, and we already use good model-building tools, such as R and SPSS. We understand that some new tools, such as Spark's MLlib, may replace certain old ones, but not all of them. Using Spark together with our existing tools is therefore essential for us as Spark users, and it is one of the main focuses of this book; it is also a critical element that sets this book apart from other Spark books.
Overall, this is a Spark book written by a data scientist for data scientists and machine learning professionals, with the aim of making machine learning with Spark easy for us.
Chapter 1, Spark for Machine Learning, introduces Apache Spark from a machine learning perspective. We will discuss Spark DataFrames and R, Spark pipelines, the RM4Es data science framework, as well as the Spark notebook and implementation models.
Chapter 2, Data Preparation for Spark ML, focuses on data preparation for machine learning on Apache Spark with tools such as Spark SQL. We will discuss data cleaning, identity matching, data merging, and feature development.
Chapter 3, A Holistic View on Spark, clearly explains the RM4Es machine learning framework and its processes with a real-life example, and also demonstrates how easily businesses can obtain holistic views of their data with Spark.
Chapter 4, Fraud Detection on Spark, discusses how Spark makes machine learning for fraud detection easy and fast. At the same time, we will illustrate a step-by-step process of obtaining fraud insights from big data.
Chapter 5, Risk Scoring on Spark, reviews machine learning methods and processes for a risk scoring project and implements them using R notebooks on Apache Spark in a special DataScientistWorkbench environment. Our focus for this chapter is the notebook.
Chapter 6, Churn Prediction on Spark, further illustrates our special step-by-step machine learning process on Spark with a focus on using MLlib to develop customer churn predictions to improve customer retention.
Chapter 7, Recommendations on Spark, describes how to develop recommendations with big data on Spark by utilizing SPSS on the Spark system.
Chapter 8, Learning Analytics on Spark, extends our application to serve learning organizations like universities and training institutions, for which we will apply machine learning to improve learning analytics for a real case of predicting student attrition.
Chapter 9, City Analytics on Spark, helps readers gain a better understanding of how Apache Spark can be utilized not only for commercial purposes but also for public ones, serving cities with a real use case of predicting service requests on Spark.
Chapter 10, Learning Telco Data on Spark, further extends what was studied in the previous chapters and allows readers to combine what they have learned for dynamic machine learning with huge amounts of telco data on Spark.
Chapter 11, Modeling Open Data on Spark, presents dynamic machine learning with open data on Spark, in which users take a data-driven approach and utilize all the available technologies for optimal results. This chapter is an extension of Chapter 9, City Analytics on Spark, and Chapter 10, Learning Telco Data on Spark, as well as a good review of all the previous chapters through a real-life project.
Throughout this book, we assume that you have some basic programming experience in either Scala or Python, some basic experience with modeling tools such as R or SPSS, and some basic knowledge of machine learning and data science.
This book is written for analysts, data scientists, researchers, and machine learning professionals who need to process Big Data but who are not necessarily familiar with Spark.
Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.
To send us general feedback, simply e-mail <[email protected]>, and mention the book's title in the subject of your message.
If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.
Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.
We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from http://www.packtpub.com/sites/default/files/downloads/ApacheSparkMachineLearningBlueprints_ColorImages.pdf.
Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.
To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.
Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.
Please contact us at <[email protected]> with a link to the suspected pirated material.
We appreciate your help in protecting our authors and our ability to bring you valuable content.
If you have a problem with any aspect of this book, you can contact us at <[email protected]>, and we will do our best to address the problem.
This chapter provides an introduction to Apache Spark from a machine learning (ML) and data analytics perspective, and also discusses machine learning in relation to Spark computing. Here, we first present an overview of Apache Spark, as well as Spark's advantages for data analytics in comparison to MapReduce and other computing platforms. We then discuss the main topics that any data scientist or machine learning professional is expected to master in order to take full advantage of Apache Spark computing; this chapter covers each of them in turn.
In this section, we provide an overview of the Apache Spark computing platform and a discussion about some advantages of utilizing Apache Spark, in comparison to using other computing platforms like MapReduce. Then, we briefly discuss how Spark computing fits modern machine learning and big data analytics.
After this section, readers will have a basic understanding of Apache Spark, as well as a good understanding of the important benefits it brings to machine learning.
Apache Spark is a computing framework for the fast processing of big data. This framework contains a distributed computing engine and a specially designed programming model. Spark started as a research project at the AMPLab of the University of California, Berkeley in 2009; it was open sourced in 2010 and later donated to the Apache Software Foundation. Since then, Apache Spark has experienced exponential growth, and it is now the most active open source project in the big data field.
Spark uses an in-memory, distributed computing approach, which makes it one of the fastest platforms available, especially for iterative computation. According to many published tests, it can run up to 100 times faster than Hadoop MapReduce.
Apache Spark provides a unified platform, consisting of the Spark core engine and four libraries: Spark SQL, Spark Streaming, MLlib, and GraphX. All four libraries have Python, Java, and Scala programming APIs.
Besides these four built-in libraries, there are also dozens of packages for Apache Spark provided by third parties, which can be used for handling data sources, machine learning, and other tasks.
Apache Spark follows a three-month cycle for new releases, with Spark version 1.6.0 released on January 4, 2016. Release 1.3 included the DataFrame API and the ML Pipelines API, and starting from release 1.4, the R interface (SparkR) has been included by default.
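As a flavor of the DataFrame API just mentioned, here is a minimal sketch written for the Scala spark-shell of the 1.x releases, where sc and sqlContext are predefined; the people.json file and its age column are hypothetical:

```scala
// sqlContext is predefined in the 1.x spark-shell; "people.json" is a
// placeholder for any file containing one JSON record per line
val df = sqlContext.read.json("people.json")
df.printSchema()                  // inspect the schema Spark inferred
df.filter(df("age") > 21).show() // assumes a numeric "age" field exists
```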
To download Apache Spark, readers should go to http://spark.apache.org/downloads.html.
To install Apache Spark and start running it, readers should consult its latest documentation at http://spark.apache.org/docs/latest/.
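Once installed, the quickest way to start experimenting is through the interactive shells bundled with the distribution. A minimal sketch, assuming the downloaded package has been unpacked and you are in its top-level directory:

```
$ ./bin/spark-shell   # Scala shell; predefines a SparkContext (sc) and a SQLContext (sqlContext)
$ ./bin/pyspark       # the Python equivalent
$ ./bin/sparkR        # the R shell, bundled since release 1.4
```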
Apache Spark has many advantages over MapReduce and other big data computing platforms. Among them, the two most distinguishing are that it is fast to run and fast to write.
Overall, Apache Spark has kept some of MapReduce's most important advantages, such as scalability and fault tolerance, while extending them greatly with new technologies.
In comparison to MapReduce, Apache Spark's engine is capable of executing a more general Directed Acyclic Graph (DAG) of operators. Therefore, when using Apache Spark to execute MapReduce-style graphs, users can achieve higher-performance batch processing than on Hadoop.
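As a small illustration of such a DAG, the classic word count fits into one chain of transformations in the Scala shell; the input path below is a placeholder:

```scala
// Each transformation only records a step in the DAG; Spark plans and runs
// the whole graph when the action (take) is finally invoked
val counts = sc.textFile("input.txt")   // placeholder path
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)   // a complete MapReduce-style job in one chained expression
counts.take(10).foreach(println)
```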
Apache Spark has in-memory processing capabilities and uses a new data abstraction, the Resilient Distributed Dataset (RDD), which enables highly iterative computing and interactive applications while also extending Spark's fault tolerance.
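To see why this matters for iterative work, consider the following toy sketch for the Scala shell: cache() pins the RDD in memory, so the ten passes below reuse it rather than recomputing it from the source each time. The dataset and computation are made up for illustration:

```scala
// cache() keeps the RDD in memory across the iterations that follow
val points = sc.parallelize(1 to 1000000).map(_.toDouble).cache()

var result = 0.0
for (step <- 1 to 10) {
  // each pass reuses the in-memory partitions instead of rebuilding them
  result = points.map(x => x / step).mean()
}
println(s"Result after 10 passes: $result")
```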
At the same time, Apache Spark makes it possible to represent complex pipelines with only a few lines of code. It is best known for the ease with which it can be used to create algorithms that capture insight from complex and even messy data, and for enabling users to apply that insight in time to drive outcomes.
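For instance, a simple text-classification pipeline can be expressed and fitted in roughly a dozen lines of Scala. The sketch below follows the style of the standard ML Pipelines examples, with a small made-up training DataFrame; again, sqlContext is the shell's predefined SQLContext:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// A tiny in-line labeled dataset; in practice this would come from real data
val training = sqlContext.createDataFrame(Seq(
  (0L, "spark is fast", 1.0),
  (1L, "disk based batch jobs", 0.0),
  (2L, "spark mllib pipelines", 1.0),
  (3L, "slow legacy workflow", 0.0)
)).toDF("id", "text", "label")

// Three stages -- tokenize, hash words into feature vectors, fit a classifier --
// assembled and fitted as a single unit
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol(tokenizer.getOutputCol).setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

val model = pipeline.fit(training)   // the fitted model can now transform new data
```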
As summarized by the Apache Spark team, Spark enables workloads ranging from iterative algorithms and interactive data mining to streaming analytics. To a practical data scientist, Apache Spark easily demonstrates its advantages as soon as it is adopted for everyday data processing and model building.
Most users are satisfied with Apache Spark's advantages in speed and performance, but some also noted that Apache Spark is still in the process of maturing.
Some examples of these Spark benefits realized in practice can be found at http://www.svds.com/use-cases-for-apache-spark/.
With its innovations around RDDs and in-memory processing, Apache Spark has truly made distributed computing easily accessible to data scientists and machine learning professionals. According to the Apache Spark team, Spark runs on the Mesos cluster manager, letting it share resources with Hadoop and other applications, and it can therefore read from any Hadoop input source, such as HDFS.
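In practice, this means that an HDFS-resident dataset loads with the same one-liner used for a local file; a sketch in which the namenode host, port, and path are all placeholders:

```scala
// Only the URI scheme distinguishes HDFS input from local input
val logs = sc.textFile("hdfs://namenode:9000/data/server_logs")
println(s"Log lines: ${logs.count()}")
```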
For these reasons, the Apache Spark computing model is very well suited to distributed machine learning. Especially for rapid interactive machine learning, parallel computing, and complicated modelling at scale, Apache Spark should definitely be utilized.
According to the Spark development team, Spark's philosophy is to make life easy and productive for data scientists and machine learning professionals, and Apache Spark has been designed accordingly. Per the introduction by Patrick Wendell, co-founder of Databricks, Spark was especially made for large-scale data processing; it supports agile data science with rapid iteration, and it can be integrated easily with IBM and other solutions.
In this section, we review the algorithms needed for machine learning and introduce machine learning libraries, including Spark's MLlib and IBM's SystemML; we then discuss their integration with Apache Spark.
After reading this section, readers will be familiar with various machine learning libraries, including Spark's MLlib, and will know how to make them ready for machine learning work.
To complete a machine learning project, data scientists often employ classification or regression algorithms to develop and evaluate predictive models; these algorithms are readily available in machine learning tools such as R or MATLAB. Besides datasets and computing platforms, such machine learning libraries, as collections of machine learning algorithms, are a necessary part of every project.
For example, the strength and depth of the popular R language come mainly from the various algorithms it readily provides for machine learning professionals; the total number of R packages is well over 1,000. Data scientists do not need all of them, but they do need a selection of packages suited to their data preparation, visualization, and modelling tasks.
According to a recent ComputerWorld survey, the most downloaded R packages are:
PACKAGE          # OF DOWNLOADS
Rcpp             162,778
ggplot2          146,008
plyr             123,889
stringr          120,387
colorspace       118,798
digest           113,899
reshape2         109,869
RColorBrewer     100,623
scales            92,448
manipulate        88,664
For more information, please visit http://www.computerworld.com/article/2920117/business-intelligence/most-downloaded-r-packages-last-month.html.
MLlib is Apache Spark's machine learning library. It is scalable and consists of many commonly used machine learning algorithms. Built into MLlib are algorithms for classification, regression, clustering, collaborative filtering, and dimensionality reduction, as well as lower-level optimization primitives.
Spark MLlib is still under active development, with new algorithms expected to be added in every new release. In line with Apache Spark's computing philosophy, MLlib is built for ease of use and deployment, with high performance.
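As a small taste of this ease of use, here is a sketch of clustering with MLlib's k-means in the Scala shell; the input file, its space-separated format, and the choice of two clusters and twenty iterations are all assumptions made for illustration:

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// "features.txt" is a placeholder: numeric features, space separated, one row per line
val parsed = sc.textFile("features.txt")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
  .cache()

// Cluster the rows into two groups, running at most 20 iterations
val model = KMeans.train(parsed, 2, 20)
println(s"Within-set sum of squared errors: ${model.computeCost(parsed)}")
```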
MLlib uses the linear algebra package Breeze, which depends on netlib-java and jblas; these, in turn, depend on native Fortran routines. Users need to install the gfortran runtime library if it is not already present on their nodes, as MLlib will throw a linking error if it cannot detect these libraries automatically.
For MLlib use cases and further details on how to use MLlib, please visit:
http://spark.apache.org/docs/latest/mllib-guide.html.
As discussed in the previous section, MLlib makes available many frequently used algorithms, such as regression and classification. However, these basics are not enough for complicated machine learning projects.
If we waited for the Apache Spark team to add all the needed ML algorithms, it could take a long time. The good news is that many third parties have contributed ML libraries to Apache Spark.
IBM has contributed its machine learning library, SystemML, to Apache Spark.
Besides what MLlib provides, SystemML offers many additional ML algorithms, such as those for missing-data imputation, SVMs, GLMs, ARIMA, and non-linear optimization, along with some graphical modelling and matrix factorization algorithms.
Developed by the IBM Almaden Research group, IBM's SystemML is an engine for distributed machine learning that can scale to arbitrarily large data sizes. SystemML is modeled after R syntax and semantics, and it provides the ability to author new algorithms via its own language.
Through good integration with R via SparkR, Apache Spark users also have the potential to utilize thousands of R packages for machine learning algorithms when needed. As will be discussed in later sections of this chapter, the SparkR notebook makes this very easy.
For more about IBM SystemML, please visit http://researcher.watson.ibm.com/researcher/files/us-ytian/systemML.pdf
