Develop a range of cutting-edge machine learning projects with Apache Spark using this actionable guide
If you are a data scientist, a data analyst, or an R and SPSS user with a good understanding of machine learning concepts, algorithms, and techniques, then this is the book for you. Some basic understanding of Spark and its core elements and application is required.
There's a reason why Apache Spark has become one of the most popular tools in Machine Learning – its ability to handle huge datasets at an impressive speed means you can be much more responsive to the data at your disposal. This book shows you Spark at its very best, demonstrating how to connect it with R and unlock maximum value not only from the tool but also from your data.
Packed with a range of project "blueprints" that demonstrate some of the most interesting challenges that Spark can help you tackle, you'll find out how to use Spark notebooks and access, clean, and join different datasets before putting your knowledge into practice with some real-world projects, in which you will see how Spark Machine Learning can help you with everything from fraud detection to analyzing customer attrition. You'll also find out how to build a recommendation engine using Spark's parallel computing powers.
This book offers a step-by-step approach to setting up Apache Spark and using other analytical tools with it to process big data and build machine learning projects. The initial chapters focus more on the theory of machine learning with Spark, while each of the later chapters focuses on building a standalone project with Spark.
Copyright © 2016 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author nor Packt Publishing, nor its dealers and distributors, will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: May 2016
Production reference: 1250516
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78588-039-1
www.packtpub.com
Author
Alex Liu
Reviewer
Hao Ren
Commissioning Editor
Dipika Gaonkar
Acquisition Editor
Meeta Rajani
Content Development Editor
Anish Sukumaran
Technical Editors
Dhiraj Chandanshive
Siddhesh Patil
Copy Editor
Shruti Iyer
Project Coordinator
Izzat Contractor
Proofreader
Safis Editing
Indexer
Mariammal Chettiyar
Graphics
Disha Haria
Production Coordinator
Nilesh R. Mohite
Cover Work
Nilesh R. Mohite
Alex Liu is an expert in research methods and data science. He is currently one of IBM's leading experts in big data analytics and a lead data scientist, serving big corporations, developing big data analytics IPs, and speaking at industry conferences such as Strata, Insights, SMAC, and BigDataCamp. In the past, Alex served as chief or lead data scientist for a few companies, including Yapstone, RS, and TRG. Before this, he was a lead consultant and director at RMA, where he provided data analytics consultation and training to many well-known organizations, including the United Nations, IndyMac, AOL, Ingram Micro, GEM, Farmers Insurance, Scripps Networks, Sears, and USAID. At the same time, Dr. Liu taught advanced research methods to PhD candidates at the University of Southern California and the University of California, Irvine. Before this, he worked as a managing director for CATE/GEC and as a research fellow at the Asia/Pacific Research Center at Stanford University. Alex has a PhD in quantitative sociology and a master's degree in statistical computing from Stanford University.
I would like to thank IBM for providing a great open and innovative environment to learn and practice Big Data analytics. I would especially like to thank my managers, Kim Siegel and Kevin Zachary, for their support and encouragement, without which it would not have been possible to complete this book.
I also would like to thank my beautiful wife, Lauria, and two beautiful daughters, Kate and Khloe, for their patience and support, which enabled me to work effectively. Finally, I would like to thank the Packt staff, especially Anish Sukumaran and Meeta Rajani, for making the writing and editing process smooth and joyful.
Hao Ren is a data engineer working in Paris for leboncoin (https://www.leboncoin.fr/), a classified advertising website that is the fifth most visited site in France. Three years of work experience with functional programming in Scala, machine learning, and distributed systems define his career. Hao's main specialty is machine learning with Apache Spark, building systems such as crawler detection and recommender systems. He has also reviewed a more detailed and advanced book by Packt Publishing, Machine Learning with Spark, which is worth a read as well.
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at <[email protected]> for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.
https://www2.packtpub.com/books/subscription/packtlib
Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can search, access, and read Packt's entire library of books.
As data scientists and machine learning professionals, our job is to build models for detecting fraud, predicting customer churn, or, more broadly, turning data into insights; for this, we sometimes need to process huge amounts of data and handle complicated computations. Therefore, we are always excited by new computing tools, such as Spark, and we spend a lot of time learning about them. Plenty of learning materials are available for these tools, but most are written from a computing perspective, often by computer scientists.
We, the data scientists and machine learning professionals who use Spark, are more concerned with how these new systems can help us build models with greater predictive accuracy, and with how they can make data processing and coding easier for us. This is the main reason this book has been developed, and why it has been written by a data scientist.
At the same time, we data scientists and machine learning professionals have already developed our own frameworks and processes, and we already use good model-building tools, such as R and SPSS. We understand that some new tools, such as Spark's MLlib, may replace certain old ones, but not all of them. Using Spark together with our existing tools is therefore essential for us as Spark users, and it is one of the main focuses of this book; it is also a critical element that sets this book apart from other Spark books.
Overall, this is a Spark book written by a data scientist for data scientists and machine learning professionals, with the aim of making machine learning with Spark easy for us.
Chapter 1, Spark for Machine Learning, introduces Apache Spark from a machine learning perspective. We will discuss Spark DataFrames and R, Spark pipelines, the RM4Es data science framework, as well as the Spark notebook and implementation models.
Chapter 2, Data Preparation for Spark ML, focuses on data preparation for machine learning on Apache Spark with tools such as Spark SQL. We will discuss data cleaning, identity matching, data merging, and feature development.
Chapter 3, A Holistic View on Spark, clearly explains the RM4Es machine learning framework and its processes with a real-life example, and also demonstrates how easily businesses can obtain holistic views of their data with Spark.
Chapter 4, Fraud Detection on Spark, discusses how Spark makes machine learning for fraud detection easy and fast. At the same time, we will illustrate a step-by-step process of obtaining fraud insights from big data.
Chapter 5, Risk Scoring on Spark, reviews machine learning methods and processes for a risk scoring project and implements them using R notebooks on Apache Spark in a special DataScientistWorkbench environment. Our focus for this chapter is the notebook.
Chapter 6, Churn Prediction on Spark, further illustrates our special step-by-step machine learning process on Spark with a focus on using MLlib to develop customer churn predictions to improve customer retention.
Chapter 7, Recommendations on Spark, describes how to develop recommendations with big data on Spark by utilizing SPSS on the Spark system.
Chapter 8, Learning Analytics on Spark, extends our application to serve learning organizations like universities and training institutions, for which we will apply machine learning to improve learning analytics for a real case of predicting student attrition.
Chapter 9, City Analytics on Spark, helps readers gain a better understanding of how Apache Spark can be utilized not only for commercial purposes but also for public ones, serving cities with a real use case of predicting service requests on Spark.
Chapter 10, Learning Telco Data on Spark, further extends what was studied in the previous chapters and allows readers to combine what they have learned for dynamic machine learning with huge amounts of telco data on Spark.
Chapter 11, Modeling Open Data on Spark, presents dynamic machine learning with open data on Spark, in which users take a data-driven approach and utilize all the available technologies for optimal results. This chapter is an extension of Chapter 9, City Analytics on Spark, and Chapter 10, Learning Telco Data on Spark, as well as a good review of all the previous chapters through a real-life project.
Throughout this book, we assume that you have some basic programming experience in either Scala or Python, some basic experience with modeling tools such as R or SPSS, and some basic knowledge of machine learning and data science.
This book is written for analysts, data scientists, researchers, and machine learning professionals who need to process Big Data but who are not necessarily familiar with Spark.
Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.
To send us general feedback, simply e-mail <[email protected]>, and mention the book's title in the subject of your message.
If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.
Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.
We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from http://www.packtpub.com/sites/default/files/downloads/ApacheSparkMachineLearningBlueprints_ColorImages.pdf.
Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.
To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.
Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.
Please contact us at <[email protected]> with a link to the suspected pirated material.
We appreciate your help in protecting our authors and our ability to bring you valuable content.
If you have a problem with any aspect of this book, you can contact us at <[email protected]>, and we will do our best to address the problem.
This chapter provides an introduction to Apache Spark from a machine learning (ML) and data analytics perspective, and also discusses machine learning in relation to Spark computing. Here, we first present an overview of Apache Spark, as well as Spark's advantages for data analytics in comparison to MapReduce and other computing platforms. We then discuss the main topics that any data scientist or machine learning professional is expected to master in order to take full advantage of Apache Spark computing; this chapter covers each of them in turn.
In this section, we provide an overview of the Apache Spark computing platform and a discussion about some advantages of utilizing Apache Spark, in comparison to using other computing platforms like MapReduce. Then, we briefly discuss how Spark computing fits modern machine learning and big data analytics.
After this section, readers will have a basic understanding of Apache Spark, as well as a good understanding of the important benefits it brings to machine learning.
Apache Spark is a computing framework for the fast processing of big data. This framework contains a distributed computing engine and a specially designed programming model. Spark started as a research project at the AMPLab of the University of California, Berkeley in 2009; it was open sourced in 2010 and later donated to the Apache Software Foundation. Since then, Apache Spark has experienced exponential growth, and it is now the most active open source project in the big data field.
Spark uses an in-memory, distributed computing approach, which makes it one of the fastest platforms available, especially for iterative computation. According to many published tests, it can run up to 100 times faster than Hadoop MapReduce.
Apache Spark provides a unified platform, consisting of the Spark core engine and four libraries: Spark SQL, Spark Streaming, MLlib, and GraphX. All four libraries have Python, Java, and Scala programming APIs.
Besides these four built-in libraries, there are also dozens of packages for Apache Spark provided by third parties, which can be used for handling data sources, machine learning, and other tasks.
Apache Spark follows a three-month cycle for new releases, with Spark version 1.6.0 released on January 4, 2016. Release 1.3 included the DataFrame API and the ML Pipelines API, and starting from release 1.4, the R interface (SparkR) has been included by default.
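As a flavor of the DataFrame API just mentioned, here is a minimal sketch written for the Scala spark-shell of the 1.x releases, where sc and sqlContext are predefined; the people.json file and its age column are hypothetical:

```scala
// sqlContext is predefined in the 1.x spark-shell; "people.json" is a
// placeholder for any file containing one JSON record per line
val df = sqlContext.read.json("people.json")
df.printSchema()                  // inspect the schema Spark inferred
df.filter(df("age") > 21).show() // assumes a numeric "age" field exists
```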
To download Apache Spark, readers should go to http://spark.apache.org/downloads.html.
To install Apache Spark and start running it, readers should consult its latest documentation at http://spark.apache.org/docs/latest/.
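Once installed, the quickest way to start experimenting is through the interactive shells bundled with the distribution. A minimal sketch, assuming the downloaded package has been unpacked and you are in its top-level directory:

```
$ ./bin/spark-shell   # Scala shell; predefines a SparkContext (sc) and a SQLContext (sqlContext)
$ ./bin/pyspark       # the Python equivalent
$ ./bin/sparkR        # the R shell, bundled since release 1.4
```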
Apache Spark has many advantages over MapReduce and other big data computing platforms. Among them, the two most distinguishing are that it is fast to run and fast to write.
Overall, Apache Spark has kept some of MapReduce's most important advantages, such as scalability and fault tolerance, while extending them greatly with new technologies.
In comparison to MapReduce, Apache Spark's engine is capable of executing a more general Directed Acyclic Graph (DAG) of operators. Therefore, when using Apache Spark to execute MapReduce-style graphs, users can achieve higher-performance batch processing than on Hadoop.
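As a small illustration of such a DAG, the classic word count fits into one chain of transformations in the Scala shell; the input path below is a placeholder:

```scala
// Each transformation only records a step in the DAG; Spark plans and runs
// the whole graph when the action (take) is finally invoked
val counts = sc.textFile("input.txt")   // placeholder path
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)   // a complete MapReduce-style job in one chained expression
counts.take(10).foreach(println)
```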
Apache Spark has in-memory processing capabilities and uses a new data abstraction, the Resilient Distributed Dataset (RDD), which enables highly iterative computing and interactive applications while also extending Spark's fault tolerance.
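To see why this matters for iterative work, consider the following toy sketch for the Scala shell: cache() pins the RDD in memory, so the ten passes below reuse it rather than recomputing it from the source each time. The dataset and computation are made up for illustration:

```scala
// cache() keeps the RDD in memory across the iterations that follow
val points = sc.parallelize(1 to 1000000).map(_.toDouble).cache()

var result = 0.0
for (step <- 1 to 10) {
  // each pass reuses the in-memory partitions instead of rebuilding them
  result = points.map(x => x / step).mean()
}
println(s"Result after 10 passes: $result")
```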
At the same time, Apache Spark makes it possible to represent complex pipelines with only a few lines of code. It is best known for the ease with which it can be used to create algorithms that capture insight from complex and even messy data, and for enabling users to apply that insight in time to drive outcomes.
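For instance, a simple text-classification pipeline can be expressed and fitted in roughly a dozen lines of Scala. The sketch below follows the style of the standard ML Pipelines examples, with a small made-up training DataFrame; again, sqlContext is the shell's predefined SQLContext:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// A tiny in-line labeled dataset; in practice this would come from real data
val training = sqlContext.createDataFrame(Seq(
  (0L, "spark is fast", 1.0),
  (1L, "disk based batch jobs", 0.0),
  (2L, "spark mllib pipelines", 1.0),
  (3L, "slow legacy workflow", 0.0)
)).toDF("id", "text", "label")

// Three stages -- tokenize, hash words into feature vectors, fit a classifier --
// assembled and fitted as a single unit
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol(tokenizer.getOutputCol).setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

val model = pipeline.fit(training)   // the fitted model can now transform new data
```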
As summarized by the Apache Spark team, Spark enables workloads ranging from iterative algorithms and interactive data mining to streaming analytics. To a practical data scientist, Apache Spark easily demonstrates its advantages as soon as it is adopted for everyday data processing and model building.
Most users are satisfied with Apache Spark's advantages in speed and performance, but some also noted that Apache Spark is still in the process of maturing.
Some examples of these Spark benefits realized in practice can be found at http://www.svds.com/use-cases-for-apache-spark/.
With its innovations around RDDs and in-memory processing, Apache Spark has truly made distributed computing easily accessible to data scientists and machine learning professionals. According to the Apache Spark team, Spark runs on the Mesos cluster manager, letting it share resources with Hadoop and other applications, and it can therefore read from any Hadoop input source, such as HDFS.
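In practice, this means that an HDFS-resident dataset loads with the same one-liner used for a local file; a sketch in which the namenode host, port, and path are all placeholders:

```scala
// Only the URI scheme distinguishes HDFS input from local input
val logs = sc.textFile("hdfs://namenode:9000/data/server_logs")
println(s"Log lines: ${logs.count()}")
```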
For these reasons, the Apache Spark computing model is very well suited to distributed machine learning. Especially for rapid interactive machine learning, parallel computing, and complicated modelling at scale, Apache Spark should definitely be utilized.
According to the Spark development team, Spark's philosophy is to make life easy and productive for data scientists and machine learning professionals, and Apache Spark has been designed accordingly. Per the introduction by Patrick Wendell, co-founder of Databricks, Spark was especially made for large-scale data processing; it supports agile data science with rapid iteration, and it can be integrated easily with IBM and other solutions.
In this section, we review the algorithms needed for machine learning and introduce machine learning libraries, including Spark's MLlib and IBM's SystemML; we then discuss their integration with Apache Spark.
After reading this section, readers will be familiar with various machine learning libraries, including Spark's MLlib, and will know how to make them ready for machine learning work.
To complete a machine learning project, data scientists often employ classification or regression algorithms to develop and evaluate predictive models; these algorithms are readily available in machine learning tools such as R or MATLAB. Besides datasets and computing platforms, such machine learning libraries, as collections of machine learning algorithms, are a necessary part of every project.
For example, the strength and depth of the popular R language come mainly from the various algorithms it readily provides for machine learning professionals; the total number of R packages is well over 1,000. Data scientists do not need all of them, but they do need a selection of packages suited to their data preparation, visualization, and modelling tasks.
According to a recent ComputerWorld survey, the most downloaded R packages are:
PACKAGE          # OF DOWNLOADS
Rcpp             162,778
ggplot2          146,008
plyr             123,889
stringr          120,387
colorspace       118,798
digest           113,899
reshape2         109,869
RColorBrewer     100,623
scales            92,448
manipulate        88,664
For more information, please visit http://www.computerworld.com/article/2920117/business-intelligence/most-downloaded-r-packages-last-month.html.
MLlib is Apache Spark's machine learning library. It is scalable and consists of many commonly used machine learning algorithms. Built into MLlib are algorithms for classification, regression, clustering, collaborative filtering, and dimensionality reduction, as well as lower-level optimization primitives.
Spark MLlib is still under active development, with new algorithms expected to be added in every new release. In line with Apache Spark's computing philosophy, MLlib is built for ease of use and deployment, with high performance.
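As a small taste of this ease of use, here is a sketch of clustering with MLlib's k-means in the Scala shell; the input file, its space-separated format, and the choice of two clusters and twenty iterations are all assumptions made for illustration:

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// "features.txt" is a placeholder: numeric features, space separated, one row per line
val parsed = sc.textFile("features.txt")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
  .cache()

// Cluster the rows into two groups, running at most 20 iterations
val model = KMeans.train(parsed, 2, 20)
println(s"Within-set sum of squared errors: ${model.computeCost(parsed)}")
```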
MLlib uses the linear algebra package Breeze, which depends on netlib-java and jblas; these, in turn, depend on native Fortran routines. Users need to install the gfortran runtime library if it is not already present on their nodes, as MLlib will throw a linking error if it cannot detect these libraries automatically.
For MLlib use cases and further details on how to use MLlib, please visit:
http://spark.apache.org/docs/latest/mllib-guide.html.
As discussed in the previous section, MLlib makes available many frequently used algorithms, such as regression and classification. However, these basics are not enough for complicated machine learning projects.
If we waited for the Apache Spark team to add all the needed ML algorithms, it could take a long time. The good news is that many third parties have contributed ML libraries to Apache Spark.
IBM has contributed its machine learning library, SystemML, to Apache Spark.
Besides what MLlib provides, SystemML offers many additional ML algorithms, such as those for missing-data imputation, SVMs, GLMs, ARIMA, and non-linear optimization, along with some graphical modelling and matrix factorization algorithms.
Developed by the IBM Almaden Research group, IBM's SystemML is an engine for distributed machine learning that can scale to arbitrarily large data sizes. SystemML is modeled after R syntax and semantics, and it provides the ability to author new algorithms via its own language.
Through good integration with R via SparkR, Apache Spark users also have the potential to utilize thousands of R packages for machine learning algorithms when needed. As will be discussed in later sections of this chapter, the SparkR notebook makes this very easy.
For more about IBM SystemML, please visit http://researcher.watson.ibm.com/researcher/files/us-ytian/systemML.pdf
