Apache Spark Machine Learning Blueprints

Alex Liu

Description

Develop a range of cutting-edge machine learning projects with Apache Spark using this actionable guide

About This Book

  • Customize Apache Spark and R to fit your analytical needs in customer research, fraud detection, risk analytics, and recommendation engine development
  • Develop a set of practical Machine Learning applications that can be implemented in real-life projects
  • A comprehensive, project-based guide to improve and refine your predictive models for practical implementation

Who This Book Is For

If you are a data scientist, a data analyst, or an R and SPSS user with a good understanding of machine learning concepts, algorithms, and techniques, then this is the book for you. Some basic understanding of Spark and its core elements and application is required.

What You Will Learn

  • Set up Apache Spark for machine learning and discover its impressive processing power
  • Combine Spark and R to unlock detailed business insights essential for decision making
  • Build machine learning systems with Spark that can detect fraud and analyze financial risks
  • Build predictive models focusing on customer scoring and service ranking
  • Build a recommendation system using SPSS on Apache Spark
  • Tackle parallel computing and find out how it can support your machine learning projects
  • Turn open data and communication data into actionable insights by making use of various forms of machine learning

In Detail

There's a reason why Apache Spark has become one of the most popular tools in Machine Learning – its ability to handle huge datasets at an impressive speed means you can be much more responsive to the data at your disposal. This book shows you Spark at its very best, demonstrating how to connect it with R and unlock maximum value not only from the tool but also from your data.

Packed with a range of project "blueprints" that demonstrate some of the most interesting challenges that Spark can help you tackle, you'll find out how to use Spark notebooks and access, clean, and join different datasets before putting your knowledge into practice with some real-world projects, in which you will see how Spark Machine Learning can help you with everything from fraud detection to analyzing customer attrition. You'll also find out how to build a recommendation engine using Spark's parallel computing powers.

Style and approach

This book offers a step-by-step approach to setting up Apache Spark and using other analytical tools with it to process Big Data and build machine learning projects. The initial chapters focus more on the theory aspects of machine learning with Spark, while each of the later chapters focuses on building standalone projects using Spark.


Table of Contents

Apache Spark Machine Learning Blueprints
Credits
About the Author
About the Reviewer
www.PacktPub.com
eBooks, discount offers, and more
Why subscribe?
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the color images of this book
Errata
Piracy
Questions
1. Spark for Machine Learning
Spark overview and Spark advantages
Spark overview
Spark advantages
Spark computing for machine learning
Machine learning algorithms
MLlib
Other ML libraries
Spark RDD and dataframes
Spark RDD
Spark dataframes
Dataframes API for R
ML frameworks, RM4Es and Spark computing
ML frameworks
RM4Es
The Spark computing framework
ML workflows and Spark pipelines
ML as a step-by-step workflow
ML workflow examples
Spark notebooks
Notebook approach for ML
Step 1: Getting the software ready
Step 2: Installing the Knitr package
Step 3: Creating a simple report
Spark notebooks
Summary
2. Data Preparation for Spark ML
Accessing and loading datasets
Accessing publicly available datasets
Loading datasets into Spark
Exploring and visualizing datasets
Data cleaning
Dealing with data incompleteness
Data cleaning in Spark
Data cleaning made easy
Identity matching
Identity issues
Identity matching on Spark
Entity resolution
Short string comparison
Long string comparison
Record deduplication
Identity matching made better
Crowdsourced deduplication
Configuring the crowd
Using the crowd
Dataset reorganizing
Dataset reorganizing tasks
Dataset reorganizing with Spark SQL
Dataset reorganizing with R on Spark
Dataset joining
Dataset joining and its tool – the Spark SQL
Dataset joining in Spark
Dataset joining with the R data table package
Feature extraction
Feature development challenges
Feature development with Spark MLlib
Feature development with R
Repeatability and automation
Dataset preprocessing workflows
Spark pipelines for dataset preprocessing
Dataset preprocessing automation
Summary
3. A Holistic View on Spark
Spark for a holistic view
The use case
Fast and easy computing
Methods for a holistic view
Regression modeling
The SEM approach
Decision trees
Feature preparation
PCA
Grouping by category to use subject knowledge
Feature selection
Model estimation
MLlib implementation
The R notebooks' implementation
Model evaluation
Quick evaluations
RMSE
ROC curves
Results explanation
Impact assessments
Deployment
Dashboard
Rules
Summary
4. Fraud Detection on Spark
Spark for fraud detection
The use case
Distributed computing
Methods for fraud detection
Random forest
Decision trees
Feature preparation
Feature extraction from LogFile
Data merging
Model estimation
MLlib implementation
R notebooks implementation
Model evaluation
A quick evaluation
Confusion matrix and false positive ratios
Results explanation
Big influencers and their impacts
Deploying fraud detection
Rules
Scoring
Summary
5. Risk Scoring on Spark
Spark for risk scoring
The use case
Apache Spark notebooks
Methods of risk scoring
Logistic regression
Preparing coding in R
Random forest and decision trees
Preparing coding
Data and feature preparation
OpenRefine
Model estimation
The DataScientistWorkbench for R notebooks
R notebooks implementation
Model evaluation
Confusion matrix
ROC
Kolmogorov-Smirnov
Results explanation
Big influencers and their impacts
Deployment
Scoring
Summary
6. Churn Prediction on Spark
Spark for churn prediction
The use case
Spark computing
Methods for churn prediction
Regression models
Decision trees and Random forest
Feature preparation
Feature extraction
Feature selection
Model estimation
Spark implementation with MLlib
Model evaluation
Results explanation
Calculating the impact of interventions
Deployment
Scoring
Intervention recommendations
Summary
7. Recommendations on Spark
Apache Spark for a recommendation engine
The use case
SPSS on Spark
Methods for recommendation
Collaborative filtering
Preparing coding
Data treatment with SPSS
Missing data nodes on SPSS modeler
Model estimation
SPSS on Spark – the SPSS Analytics server
Model evaluation
Recommendation deployment
Summary
8. Learning Analytics on Spark
Spark for attrition prediction
The use case
Spark computing
Methods of attrition prediction
Regression models
About regression
Preparing for coding
Decision trees
Preparing for coding
Feature preparation
Feature development
Feature selection
Principal components analysis
Subject knowledge aid
ML feature selection
Model estimation
Spark implementation with the Zeppelin notebook
Model evaluation
A quick evaluation
The confusion matrix and error ratios
Results explanation
Calculating the impact of interventions
Calculating the impact of main causes
Deployment
Rules
Scoring
Summary
9. City Analytics on Spark
Spark for service forecasting
The use case
Spark computing
Methods of service forecasting
Regression models
About regression
Preparing for coding
Time series modeling
About time series
Preparing for coding
Data and feature preparation
Data merging
Feature selection
Model estimation
Spark implementation with the Zeppelin notebook
Spark implementation with the R notebook
Model evaluation
RMSE calculation with MLlib
RMSE calculation with R
Explanations of the results
Biggest influencers
Visualizing trends
The rules of sending out alerts
Scores to rank city zones
Summary
10. Learning Telco Data on Spark
Spark for using Telco Data
The use case
Spark computing
Methods for learning from Telco Data
Descriptive statistics and visualization
Linear and logistic regression models
Decision tree and random forest
Data and feature development
Data reorganizing
Feature development and selection
Model estimation
SPSS on Spark – SPSS Analytics Server
Model evaluation
RMSE calculations with MLlib
RMSE calculations with R
Confusion matrix and error ratios with MLlib and R
Results explanation
Descriptive statistics and visualizations
Biggest influencers
Special insights
Visualizing trends
Model deployment
Rules to send out alerts
Scores subscribers for churn and for Call Center calls
Scores subscribers for purchase propensity
Summary
11. Modeling Open Data on Spark
Spark for learning from open data
The use case
Spark computing
Methods for scoring and ranking
Cluster analysis
Principal component analysis
Regression models
Score resembling
Data and feature preparation
Data cleaning
Data merging
Feature development
Feature selection
Model estimation
SPSS on Spark – SPSS Analytics Server
Model evaluation
RMSE calculations with MLlib
RMSE calculations with R
Results explanation
Comparing ranks
Biggest influencers
Deployment
Rules for sending out alerts
Scores for ranking school districts
Summary
Index

Apache Spark Machine Learning Blueprints

Copyright © 2016 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: May 2016

Production reference: 1250516

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham B3 2PB, UK.

ISBN 978-1-78588-039-1

www.packtpub.com

Credits

Author

Alex Liu

Reviewer

Hao Ren

Commissioning Editor

Dipika Gaonkar

Acquisition Editor

Meeta Rajani

Content Development Editor

Anish Sukumaran

Technical Editors

Dhiraj Chandanshive

Siddhesh Patil

Copy Editor

Shruti Iyer

Project Coordinator

Izzat Contractor

Proofreader

Safis Editing

Indexer

Mariammal Chettiyar

Graphics

Disha Haria

Production Coordinator

Nilesh R. Mohite

Cover Work

Nilesh R. Mohite

About the Author

Alex Liu is an expert in research methods and data science. He is currently one of IBM's leading experts in big data analytics and a lead data scientist, serving big corporations, developing big data analytics IPs, and speaking at industry conferences such as STRATA, Insights, SMAC, and BigDataCamp. In the past, Alex served as chief or lead data scientist for a few companies, including Yapstone, RS, and TRG. Before this, he was a lead consultant and director at RMA, where he provided data analytics consultation and training to many well-known organizations, including the United Nations, Indymac, AOL, Ingram Micro, GEM, Farmers Insurance, Scripps Networks, Sears, and USAID. At the same time, Dr. Liu taught advanced research methods to PhD candidates at the University of Southern California and the University of California, Irvine. Before this, he worked as a managing director for CATE/GEC and as a research fellow at the Asia/Pacific Research Center at Stanford University. Alex has a Ph.D. in quantitative sociology and a Master of Science degree in statistical computing from Stanford University.

I would like to thank IBM for providing a great open and innovative environment to learn and practice Big Data analytics. I would especially like to thank my managers, Kim Siegel and Kevin Zachary, for their support and encouragement, without which it would not have been possible to complete this book.

I also would like to thank my beautiful wife, Lauria, and two beautiful daughters, Kate and Khloe, for their patience and support, which enabled me to work effectively. Finally, I would like to thank the Packt staff, especially Anish Sukumaran and Meeta Rajani, for making the writing and editing process smooth and joyful.

About the Reviewer

Hao Ren is a data engineer working in Paris for leboncoin (https://www.leboncoin.fr/), a classified advertising website that is the fifth most visited site in France. His career is defined by three years of work experience in functional programming with Scala, machine learning, and distributed systems. Hao's main specialty is machine learning with Apache Spark, including building a crawler detection system, a recommender system, and so on. He has also reviewed a more detailed and advanced book by Packt Publishing, Machine Learning with Spark, which is worth a read as well.

www.PacktPub.com

eBooks, discount offers, and more

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at <[email protected]> for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

https://www2.packtpub.com/books/subscription/packtlib

Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can search, access, and read Packt's entire library of books.

Why subscribe?

  • Fully searchable across every book published by Packt
  • Copy and paste, print, and bookmark content
  • On demand and accessible via a web browser

Preface

As data scientists and machine learning professionals, our jobs are to build models for detecting fraud, predicting customer churn, or, in a broad sense, turning data into insights; for this, we sometimes need to process huge amounts of data and handle complicated computations. Therefore, we are always excited to see new computing tools, such as Spark, and spend a lot of time learning about them. Plenty of learning materials are available for these new tools, but most are written from a computing perspective, often by computer scientists.

We, the data scientists and machine learning professionals, as users of Spark, are more concerned about how these new systems can help us build models with more predictive accuracy and how they can make data processing and coding easier for us. This is the main reason why this book has been developed, and why it has been written by a data scientist.

At the same time, we, as data scientists and machine learning professionals, have already developed our frameworks and processes, and we already use good model-building tools, such as R and SPSS. We understand that some new tools, such as Spark's MLlib, may replace certain old tools, but not all of them. Therefore, using Spark together with our existing tools is essential for us as Spark users, and it becomes one of the main focuses of this book; it is also one of the critical elements that makes this book different from other Spark books.

Overall, this is a Spark book written by a data scientist for data scientists and machine learning professionals to make machine learning easy for us with Spark.

What this book covers

Chapter 1, Spark for Machine Learning, introduces Apache Spark from a machine learning perspective. We will discuss Spark dataframes and R, Spark pipelines, RM4Es data science framework, as well as the Spark notebook and implementation models.

Chapter 2, Data Preparation for Spark ML, focuses on data preparation for machine learning on Apache Spark with tools such as Spark SQL. We will discuss data cleaning, identity matching, data merging, and feature development.

Chapter 3, A Holistic View on Spark, clearly explains the RM4E machine learning framework and processes with a real-life example and also demonstrates the benefits of obtaining holistic views for businesses easily with Spark.

Chapter 4, Fraud Detection on Spark, discusses how Spark makes machine learning for fraud detection easy and fast. At the same time, we will illustrate a step-by-step process of obtaining fraud insights from big data.

Chapter 5, Risk Scoring on Spark, reviews machine learning methods and processes for a risk scoring project and implements them using R notebooks on Apache Spark in a special DataScientistWorkbench environment. Our focus for this chapter is the notebook.

Chapter 6, Churn Prediction on Spark, further illustrates our special step-by-step machine learning process on Spark with a focus on using MLlib to develop customer churn predictions to improve customer retention.

Chapter 7, Recommendations on Spark, describes how to develop recommendations with big data on Spark by utilizing SPSS on the Spark system.

Chapter 8, Learning Analytics on Spark, extends our application to serve learning organizations like universities and training institutions, for which we will apply machine learning to improve learning analytics for a real case of predicting student attrition.

Chapter 9, City Analytics on Spark, helps the readers gain a better understanding of how Apache Spark can be utilized not only for commercial use, but also for public use, serving cities, with a real use case of predicting service requests on Spark.

Chapter 10, Learning Telco Data on Spark, further extends what was studied in the previous chapters and allows readers to combine what they have learned for dynamic machine learning with huge amounts of Telco Data on Spark.

Chapter 11, Modeling Open Data on Spark, presents dynamic machine learning with open data on Spark from which users can take a data-driven approach and utilize all the technologies available for optimal results. This chapter is an extension of Chapter 9, City Analytics on Spark, and Chapter 10, Learning Telco Data on Spark, as well as a good review of all the previous chapters with a real-life project.

What you need for this book

Throughout this book, we assume that you have some basic experience of programming, either in Scala or Python; some basic experience with modeling tools, such as R or SPSS; and some basic knowledge of machine learning and data science.

Who this book is for

This book is written for analysts, data scientists, researchers, and machine learning professionals who need to process Big Data but who are not necessarily familiar with Spark.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.

To send us general feedback, simply e-mail <[email protected]>, and mention the book's title in the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the color images of this book

We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from http://www.packtpub.com/sites/default/files/downloads/ApacheSparkMachineLearningBlueprints_ColorImages.pdf.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.

To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at <[email protected]> with a link to the suspected pirated material.

We appreciate your help in protecting our authors and our ability to bring you valuable content.

Questions

If you have a problem with any aspect of this book, you can contact us at <[email protected]>, and we will do our best to address the problem.

Chapter 1. Spark for Machine Learning

This chapter provides an introduction to Apache Spark from a Machine Learning (ML) and data analytics perspective, and also discusses machine learning in relation to Spark computing. Here, we first present an overview of Apache Spark, as well as Spark's advantages for data analytics, in comparison to MapReduce and other computing platforms. Then we discuss five main issues, as below:

  • Machine learning algorithms and libraries
  • Spark RDD and dataframes
  • Machine learning frameworks
  • Spark pipelines
  • Spark notebooks

All of the above are the most important topics that any data scientist or machine learning professional is expected to master, in order to fully take advantage of Apache Spark computing. Specifically, this chapter will cover all of the following six topics.

  • Spark overview and Spark advantages
  • ML algorithms and ML libraries for Spark
  • Spark RDD and dataframes
  • ML Frameworks, RM4Es and Spark computing
  • ML workflows and Spark pipelines
  • Spark notebooks introduction

Spark overview and Spark advantages

In this section, we provide an overview of the Apache Spark computing platform and a discussion about some advantages of utilizing Apache Spark, in comparison to using other computing platforms like MapReduce. Then, we briefly discuss how Spark computing fits modern machine learning and big data analytics.

After this section, readers will form a basic understanding of Apache Spark as well as a good understanding of some important machine learning benefits from utilizing Apache Spark.

Spark overview

Apache Spark is a computing framework for the fast processing of big data. This framework contains a distributed computing engine and a specially designed programming model. Spark started as a research project at the AMPLab of the University of California at Berkeley in 2009; it was open sourced in 2010 and donated to the Apache Software Foundation in 2013. Since then, Apache Spark has experienced exponential growth, and it is now the most active open source project in the big data field.

Spark's computing utilizes an in-memory distributed computational approach, which makes Spark computing among the fastest, especially for iterative computation. It can run up to 100 times faster than Hadoop MapReduce, according to many tests that have been performed.

Apache Spark has a unified platform, which consists of the Spark core engine and four libraries: Spark SQL, Spark Streaming, MLlib, and GraphX. All four of these libraries have Python, Java, and Scala programming APIs.

Besides the four built-in libraries mentioned above, there are also dozens of third-party packages available for Apache Spark, which can be used for handling data sources, machine learning, and other tasks.

Apache Spark has a three-month cycle for new releases, with Spark version 1.6.0 released on January 4, 2016. Apache Spark release 1.3 included the DataFrames API and the ML Pipelines API. Starting from Apache Spark release 1.4, the R interface (SparkR) is included by default.
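
To give a flavor of the DataFrames API mentioned above, the following is a minimal sketch for Spark 1.6, run from the Scala shell; the sqlContext is preconfigured by spark-shell, and the JSON file path is a hypothetical example:

    // Read a JSON file (hypothetical path) into a DataFrame
    val df = sqlContext.read.json("examples/people.json")
    df.printSchema()                  // inspect the inferred schema
    df.filter(df("age") > 21).show()  // column expressions work like SQL predicates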

Note

To download Apache Spark, readers should go to http://spark.apache.org/downloads.html.

To install Apache Spark and start running it, readers should consult its latest documentation at http://spark.apache.org/docs/latest/.
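
Once Spark is installed, a quick way to verify the setup is to run a few lines in the interactive Scala shell, started with bin/spark-shell from the installation directory. A minimal sketch, assuming a local installation:

    // spark-shell provides a preconfigured SparkContext as `sc`
    val data = sc.parallelize(1 to 1000)          // distribute a local collection
    val evens = data.filter(_ % 2 == 0).count()   // lazy transformation plus an action
    println(s"Even numbers: $evens")              // expect 500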

Spark advantages

Apache Spark has many advantages over MapReduce and other big data computing platforms. Among them, the two most distinguishing are that it is fast to run and fast to write.

Overall, Apache Spark has kept some of MapReduce's most important advantages, such as scalability and fault tolerance, while extending them greatly with new technologies.

In comparison to MapReduce, Apache Spark's engine is capable of executing a more general Directed Acyclic Graph (DAG) of operators. Therefore, when using Apache Spark to execute MapReduce-style graphs, users can achieve higher-performance batch processing than in Hadoop.

Apache Spark has in-memory processing capabilities, and uses a new data abstraction method, Resilient Distributed Dataset (RDD), which enables highly iterative computing and reactive applications. This also extended its fault tolerance capability.
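
As a minimal sketch of the iterative computing that RDDs enable, the snippet below caches a dataset in memory and runs a simple gradient-descent loop over it; the input path is hypothetical, and `sc` is the usual SparkContext:

    // Cache the parsed data so every iteration after the first reads from memory
    val nums = sc.textFile("numbers.txt").map(_.toDouble).cache()
    var estimate = 0.0
    for (i <- 1 to 10) {
      // without cache(), each pass would re-read the file from disk
      val gradient = nums.map(x => estimate - x).mean()
      estimate -= 0.5 * gradient   // converges toward the true mean
    }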

At the same time, Apache Spark makes representing complex pipelines easy, with only a few lines of code needed. It is best known for the ease with which it can be used to create algorithms that capture insight from complex and even messy data, and that also enable users to apply that insight in time to drive outcomes.
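
As an illustration of this point, here is a minimal sketch of a three-stage spark.ml pipeline in Scala (the Pipelines API is available since Spark 1.3); `training` is assumed to be an existing DataFrame with "text" and "label" columns:

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

    // Tokenize text, hash the tokens into feature vectors, then fit a classifier
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    val lr = new LogisticRegression().setMaxIter(10)

    // The whole workflow is represented and fit as a single pipeline
    val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
    val model = pipeline.fit(training)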

As summarized by the Apache Spark team, Spark enables:

  • Iterative algorithms in Machine Learning
  • Interactive data mining and data processing
  • Hive-compatible data warehousing that can run 100x faster
  • Stream processing
  • Sensor data processing

For a practicing data scientist working with the above, Apache Spark readily demonstrates its advantages when it is adopted for:

  • Parallel computing
  • Interactive analytics
  • Complex computation

Most users are satisfied with Apache Spark's advantages in speed and performance, but some have also noted that Apache Spark is still maturing.

Note

http://www.svds.com/use-cases-for-apache-spark/ has some examples of materialized Spark benefits.

Spark computing for machine learning

With its innovations in RDD and in-memory processing, Apache Spark has truly made distributed computing easily accessible to data scientists and machine learning professionals. According to the Apache Spark team, Apache Spark can run on the Mesos cluster manager, letting it share resources with Hadoop and other applications, and it can read from any Hadoop input source, such as HDFS.
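
For example, counting error lines in log files stored on HDFS takes only a couple of lines; the namenode host and path below are hypothetical:

    // Read log files from HDFS and count the error lines; `sc` is the SparkContext
    val logs = sc.textFile("hdfs://namenode:8020/data/logs/*.log")
    val errorCount = logs.filter(_.contains("ERROR")).count()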

For the above reasons, the Apache Spark computing model is very well suited to distributed computing for machine learning. Especially for rapid interactive machine learning, parallel computing, and complicated modeling at scale, Apache Spark should definitely be utilized.

According to the Spark development team, Spark's philosophy is to make life easy and productive for data scientists and machine learning professionals. Due to this, Apache Spark has:

  • Well-documented, expressive APIs
  • Powerful domain-specific libraries
  • Easy integration with storage systems
  • Caching to avoid data movement

Per the introduction by Patrick Wendell, co-founder of Databricks, Spark is especially made for large-scale data processing. Apache Spark supports agile data science that iterates rapidly, and Spark can easily be integrated with IBM and other solutions.

Machine learning algorithms

In this section, we review algorithms that are needed for machine learning, and we introduce machine learning libraries, including Spark's MLlib and IBM's SystemML; then we discuss their integration with Apache Spark.

After reading this section, readers will become familiar with various machine learning libraries including Spark's MLlib, and know how to make them ready for machine learning.

To complete a machine learning project, data scientists often employ classification or regression algorithms to develop and evaluate predictive models, which are readily available in some machine learning tools, such as R or MATLAB. Besides data sets and computing platforms, these machine learning libraries, as collections of machine learning algorithms, are necessary to complete a machine learning project.

For example, the strength and depth of the popular R language mainly comes from the various algorithms that are readily provided for the use of machine learning professionals. The total number of R packages is over 1,000. Data scientists do not need all of them, but they do need some packages to:

  • Load data, with packages like RODBC or RMySQL
  • Manipulate data, with packages like stringr or lubridate
  • Visualize data, with packages like ggplot2 or leaflet
  • Model data, with packages like randomForest or survival
  • Report results, with packages like shiny or markdown

According to a recent ComputerWorld survey, the most downloaded R packages are:

PACKAGE         # of DOWNLOADS
Rcpp            162778
ggplot2         146008
plyr            123889
stringr         120387
colorspace      118798
digest          113899
reshape2        109869
RColorBrewer    100623
scales          92448
manipulate      88664

Note

For more info, please visit http://www.computerworld.com/article/2920117/business-intelligence/most-downloaded-r-packages-last-month.html

MLlib

MLlib is Apache Spark's machine learning library. It is scalable, and consists of many commonly-used machine learning algorithms. Built-in to MLlib are algorithms for:

  • Handling data types in the form of vectors and matrices
  • Computing basic statistics, such as summary statistics and correlations, as well as producing simple random and stratified samples and conducting simple hypothesis testing
  • Performing classification and regression modeling
  • Collaborative filtering
  • Clustering
  • Performing dimensionality reduction
  • Conducting feature extraction and transformation
  • Frequent pattern mining
  • Optimization
  • Exporting PMML models

Spark MLlib is still under active development, with new algorithms expected to be added in every new release.

In line with Apache Spark's computing philosophy, MLlib is built for ease of use and deployment, with high performance.
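
As a minimal sketch of how little code a typical MLlib task requires (assuming Spark 1.6, a SparkContext `sc`, and the sample LIBSVM file shipped with the Spark distribution):

    import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
    import org.apache.spark.mllib.util.MLUtils

    // Load labeled data and split it into training and test sets
    val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
    val Array(training, test) = data.randomSplit(Array(0.8, 0.2), seed = 42L)

    // Train a logistic regression classifier
    val model = new LogisticRegressionWithLBFGS().run(training)

    // Score the held-out set and compute a simple accuracy figure
    val predictions = test.map(p => (model.predict(p.features), p.label))
    val accuracy = predictions.filter { case (pred, label) => pred == label }
      .count().toDouble / test.count()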

MLlib uses the linear algebra package Breeze, which depends on netlib-java, and jblas. The packages netlib-java and jblas also depend on native Fortran routines. Users need to install the gfortran runtime library if it is not already present on their nodes. MLlib will throw a linking error if it cannot detect these libraries automatically.

Note

For MLlib use cases and further details on how to use MLlib, please visit:

http://spark.apache.org/docs/latest/mllib-guide.html.

Other ML libraries

As discussed in the previous section, MLlib makes available many frequently used algorithms, such as regression and classification. But these basics are not enough for complicated machine learning.

If we wait for the Apache Spark team to add all the needed ML algorithms, it may take a long time. The good news is that many third parties have contributed ML libraries to Apache Spark.

IBM has contributed its machine learning library, SystemML, to Apache Spark.

Besides what MLlib provides, SystemML offers many additional ML algorithms, such as those for missing data imputation, SVM, GLM, ARIMA, and non-linear optimizers, as well as some graphical modeling and matrix factorization algorithms.

Developed by the IBM Almaden Research group, IBM's SystemML is an engine for distributed machine learning, and it can scale to arbitrarily large data sizes. It provides the following benefits:

  • Unifies the fractured machine learning environments
  • Gives the core Spark ecosystem a complete set of DML
  • Allows a data scientist to focus on the algorithm, not the implementation
  • Improves time to value for data science teams
  • Establishes a de facto standard for reusable machine learning routines

SystemML is modeled after R syntax and semantics, and provides the ability to author new algorithms via its own language.

Through good integration with R via SparkR, Apache Spark users also have the potential to utilize thousands of R packages for machine learning algorithms, when needed. As will be discussed in later sections of this chapter, the SparkR notebook will make this operation very easy.

Note

For more about IBM SystemML, please visit http://researcher.watson.ibm.com/researcher/files/us-ytian/systemML.pdf