Large Scale Machine Learning with Spark

Md. Rezaul Karim

Description

Discover everything you need to build robust machine learning applications with Spark 2.0

About This Book

  • Get the most up-to-date book on the market that focuses on design, engineering, and scalable solutions in machine learning with Spark 2.0.0
  • Use Spark's machine learning library in a big data environment
  • Learn how to develop high-value applications at scale with ease and develop a personalized design

Who This Book Is For

This book is for data science engineers and scientists who work with large and complex data sets. You should be familiar with the basics of machine learning concepts, statistics, and computational mathematics. Knowledge of Scala and Java is advisable.

What You Will Learn

  • Gain a solid theoretical understanding of ML algorithms
  • Configure Spark on cluster and cloud infrastructure to develop applications using Scala, Java, Python, and R
  • Scale up ML applications on large clusters or cloud infrastructures
  • Use Spark ML and MLlib to develop ML pipelines for recommendation systems, classification, regression, clustering, sentiment analysis, and dimensionality reduction
  • Handle large texts for developing ML applications with a strong focus on feature engineering
  • Use Spark Streaming to develop ML applications for real-time streaming
  • Tune ML models with cross-validation, hyperparameter tuning, and train-validation split
  • Enhance ML models to make them adaptable to new data in dynamic and incremental environments

In Detail

Data processing, implementing related algorithms, tuning, scaling up and finally deploying are some crucial steps in the process of optimising any application.

Spark is capable of handling large-scale batch and streaming data, figuring out when to cache data in memory and processing it up to 100 times faster than Hadoop-based MapReduce. This means predictive analytics can be applied to both streaming and batch data to develop complete machine learning (ML) applications far more quickly, making Spark an ideal candidate for large, data-intensive applications.

This book focuses on design, engineering, and scalable solutions using ML with Spark. First, you will learn how to install Spark with all the new features of the latest Spark 2.0 release. Moving on, you'll explore important concepts such as advanced feature engineering with RDDs and Datasets. After covering how to develop and deploy applications, you will see how to use external libraries with Spark.

In summary, you will be able to develop complete and personalized ML applications, from data collection and model building through tuning and scaling up to deployment on a cluster or in the cloud.

Style and approach

This book takes a practical approach where all the topics explained are demonstrated with the help of real-world use cases.




Table of Contents

Large Scale Machine Learning with Spark
Credits
About the Authors
About the Reviewer
www.Packtpub.com
Why subscribe?
Preface
What this book covers
What you need for this book 
Who this book is for 
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. Introduction to Data Analytics with Spark
Spark overview
Spark basics
Beauties of Spark
New computing paradigm with Spark
Traditional distributed computing
Moving code to the data
RDD – a new computing paradigm
Spark ecosystem
Spark core engine
Spark SQL
DataFrames and datasets unification
Spark streaming
Graph computation – GraphX
Machine learning and Spark ML pipelines
Statistical computation – SparkR
Spark machine learning libraries
Machine learning with Spark
Spark MLlib
Data types
Basic statistics
Classification and regression
Recommender system development
Clustering
Dimensionality reduction
Feature extraction and transformation
Frequent pattern mining
Spark ML
Installing and getting started with Spark
Packaging your application with dependencies
Running a sample machine learning application
Running a Spark application from the Spark shell
Running a Spark application on the local cluster
Running a Spark application on the EC2 cluster
References
Summary
2. Machine Learning Best Practices
What is machine learning?
Machine learning in modern literature
Machine learning and computer science
Machine learning in statistics and data analytics
Typical machine learning workflow
Machine learning tasks
Supervised learning
Unsupervised learning
Reinforcement learning
Recommender system
Semi-supervised learning
Practical machine learning problems
Machine learning classes
Classification and clustering
Rule extraction and regression
Most widely used machine learning problems
Large scale machine learning APIs in Spark
Spark machine learning libraries
Spark MLlib
Spark ML
Important notes for practitioners
Practical machine learning best practices
Best practice before developing an ML application
Good machine learning and data science are worth a lot
Best practice – feature engineering and algorithmic performance
Beware of overfitting and underfitting
Stay tuned and combining Spark MLlib with Spark ML
Making ML applications modular and simplifying pipeline synthesis
Thinking of an innovative ML system
Thinking and becoming smarter about Big Data complexities
Applying machine learning to dynamic data
Best practice after developing an ML application
How to enable real-time ML visualization
Do some error analysis
Keeping your ML application tuned
Keeping your ML application adaptive and scale-up
Choosing the right algorithm for your application
Considerations when choosing an algorithm
Accuracy
Training time
Linearity
Talking to your data when choosing an algorithm
Number of parameters
How large is your training set?
Number of features
Special notes on widely used ML algorithms
Logistic regression and linear regression
Recommendation systems
Decision trees
Random forests
Decision forests, decision jungles, and variants
Bayesian methods
Summary
3. Understanding the Problem by Understanding the Data
Analyzing and preparing your data
Data preparation process
Data selection
Data pre–processing
Data transformation
Resilient Distributed Dataset basics
Reading the Datasets
Reading from files
Reading from a text file
Reading multiple text files from a directory
Reading from existing collections
Pre–processing with RDD
Getting insight from the SMSSpamCollection dataset
Working with the key/value pair
mapToPair()
More about transformation
map and flatMap
groupByKey, reduceByKey, and aggregateByKey
sortByKey and sortBy
Dataset basics
Reading datasets to create the Dataset
Reading from the files
Reading from the Hive
Pre-processing with Dataset
More about Dataset manipulation
Running SQL queries on Dataset
Creating Dataset from the Java Bean
Dataset from string and typed class
Comparison between RDD, DataFrame and Dataset
Spark and data scientists workflow
Deeper into Spark
Shared variables
Broadcast variables
Accumulators
Summary
4. Extracting Knowledge through Feature Engineering
The state of the art of feature engineering
Feature extraction versus feature selection
Importance of feature engineering
Feature engineering and data exploration
Feature extraction – creating features out of data
Feature selection – filtering features from data
Importance of feature selection
Feature selection versus dimensionality reduction
Best practices in feature engineering
Understanding the data
Innovative way of feature extraction
Feature engineering with Spark
Machine learning pipeline – an overview
Pipeline – an example with Spark ML
Feature transformation, extraction, and selection
Transformation – RegexTokenizer
Transformation – StringIndexer
Transformation – StopWordsRemover
Extraction – TF
Extraction – IDF
Selection – ChiSqSelector
Advanced feature engineering
Feature construction
Feature learning
Iterative process of feature engineering
Deep learning
Summary
5. Supervised and Unsupervised Learning by Examples
Machine learning classes
Supervised learning
Supervised learning example
Supervised learning with Spark - an example
Air-flight delay analysis using Spark
Loading and parsing the Dataset
Feature extraction
Preparing the training and testing set
Training the model
Testing the model
Unsupervised learning
Unsupervised learning example
Unsupervised learning with Spark - an example
K-means clustering of the neighborhood
Recommender system
Collaborative filtering in Spark
Advanced learning and generalizations
Generalizations of supervised learning
Summary
6. Building Scalable Machine Learning Pipelines
Spark machine learning pipeline APIs
Dataset abstraction
Pipeline
Cancer-diagnosis pipeline with Spark
Breast-cancer-diagnosis pipeline with Spark
Background study
Dataset collection
Dataset description and preparation
Problem formalization
Developing a cancer-diagnosis pipeline with Spark ML
Cancer-prognosis pipeline with Spark
Dataset exploration
Breast-cancer-prognosis pipeline with Spark ML/MLlib
Market basket analysis with Spark Core
Background
Motivations
Exploring the dataset
Problem statements
Large-scale market basket analysis using Spark
The algorithm solution using Spark Core
Tuning and setting the correct parameters in SAMBA
OCR pipeline with Spark
Exploring and preparing the data
OCR pipeline with Spark ML and Spark MLlib
Topic modeling using Spark MLlib and ML
Topic modeling with Spark MLlib
Scalability
Credit risk analysis pipeline with Spark
What is credit risk analysis? Why is it important?
Developing a credit risk analysis pipeline with Spark ML
The dataset exploration
Credit risk pipeline with Spark ML
Performance tuning and suggestions
Scaling the ML pipelines
Size matters
Size versus skewness considerations
Cost and infrastructure
Tips and performance considerations
Summary
7. Tuning Machine Learning Models
Details about machine learning model tuning
Typical challenges in model tuning
Evaluating machine learning models
Evaluating a regression model
Evaluating a binary classification model
Evaluating a multiclass classification model
Evaluating a clustering model
Validation and evaluation techniques
Parameter tuning for machine learning models
Hyperparameter tuning
Grid search parameter tuning
Random search parameter tuning
Cross-validation
Hypothesis testing
Hypothesis testing using ChiSqTestResult of Spark MLlib
Hypothesis testing using the Kolmogorov–Smirnov test from Spark MLlib
Streaming significance testing of Spark MLlib
Machine learning model selection
Model selection via the cross-validation technique
Cross-validation and Spark
Cross-validation using Spark ML for SPAM filtering a dataset
Model selection via training validation split
Linear regression–based model selection for an OCR dataset
Logistic regression-based model selection for the cancer dataset
Summary
8. Adapting Your Machine Learning Models
Adapting machine learning models
Technical overview
The generalization of ML models
Generalized linear regression
Generalized linear regression with Spark
Adapting through incremental algorithms
Incremental support vector machine
Adapting SVMs for new data with Spark
Incremental neural networks
Multilayer perceptron classification with Spark
Incremental Bayesian networks
Classification using Naive Bayes with Spark
Adapting through reusing ML models
Problem statements and objectives
Data exploration
Developing a heart diseases predictive model
Machine learning in dynamic environments
Online learning
Statistical learning model
Adversarial model
Summary
9. Advanced Machine Learning with Streaming and Graph Data
Developing real-time ML pipelines
Streaming data collection as unstructured text data
Labeling the data towards making the supervised machine learning
Creating and building the model
Real-time predictive analytics
Tuning the ML model for improvement and model evaluation
Model adaptability and deployment
Time series and social network analysis
Time series analysis
Social network analysis
Movie recommendation using Spark
Model-based movie recommendation using Spark MLlib
Data exploration
Movie recommendation using Spark MLlib
Developing a real-time ML pipeline from streaming
Real-time tweet data collection from Twitter
Tweet collection using TwitterUtils API of Spark
Topic modeling using Spark
ML pipeline on graph data and semi-supervised graph-based learning
Introduction to GraphX
Getting and parsing graph data using the GraphX API
Finding the connected components
Summary
10. Configuring and Working with External Libraries
Third-party ML libraries with Spark
Using external libraries with Spark Core
Time series analysis using the Cloudera Spark-TS package
Time series data
Configuring Spark-TS
TimeSeriesRDD
Configuring SparkR with RStudio
Configuring Hadoop run-time on Windows
Summary

Large Scale Machine Learning with Spark

Copyright © 2016 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: October 2016

Production reference: 1201016

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham 

B3 2PB, UK.

ISBN 978-1-78588-874-8

www.packtpub.com

Credits

Authors

Md. Rezaul Karim

Md. Mahedi Kaysar 

Copy Editor

Safis Editing

Reviewer

Muthukumar Subramanian

Project Coordinator

Shweta H Birwatkar 

Commissioning Editor

Akram Hussain

Proofreader

Safis Editing

Acquisition Editor

Lester Frias

Indexer

Aishwarya Gangawane

Content Development Editor

Amrita Noronha

Graphics

Disha Haria

Technical Editor

Akash Patel

Production Coordinator

Arvindkumar Gupta

About the Authors

Md. Rezaul Karim has more than 8 years of experience in research and development, with a solid knowledge of algorithms and data structures, focusing on C, C++, Java, R, and Python, and on big data technologies such as Spark, Kafka, DC/OS, Docker, Mesos, Hadoop, and MapReduce.

He was first enchanted by machine learning while studying an Advanced Artificial Intelligence postgraduate course back in 2010, when he combined Hadoop-based MapReduce with machine learning for market basket analysis on large-scale, business-oriented transactional databases. Consequently, his research interests include machine learning, data mining, the Semantic Web, big data, and bioinformatics. He has published more than 30 research papers in renowned peer-reviewed international journals and conferences in the areas of data mining, machine learning, and bioinformatics, with good citation counts.

He is a Software Engineer and Researcher currently working at the Insight Centre for Data Analytics, Ireland (the largest data analytics center in Ireland and the largest Semantic Web research institute in the world) as a PhD Researcher. He is also a PhD candidate at the National University of Ireland, Galway. He holds an ME (Master of Engineering) degree in Computer Engineering from Kyung Hee University, Korea, majoring in data mining and knowledge discovery, and a BS (Bachelor of Science) degree in Computer Science from the University of Dhaka, Bangladesh.

Before joining the Insight Center for Data Analytics, he had been working as a Lead Software Engineer with Samsung Electronics, where he worked with the distributed Samsung R&D centers across the world, including Korea, India, Vietnam, Turkey, UAE, Brazil, and Bangladesh. Before that, he worked as a Graduate Research Assistant in the Database Lab at Kyung Hee University, Korea, while working towards his Master's degree. He also worked as an R&D Engineer with BMTech21 Worldwide, Korea. Even before that, he worked as a Software Engineer with i2SoftTechnology, Dhaka, Bangladesh.

This book could not have been written without the support of my family and friends. In particular, my wife Saroar and my son Shadman deserve many thanks for their patience and encouragement throughout the past year.

I would like to give special thanks to Md. Mahedi Kaysar for co-authoring this book; without his contributions, the writing would have been impossible. I would also like to thank Jaynal Abedin for his valuable suggestions on different statistical and machine learning algorithms. Overall, I would like to dedicate this book to my respected teacher and research guru, Prof. Dr. Chowdhury Farhan Ahmed (Dept. of Computer Science & Engineering, University of Dhaka, Bangladesh), for his endless contributions to my life.

Further, I would like to thank the acquisition, content development, and technical editors of Packt Publishing (and others who were involved with this book) for their sincere cooperation and coordination. Additionally, without the work of numerous researchers who shared their expertise in publications, lectures, and source code, this book might not exist at all! Finally, I appreciate the efforts of the Apache Spark community and all those who have contributed to the Spark APIs, whose work ultimately brought machine learning to the masses.

Md. Mahedi Kaysar is a Software Engineer and Researcher at the Insight Centre for Data Analytics (the largest data analytics center in Ireland and the largest Semantic Web research institute in the world), Dublin City University (DCU), Ireland. Before joining the Insight Centre at DCU, he worked as a Software Engineer at the Insight Centre for Data Analytics, National University of Ireland, Galway, and at Samsung Electronics, Bangladesh.

He has more than 5 years of experience in research and development with a strong background in algorithms and data structures concentrating on C, Java, Scala, and Python. He has lots of experience in enterprise application development and big data analytics.

He obtained a BSc in Computer Science and Engineering from the Chittagong University of Engineering and Technology, Bangladesh. Now, he has started his postgraduate research in Distributed and Parallel Computing at the Dublin City University, Ireland.

His research interests include Distributed Computing, the Semantic Web, Linked Data, big data, the Internet of Everything, and machine learning. Moreover, he was involved in a research project in collaboration with Cisco Systems Inc. in the area of the Internet of Everything and Semantic Web technologies. His duties included developing an IoT-enabled meeting management system and a scalable stream-processing system, and designing and showcasing the project's use cases.

This book could not have been written without the support of my family and friends. In particular, my beloved wife Irin Akhtar deserves many thanks for her patience and encouragement throughout the past year.

I would like to give special thanks to Md. Rezaul Karim for co-authoring this book; without his contributions, the writing would have been impossible. I would also like to thank Packt Publishing and the team members who provided us with a lot of support during the writing of this book, which helped us complete it on time.

About the Reviewer

Muthukumar Subramanian (https://www.linkedin.com/in/muthu4all) is an iSMAC (IoT, Social, Mobility, Analytics, Cloud) technologist, passionate about creating solutions that yield better results. He is a new-technology advisor and startup incubator, implementing innovative approaches, products, and solutions by embracing the latest technologies, such as Spark Streaming, Spark graph processing, the Lambda Architecture in the big data ecosystem, NoSQL, Apache Flink, and Amazon Web Services. He has vast training and consulting experience with Apache Spark and MLlib, covering project implementations for many professionals and organizations across various domains and verticals.

www.Packtpub.com

For support files and downloads related to your book, please visit www.PacktPub.com.

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

https://www.packtpub.com/mapt

Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.

Why subscribe?

  • Fully searchable across every book published by Packt
  • Copy and paste, print, and bookmark content
  • On demand and accessible via a web browser

Preface

Machine learning, at its core, is concerned with algorithms that transform raw data into information and, in turn, into actionable intelligence. This makes machine learning well suited to the predictive analytics of Big Data. Without machine learning, it would be nearly impossible to keep up with these massive streams of information. Spark, a relatively new and emerging technology, provides big data engineers and data scientists with a powerful, unified engine that is both fast and easy to use.

This allows learners from numerous areas to solve their machine learning problems interactively and at much greater scale. The book is designed to enable data scientists, engineers, and researchers to develop and deploy their machine learning applications at scale, and to learn how to handle large data clusters in data-intensive environments in order to build powerful machine learning models.

The contents of the book follow a bottom-up approach: from Spark and ML basics, through exploring data with feature engineering, building scalable ML pipelines, and tuning and adapting them for new data and problem types, to finally moving from model building to deployment. To clarify further, the chapter outline is laid out so that a new reader with minimal knowledge of machine learning and programming with Spark will be able to follow the examples and move towards real-life machine learning problems and their solutions.

What this book covers

Chapter 1, Introduction to Data Analytics with Spark, covers an overview of Spark and its computing paradigm, and its installation, to help you get started. It briefly describes the main components of Spark and focuses on its new computing advancements with the Resilient Distributed Datasets (RDD) and Dataset. It then focuses on Spark's ecosystem of machine learning libraries. Installing, configuring, and packaging a simple machine learning application with Spark and Maven will be demonstrated before scaling up on Amazon EC2.

Chapter 2, Machine Learning Best Practices, provides a conceptual introduction to statistical machine learning (ML) techniques, aiming to take a newcomer from a minimal knowledge of machine learning all the way to being a knowledgeable practitioner in a few steps. The second part of the chapter focuses on providing some recommendations for choosing the right machine learning algorithm depending on the application type and requirements. It will then go through some best practices when applying large-scale machine learning pipelines.

Chapter 3, Understanding the Problem by Understanding the Data, covers in detail the Dataset and Resilient Distributed Dataset (RDD) APIs for working with structured data, aiming to provide a basic understanding of machine learning problems with the available data. By the end, you will be able to deal with basic and complex data manipulation with ease. Some comparisons will be made available with basic abstractions in Spark using RDD and Dataset-based data manipulation to show gains both in terms of programming and performance. In addition, we will guide you on the right track so that you will be able to use Spark to persist an RDD or data object in memory, allowing it to be reused efficiently across the parallel operations in the later stage.

Chapter 4, Extracting Knowledge through Feature Engineering, explains that knowing the features that should be used to create a predictive model is not only vital but also a difficult question that may require deep knowledge of the problem domain to be examined. It is possible to automatically select those features in data that are most useful or most relevant for the problem someone is working on. Considering these questions, this chapter covers feature engineering in detail, explaining the reasons to apply it along with some best practices in feature engineering.

In addition to this, theoretical descriptions and examples of feature extraction, transformation, and selection applied to large-scale machine learning techniques using both the Spark MLlib and Spark ML APIs will be discussed.

Chapter 5, Supervised and Unsupervised Learning by Examples, will provide the practical knowledge surrounding how to apply supervised and unsupervised techniques on the available data to new problems quickly and powerfully through some widely used examples based on the previous chapters. These examples will be demonstrated from the Spark perspective.

Chapter 6, Building Scalable Machine Learning Pipelines, explains that the ultimate goal of machine learning is to make a machine that can automatically build models from data without requiring tedious and time-consuming human involvement and interaction. Therefore, this chapter guides the reader through creating some practical and widely used machine learning pipelines and applications using Spark MLlib and Spark ML. Both APIs will be described in detail, and a baseline use case will be covered for both. Then we will focus on scaling up the ML application so that it can cope with increasing data loads.

Chapter 7, Tuning Machine Learning Models, shows that tuning an algorithm or machine learning application can simply be thought of as a process of optimizing the parameters that impact the model, in order to enable the algorithm to perform at its best. This chapter aims at guiding the reader through model tuning. It will cover the main techniques used to optimize an ML algorithm's performance, explained from both the MLlib and Spark ML perspectives. We will also show how to improve the performance of ML models by tuning several parameters, such as hyperparameters and grid search parameters with MLlib and Spark ML, and by applying hypothesis testing, random search parameter tuning, and cross-validation.

Chapter 8, Adapting Your Machine Learning Models, covers advanced machine learning techniques that will make algorithms adaptable to new data and problem types. It will mainly focus on batch/streaming architectures and on online learning algorithms using Spark streaming. The ultimate target is to bring dynamism to static machine learning models. Readers will also see how the machine learning algorithms learn incrementally from the data, that is, the models are updated each time the algorithm sees a new training instance.  

Chapter 9, Advanced Machine Learning with Streaming and Graph Data, explains how to apply machine learning techniques, with the help of Spark MLlib and Spark ML, to streaming and graph data, for example, in topic modeling. The reader will be able to use the available APIs to build real-time and predictive applications from streaming data sources such as Twitter. Through the Twitter data analysis, we will show how to perform large-scale social sentiment analysis. We will also show how to develop a large-scale movie recommendation system using Spark MLlib, which is an implicit part of social network analysis.

Chapter 10, Configuring and Working with External Libraries, guides the reader on using external libraries to expand their data analysis. Examples will be given for deploying third-party packages or libraries for machine learning applications with Spark core and ML/MLlib. We will also discuss how to compile and use external libraries with the core libraries of Spark for time series. As promised, we will also discuss how to configure SparkR to improve exploratory data manipulation and operations.

What you need for this book 

Software requirements:

The following software is required for Chapters 1-8 and 10: Spark 2.0.0 (or higher), Hadoop 2.7 (or higher), Java (JDK and JRE) 1.7+/1.8+, Scala 2.11.x (or higher), Python 2.6+/3.4+, R 3.1+, and RStudio 0.99.879 (or higher). Eclipse Mars or Luna (latest) can be used. Moreover, the Maven Eclipse plugin (2.9 or higher), the Maven compiler plugin for Eclipse (2.3.2 or higher), and the Maven assembly plugin for Eclipse (2.4.1 or higher) are required. Most importantly, reuse the pom.xml file provided with Packt's supplementary materials, adjusting the previously mentioned versions and APIs accordingly, and everything will work.

For Chapter 9, Advanced Machine Learning with Streaming and Graph Data, almost all the previously mentioned software is required; the exception is the Twitter data collection example, which will be shown with Spark 1.6.1. Therefore, Spark 1.6.1 or 1.6.2 is required, along with the Maven-friendly pom.xml file.

Operating system requirements:  

Spark can be run on a number of operating systems, including Windows, Mac OS, and Linux. However, Linux distributions are preferable (including Debian, Ubuntu, Fedora, RHEL, CentOS, and so on). To be more specific, for Ubuntu a 14.04/15.04 (LTS) 64-bit complete installation, or VMware Player 12 or VirtualBox, is recommended. For Windows, XP/7/8/10 is recommended, and for Mac OS X, version 10.4.7 or later.

Hardware requirements:

To work with Spark smoothly, a machine with at least a Core i3 or Core i5 processor is recommended. However, to get the best results, a Core i7 will achieve faster data processing and scalability, with at least 8 GB of RAM (recommended) for standalone mode, at least 32 GB of RAM for a single VM, and more for a cluster. You will also need enough storage to run heavy jobs (depending on the data size you will be handling), preferably at least 50 GB of free disk space (for standalone mode and for the SQL warehouse).

Who this book is for 

Python and R are two popular languages for data scientists due to the large number of modules or packages that are readily available to help them solve their data analytics problems. However, traditional uses of these tools are often limiting, as they process data on either a single machine or with main-memory-based approaches, where the movement of data becomes time-consuming, the analysis requires sampling, and moving from development to production environments requires extensive re-engineering. To address these issues, Spark provides data engineers and data scientists with a powerful and unified engine that is both fast and easy to use. This allows them to solve their machine learning problems interactively and at much greater scale.

Therefore, if you are an academic, researcher, data science engineer, or even a big data engineer working with large and complex data sets, and you want to scale your data processing pipelines and machine learning applications up more quickly, this book will be a suitable companion for the journey. Moreover, Spark provides many language choices, including Scala, Java, and Python. This will help you lift your machine learning applications on top of Spark and reshape them using any one of these programming languages.

At a minimum, you should be familiar with basic machine learning concepts. Knowledge of open source tools and frameworks such as Apache Spark and Hadoop-based MapReduce would be good, but is not essential. A solid background in statistics and computational mathematics is expected. In addition, knowledge of Scala, Python, and Java is advisable; intermediate-level programming experience will help you understand the discussions and examples demonstrated in this book.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or may have disliked. Reader feedback is important for us to develop titles that you really get the most out of.

To send us general feedback, simply send an e-mail to [email protected], and mention the book title via the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide on https://www.packtpub.com/books/info/packt/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

You can download the code files by following these steps:

1. Log in or register to our website using your e-mail address and password.
2. Hover the mouse pointer on the SUPPORT tab at the top.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box.
5. Select the book for which you're looking to download the code files.
6. Choose from the drop-down menu where you purchased this book from.
7. Click on Code Download.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

  • WinRAR / 7-Zip for Windows
  • Zipeg / iZip / UnRarX for Mac
  • 7-Zip / PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Large-Scale-Machine-Learning-with-Spark. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you would report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the errata submission form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded on our website, or added to any list of existing errata, under the Errata section of that title. Any existing errata can be viewed by selecting your title from http://www.packtpub.com/support.

Piracy

Piracy of copyright material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works, in any form, on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at [email protected] with a link to the suspected pirated material.

We appreciate your help in protecting our authors, and our ability to bring you valuable content.

Questions

You can contact us at [email protected] if you are having a problem with any aspect of the book, and we will do our best to address it.

Chapter 1. Introduction to Data Analytics with Spark

This chapter covers an overview of Apache Spark, its computing paradigm, and its installation, to help you get started. It briefly describes the main components of Spark and focuses on its new computing advancements. The Resilient Distributed Datasets (RDD) and Dataset abstractions are introduced as base knowledge for the rest of this book. The chapter then focuses on the Spark machine learning libraries. Installing and packaging a simple machine learning application with Spark and Maven will then be demonstrated before getting on board. In a nutshell, the following topics will be covered in this chapter:

  • Spark overview
  • New computing paradigm with Spark
  • Spark ecosystem
  • Spark machine learning libraries
  • Installing and getting started with Spark
  • Packaging your application with dependencies
  • Running a simple machine learning application

Spark overview

This section describes the basics of Spark (https://spark.apache.org/), followed by the issues with traditional parallel and distributed computing, how Spark evolved, and the new computing paradigm it brings to big data processing and analytics. In addition, we present some exciting features of Spark that readily attract big data engineers, data scientists, and researchers, including:

  • Simplicity of data processing and computation
  • Speed of computation
  • Scalability and throughput across large-scale datasets
  • Sophistication across diverse data types
  • Ease of cluster computing and deployment with different cluster managers
  • Working capabilities and support for various big data storage and sources
  • Diverse APIs written in widely used and emerging programming languages

Spark basics

Before praising Spark and its many virtues, a short overview is in order. Apache Spark is a fast, in-memory, general-purpose cluster computing framework for big data processing, with a set of sophisticated APIs for advanced data analytics. Unlike Hadoop-based MapReduce, which is suited only to batch jobs and lags behind in speed and ease of use, Spark can be considered a general execution engine suitable for applying advanced analytics on both static (batch) and real-time data:

Spark was originally developed at the University of California, Berkeley's AMPLab, based on Resilient Distributed Datasets (RDDs), which provide a fault-tolerant abstraction for in-memory cluster computing. Later on, Spark's code base was donated to the Apache Software Foundation, making it open source; since then, the open source community has taken care of it. Spark provides an interface for performing data analytics on entire clusters at scale, with implicit data parallelism and fault tolerance, through its high-level APIs written in Java, Scala, Python, and R.

In Spark 2.0.0, the following higher-level libraries (covering the most widely used data analysis algorithms) are implemented:

  • Spark SQL for querying and processing large-scale structured data
  • SparkR for statistical computing, providing distributed computing with the R programming language at scale
  • MLlib for machine learning (ML) applications, which is internally divided into two parts: MLlib for RDD-based machine learning application development and Spark ML for a high-level abstraction to develop complete computational data science and machine learning workflows
  • GraphX for large-scale graph data processing
  • Spark Streaming for handling large-scale real-time streaming data and providing a dynamic working environment for static machine learning
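
As a quick illustration of how these libraries sit behind a single entry point in Spark 2.0, the following minimal sketch creates a SparkSession and runs a Spark SQL query; the application name, master URL, and query are illustrative assumptions for a local run, not settings prescribed by this book:

```scala
import org.apache.spark.sql.SparkSession

object SparkSessionExample {
  def main(args: Array[String]): Unit = {
    // Create the unified Spark 2.0 entry point; "local[*]" uses all local cores
    val spark = SparkSession.builder()
      .appName("SparkSessionExample")
      .master("local[*]")
      .getOrCreate()

    // Run a Spark SQL query through the same session
    spark.range(1, 6).createOrReplaceTempView("numbers")
    spark.sql("SELECT SUM(id) AS total FROM numbers").show()

    spark.stop()
  }
}
```

The older RDD-based APIs remain reachable from the same session through spark.sparkContext.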

Since its first stable release, Spark has experienced dramatic and rapid development as well as wide adoption through active initiatives from a wide range of IT solution providers, open source communities, and researchers around the world. It has recently emerged as one of the most active and largest open source projects in the area of big data processing and cluster computing, not only because of its comprehensive adoption, but also because of the deployments and surveys by IT professionals, data scientists, and big data engineers worldwide. As Matei Zaharia, the founder of Spark and CTO of Databricks, is quoted on the Big Data Analytics News website at http://bigdataanalyticsnews.com/apache-spark-3-real-world-use-cases/:

It's an interesting thing. There hasn't been as much noise about it commercially, but the actual developer community votes with its feet and people are actually getting things done and working with the project.

Many tech giants, such as Yahoo, Baidu, Conviva, ClearStory, Hortonworks, Gartner, and Tencent, are already using Spark in production, while IBM, DataStax, Cloudera, and BlueData provide commercialized Spark distributions for the enterprise. These companies have enthusiastically deployed Spark applications at a massive scale, collectively processing multiple petabytes of data on clusters of up to 8,000 nodes, the largest known Spark cluster.

Beauties of Spark

Are you planning to develop a machine learning (ML) application? If so, you probably already have some data on which to perform preprocessing before you train a model, and finally, you will use the trained model to make predictions on new data to see how well it adapts. Is that all you need? We guess not, since you have to consider other parameters as well. Obviously, you will want your ML models to work well in terms of accuracy, execution time, memory usage, throughput, tuning, and adaptability. Wait! Still not done yet; what happens if you would like your application to handle large training and new datasets at scale? Or, as a data scientist, what if you could build your ML models as a multi-step journey from data ingestion through trial and error to production, by running the same machine learning code on a big cluster and on a personal computer without it breaking down? You can simply rely on Spark and close your eyes.

Spark has several advantages over other big data technologies such as MapReduce (you can refer to https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html for MapReduce tutorials and the research paper MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean et al., in Proc. of OSDI, 2004, to learn more) and Storm, a free and open source distributed real-time computation system (please refer to http://storm.apache.org/ for more on Storm-based distributed computing). First of all, Spark gives a comprehensive, unified engine for managing big data processing requirements across a variety of datasets, from text and tabular to graph data, and across sources of data (batch and real-time streaming) that are diverse in nature. As a user (data science engineer, academic, or developer), you are likely to benefit from Spark's rapid application development through simple and easy-to-understand APIs across batch, interactive, and real-time streaming applications.

Working and programming with Spark is easy and simple. Let us show you an example. Yahoo, one of the contributors to and an early adopter of Spark, implemented an ML algorithm in 120 lines of Scala code. With just 30 minutes of training on a large dataset of a hundred million records, the Scala ML algorithm was ready for business. Surprisingly, the same algorithm had previously been written in 15,000 lines of C++ (please refer to https://www.datanami.com/2014/03/06/apache_spark_3_real-world_use_cases/ for more). You can develop your applications using Java, Scala, R, or Python with a built-in set of over 100 high-level operators (mostly supported since Spark release 1.6.1) for transforming datasets, and gain familiarity with the DataFrame APIs for manipulating semi-structured, structured, and streaming data. In addition to the Map and Reduce operations, Spark supports SQL queries, streaming data, machine learning, and graph data processing. Moreover, Spark also provides an interactive shell written in Scala and Python for executing your code sequentially (SQL or R style).
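
To give a flavour of this brevity, here is a hypothetical word-count sketch in the Scala shell (the input path is a placeholder, and the snippet assumes the spark-shell's predefined SparkContext, sc); the same job takes far more code in plain MapReduce:

```scala
// Assumes a running spark-shell, where `sc` (SparkContext) is predefined;
// the input path "data/input.txt" is a placeholder.
val counts = sc.textFile("data/input.txt")
  .flatMap(line => line.split("\\s+"))  // split each line into words
  .map(word => (word, 1))               // pair each word with a count of 1
  .reduceByKey(_ + _)                   // sum the counts per word

counts.take(10).foreach(println)        // trigger execution and print a sample
```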

The main reason Spark is adopted so quickly comes down to two factors: speed and sophistication. Spark provides order-of-magnitude performance gains for many applications using a coarse-grained, immutable, and sophisticated data abstraction called Resilient Distributed Datasets, which are spread across the cluster and can be stored in memory or on disk. An RDD offers fault tolerance and is resilient in the sense that it cannot be changed once created, and it has the property of being recreated from its lineage if it is lost in the middle of a computation. Furthermore, an RDD is distributed automatically across the cluster by means of partitions, and it holds your data. You can also keep your data in memory using Spark's caching mechanism, which enables big data applications to execute up to 100 times faster than Hadoop-based MapReduce for iterative in-memory workloads, and up to 10 times faster for disk-based operation.
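
The caching mechanism mentioned above looks roughly as follows in practice; this is a minimal sketch assuming an existing SparkContext sc and a made-up input file, not code from the book:

```scala
import org.apache.spark.storage.StorageLevel

// Assumes `sc` is an existing SparkContext; "data/ratings.csv" is a placeholder path.
val ratings = sc.textFile("data/ratings.csv")
  .map(_.split(","))
  .persist(StorageLevel.MEMORY_ONLY)  // equivalent to cache(): keep partitions in memory

// Both passes reuse the in-memory partitions instead of re-reading the file from disk
val total = ratings.count()
val columnsPerRow = ratings.map(_.length).distinct().collect()
```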

Let's look at a surprising statistic about Spark and its computational power. Recently, Spark overtook Hadoop-based MapReduce by winning the 2014 Gray Sort Benchmark in the 100 TB category, an industry benchmark on how fast a system can sort 100 TB of data (1 trillion records) (please refer to http://spark.apache.org/news/spark-wins-daytona-gray-sort-100tb-benchmark.html and http://sortbenchmark.org/). It thereby became the open source engine for sorting at petabyte scale (please refer to https://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html for more information). In comparison, the previous world record set by Hadoop MapReduce used 2,100 machines and took 72 minutes, which means Spark sorted the same data three times faster using 10 times fewer machines. Moreover, you can combine multiple libraries seamlessly to develop large-scale machine learning and data analytics pipelines and execute jobs on various cluster managers, such as Hadoop YARN, Mesos, or in the cloud, accessing data storage and sources such as HDFS, Cassandra, HBase, Amazon S3, or even an RDBMS. The job can also be executed in standalone mode on a local PC or cluster, or even on AWS EC2. Therefore, deployment of a Spark application on a cluster is easy (we will show more on how to deploy a Spark application on a cluster later in this chapter).

The other beauties of Spark are that it is open source and platform independent. These are also among its greatest advantages: it is free to use, distribute, and modify, and applications can be developed on any platform. An open source project is also more secure, as the code is accessible to everyone and anyone can fix bugs as they are found. Consequently, Spark has evolved so rapidly that it has become the largest open source project concerning big data solutions, with 750+ contributors from 200+ organizations.

New computing paradigm with Spark

In this section, we will show a chronology of Spark that illustrates how it evolved and emerged as a revolution in big data processing and cluster computing. In addition, we will briefly describe the Spark ecosystem to explain the features and facilities of Spark in more detail.

Traditional distributed computing

The traditional data processing paradigm is commonly referred to as the client-server model, in which data is moved to the code. The database server (or simply the server) is mainly responsible for performing data operations and then returning the results to the client (application) program. However, as the number of tasks to be computed increases, the variety of operations and client devices also grows exponentially, and a progressively complex array of computing endpoints builds up behind the servers. So, to keep this type of computing model going, we need to scale up both the application (client) servers and the database servers in balance, to store and process the increased number of operations. Consequently, data propagation between nodes and data transfer back and forth across the network also increase drastically, and the network itself becomes a performance bottleneck. As a result, performance (in terms of both scalability and throughput) in this kind of computing paradigm decreases, as shown in the following diagram:

Figure 1: Traditional distributed processing in action.

After the successful completion of the human genome project in the life sciences, real-time IoT data, sensor data, data from mobile devices, and data from the Web have been creating a data deluge and contributing to big data, which has largely driven data-intensive computing. Data-intensive computing is now growing rapidly, and it requires an integrated infrastructure or computing paradigm, so that computational resources and data can be brought onto a common platform and analytics performed on top of it. The reasons are diverse, because big data is really huge in terms of complexity (volume, variety, and velocity) and, from the operational perspective, the four Ms (that is, move, manage, merge, and munge).

In addition, since we will be talking about large-scale machine learning applications in this book, we also need to consider some additional critical assessment parameters, such as validity, veracity, value, and visibility, to grow the business. Visibility is important because, suppose you have a big dataset of 1 PB in size; if there is no visibility, everything is a black hole. We will explain more about big data values in upcoming chapters.

It may not be feasible to store and process these large-scale and complex big datasets on a single system; therefore, they need to be partitioned and stored across multiple physical machines. Big datasets can be partitioned or distributed, but to process and analyze these rigorously complex datasets, both the database servers and the application servers might need to be scaled up to intensify the processing power at large scale. Again, the same performance bottleneck issues reappear, at their worst, in multiple dimensions, which calls for a new, more data-intensive big data processing and computing paradigm.

Moving code to the data

To overcome the issues mentioned previously, a new computing paradigm is desperately needed so that, instead of moving the data to the code/application, we move the code or application to the data and perform the data manipulation, processing, and associated computing at home (that is, where the data is stored). As you now understand the motivation and objective, this reversed programming model can be called move code to data and do parallel processing on a distributed system, which can be visualized in the following diagram:

Figure 2: New computing (move code to data and do parallel processing on distributed system).

To understand the workflows illustrated in Figure 2, we can envisage a new programming model described as follows:

1. Execution of a big data processing job is initiated by your application on your personal computer (let's call it the Driver Program), which coordinates the execution remotely across multiple computing nodes within a cluster or grid, or, more openly speaking, a cloud.
2. What you need now is to transfer your developed application/algorithm/code segments (which could be invoked using the command line or shell scripting, as a simple programming-language notation) to the computing/worker nodes (which have large storage, main memory, and processing capability). We can simply imagine that the data to be computed or manipulated is already stored on those computing nodes as partitions or blocks.
3. It is also understandable that the bulk data no longer needs to be transferred (uploaded/downloaded) to your driver program because of the network or computing bottleneck; instead, the driver only holds a data reference in its variables, which is basically an address (hostname/IP address with a port) locating the physical data stored on the computing nodes in a cluster (of course, bulk upload could be performed using other solutions, such as scalable provisioning, to be discussed in later chapters).
4. So what do the remote computing nodes have? They have the data as well as the code to perform the data computations and the necessary processing to materialize the output or modified data without leaving their home (more technically, the computing nodes).
5. Finally, upon your request, only the results are transferred across the network to your driver program for validation or other analytics, since there are many subsets of the original datasets.

It is worth noticing that by moving the code to the data, the computing structure has changed drastically. Most interestingly, the volume of data transferred across the network has been reduced significantly. The justification here is that you will be transferring only a small chunk of software code to the computing nodes and receiving only a small subset of the original data as results in return. This is the most important paradigm shift for big data processing that Spark brought to us with the concept of RDDs, Datasets, DataFrames, and other lucrative features, which mark a great revolution in the history of big data engineering and cluster computing. However, for brevity, in the next section we will only discuss the concept of RDDs; the other computing features will be discussed in upcoming sections.

RDD – a new computing paradigm

To understand the new computing paradigm, we need to understand the concept of Resilient Distributed Datasets (RDDs), through which Spark has implemented the data reference concept and, as a result, has been able to take data processing to scale with ease. The basic thing about an RDD is that it helps you treat your input datasets almost like any other data object. In other words, it brings the diversity of input data types that you greatly missed in the Hadoop-based MapReduce framework.

An RDD provides fault tolerance in a resilient way: it cannot be changed once created, and the Spark engine will retry an operation once it has failed. It is distributed because, once created and partitioned, RDDs are automatically distributed across the cluster by means of partitions. RDDs let you do more with your input datasets, since they can be transformed into other forms rapidly and robustly. At the same time, RDDs can also be materialized through an action and shared across your applications when they are logically co-related or computationally homogeneous. This is achievable because an RDD is part of Spark's general-purpose execution engine, designed for massive parallelism, so it can be applied to virtually any type of dataset.

However, to create an RDD and perform related operations on your inputs, the Spark engine requires you to draw a clear boundary between the data pointer (that is, the reference) and the input data itself. Basically, your driver program does not hold the data, but only a reference to the data, which is actually located on the remote computing nodes in a cluster.

To make data processing faster and easier, Spark supports two types of operations that can be performed on RDDs: transformations and actions (please refer to Figure 3). A transformation operation basically creates a new dataset from an existing one. An action, on the other hand, materializes a value to the driver program after a successful computation on the input datasets on the remote servers (the computing nodes, to be more exact).

The execution initiated by the driver program builds up a graph in the Directed Acyclic Graph (DAG) style, where nodes represent RDDs and the transformation operations are represented by the edges. However, the execution itself does not start on the computing nodes of a Spark cluster until an action operation is performed. Before starting the operation, the driver program sends the execution graph (which signifies the operation pipeline or workflow for the data computation) and the code block (as a domain-specific script or file) to the cluster, and each worker/computing node receives a copy from the cluster manager node:

Figure 3: RDD in action (transformation and action operation).

Before proceeding to the next section, we urge you to learn about transformation and action operations in more detail; we will discuss both thoroughly in Chapter 3, Understanding the Problem by Understanding the Data. Spark currently supports two kinds of transformations. The first is the narrow transformation, where no data shuffling is necessary; typical narrow transformations are performed using filter(), sample(), map(), flatMap(), mapPartitions(), and similar methods. The second is the wide transformation, which is required when records from multiple partitions must be brought together on a common node, that is, when data shuffling is involved. Wide transformation operations include groupByKey(), reduceByKey(), union(), intersection(), join(), and so on.
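The following minimal sketch (again assuming spark-shell and illustrative data) contrasts narrow transformations, which need no shuffle, with a wide transformation that does:

    // Illustrative input data
    val words = sc.parallelize(Seq("spark", "rdd", "spark", "action", "rdd", "spark"))

    // Narrow transformations: each output partition depends on a single
    // input partition, so no shuffle is required
    val pairs    = words.map(word => (word, 1))
    val filtered = pairs.filter { case (word, _) => word.length > 3 }

    // Wide transformation: values with the same key must be brought together,
    // so data is shuffled across partitions/nodes
    val counts = filtered.reduceByKey(_ + _)

    counts.collect().foreach(println)   // for example, (spark,3) and (action,1)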

An action returns the final result of the RDD computations to the driver program by triggering execution of the transformation lineage as a Directed Acyclic Graph (DAG). Alternatively, the materialized results, including intermediate transformation results, can be written to storage instead of being returned. Common actions include: first(), take(), reduce(), collect(), count(), saveAsTextFile(), saveAsSequenceFile(), and so on. At this point we believe you have grasped the basic operations on top of RDDs, so we can now describe an RDD and the related programs in a natural way. A typical RDD programming model that Spark provides can be described as follows:

1. From an environment variable, the Spark Context (the Spark shell or PySpark provides you with a Spark Context, or you can create your own; this will be described later in this chapter) creates an initial data reference, an RDD object.
2. Transform the initial RDD to create more RDD objects following the functional programming style (to be discussed later on).
3. Send the code/algorithms/applications from the driver program to the cluster manager node; the cluster manager then provides a copy to each computing node.
4. Each computing node holds a reference to the RDD partitions it owns (the driver program also holds a data reference). The computing nodes may also receive the input dataset from the cluster manager.
5. After a transformation (narrow or wide), the result is a brand new RDD, since the original one is never mutated. Finally, the RDD (more specifically, the data reference) is materialized through an action, for example by dumping the RDD to storage.
6. The driver program can request a chunk of the results from the computing nodes for analysis or visualization.
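The following minimal end-to-end sketch walks through these steps, written as a standalone Scala application; the application name and the HDFS input/output paths are hypothetical placeholders:

    import org.apache.spark.{SparkConf, SparkContext}

    // Step 1: obtain a SparkContext
    val conf = new SparkConf().setAppName("RDDProgrammingModel").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    // Initial RDD: a data reference, not the data itself
    val lines  = sc.textFile("hdfs://namenode:9000/input/sample.txt")

    // Step 2: transform into new RDDs (narrow, then wide transformation)
    val tokens = lines.flatMap(_.split("\\s+"))
    val counts = tokens.map(word => (word, 1)).reduceByKey(_ + _)

    // Actions: materialize to storage and bring a small chunk back to the driver
    counts.saveAsTextFile("hdfs://namenode:9000/output/word-counts")
    val top = counts.take(10)

    sc.stop()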

Wait! So far everything has gone smoothly: we ship the application code to the computing nodes in the cluster. Still, you will have to upload or send the input datasets to the cluster so that they can be distributed among the computing nodes, and even during this bulk upload the data has to travel across the network. We also argue that the size of the application code and of the results is negligible or trivial. Another obstacle is that, if we want Spark to compute at scale, data objects may need to be merged from multiple partitions first. That means we will need to shuffle data among the worker/computing nodes, which is usually triggered by transformations such as partitionBy(), intersection(), and join().

So, frankly speaking, the data transfer has not been eliminated completely. These operations, especially the bulk upload/download, introduce overheads of their own; how Spark deals with the resulting burden is illustrated as follows:

Figure 4: RDD in action (the caching mechanism).

Well, it is true that we have to live with these burdens. However, the situation can be tackled, or at least reduced significantly, using Spark's caching mechanism. Imagine you are going to perform an action multiple times on the same RDD, which has a long lineage; this causes an increase in execution time as well as data movement inside a computing node. You can remove (or at least reduce) this redundancy with Spark's caching mechanism (Figure 4), which stores the computed result of the RDD in memory and thereby eliminates the recurrent computation each time. When you cache an RDD, its partitions are loaded into the main memory of the nodes that hold them instead of being read from disk (if there is not enough space in memory, the disk is used instead). This technique enables big data applications on Spark clusters to significantly outperform MapReduce across rounds of parallel processing. We will discuss Spark data manipulation and other techniques in more detail in Chapter 3, Understanding the Problem by Understanding the Data.
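The following minimal sketch (assuming spark-shell and a hypothetical log file path) shows how caching avoids re-evaluating the lineage for every action on the same RDD:

    // Build an RDD with a lineage: read the file, then filter it
    val logs   = sc.textFile("hdfs://namenode:9000/data/access.log")
    val errors = logs.filter(_.contains("ERROR"))

    // Keep the computed partitions in memory so the lineage is not
    // re-evaluated by every subsequent action on this RDD
    errors.cache()   // shorthand for persist(StorageLevel.MEMORY_ONLY)

    val total     = errors.count()                              // first action: computes and caches
    val loginErrs = errors.filter(_.contains("login")).count()  // reuses the cached partitions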

Spark ecosystem

To provide further enhancements and additional big data processing capabilities, Spark can be configured to run on top of existing Hadoop-based clusters. As already stated, Hadoop provides the Hadoop Distributed File System (HDFS) for efficient and cheap storage of large-scale data, but MapReduce performs its computation entirely on disk. Another limitation of MapReduce is that only simple computations can be expressed, using a high-latency batch model over static data. The core APIs of Spark, on the other hand, are available in Java, Scala, Python, and R. On top of this more general and powerful programming model, Spark also provides several libraries, as part of the Spark ecosystem, that add capabilities in the areas of big data analytics, processing, and machine learning. The Spark ecosystem consists of the following components, as shown in Figure 5:

Figure 5: Spark ecosystem (as of Spark 1.6.1).

As we have already stated, it is very much possible to combine these APIs seamlessly to develop large-scale machine learning and data analytics applications. Moreover, jobs can be executed on various cluster managers such as Hadoop YARN, Mesos, or Spark standalone, or in the cloud, while accessing data storage and sources such as HDFS, Cassandra, HBase, Amazon S3, or even an RDBMS.
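The following minimal sketch illustrates this flexibility; the master URL, application name, and HDFS path are hypothetical placeholders, and the same code can target a different cluster manager simply by changing the master URL:

    import org.apache.spark.{SparkConf, SparkContext}

    // The master URL selects the cluster manager: "local[*]" for local mode,
    // "yarn-client"/"yarn" for Hadoop YARN, "mesos://host:port" for Mesos,
    // or "spark://host:port" for a standalone Spark cluster
    val conf = new SparkConf()
      .setAppName("EcosystemDemo")
      .setMaster("local[*]")   // replace with your cluster manager's master URL

    val sc = new SparkContext(conf)

    // Data sources are addressed via URI schemes, for example HDFS here
    val logs = sc.textFile("hdfs://namenode:9000/data/logs.txt")
    println(logs.count())

    sc.stop()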

Nevertheless, Spark is surrounded by other features and APIs. For example, Cisco recently announced a $150M investment toward Cisco Spark Hybrid Services (http://www.cisco.com/c/en/us/solutions/collaboration/cloud-collaboration/index.html), whose open APIs could boost its popularity with developers (highly secure collaboration and connecting smartphone systems to the cloud). Beyond this, Spark has recently been integrated with Tachyon (http://ampcamp.berkeley.edu/5/exercises/tachyon.html), a distributed in-memory storage system that keeps data economically in memory to further improve Spark's performance.

Spark core engine

Spark itself is written in Scala, a functional as well as object-oriented programming language that runs on top of the JVM. Moreover, as shown in Figure 5, the Spark ecosystem is built on top of a general core execution engine that exposes extensible APIs implemented in different languages. The higher-level layers use the Spark core engine as a general-purpose job execution engine and provide all other functionality on top of it. As already mentioned, Spark Core runs on the Java Virtual Machine (JVM), and the high-level APIs (that is, Spark MLlib, SparkR, Spark SQL, Dataset, DataFrame, Spark Streaming, and GraphX) use the core at execution time.

Spark has brought in-memory computing into great visibility. This concept enables the Spark core engine to deliver speed through a generalized execution model that supports the development of diverse applications.

The low-level implementations of general-purpose data computing and machine learning algorithms, together with their Java, Scala, R, and Python APIs, make big data application development easy. The Spark framework is built on Scala, so developing ML applications in Scala gives access to the latest features that might not initially be available in the other Spark languages. However, that is not a big problem; the open source community takes care of the needs of developers around the globe. Therefore, if you need a particular machine learning algorithm and you want to add it to the Spark library, you can contribute it to the Spark community. The source code of Spark is openly available on GitHub at https://github.com/apache/spark as the Apache Spark mirror. You can submit a pull request, and the open source community will review your changes or algorithm before adding it to the master branch. For more information, please check the Spark Confluence site at https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark.

Python has long been a great arsenal for data scientists, and its role in Spark is no different. Python has some excellent libraries for data analysis and processing; however, it is comparatively slower than Scala. R, on the other hand, offers a rich environment for data manipulation, data pre-processing, graphical analysis, machine learning, and statistical analysis, which can help to increase a developer's productivity. Java is definitely a good choice for developers coming from a Java and Hadoop background. However, Java has a similar problem to Python, since it is also slower than Scala.

A recent survey presented on the Databricks website at http://go.databricks.com/2015-spark-survey