Discover everything you need to build robust machine learning applications with Spark 2.0
This book is for data science engineers and scientists who work with large and complex data sets. You should be familiar with the basics of machine learning concepts, statistics, and computational mathematics. Knowledge of Scala and Java is advisable.
Data processing, implementing the related algorithms, tuning, scaling up, and finally deploying are some of the crucial steps in optimizing any application.
Spark is capable of handling large-scale batch and streaming data, figuring out when to cache data in memory and processing it up to 100 times faster than Hadoop-based MapReduce. This means predictive analytics can be applied to both streaming and batch data to develop complete machine learning (ML) applications much more quickly, making Spark an ideal candidate for large, data-intensive applications.
This book focuses on design engineering and scalable solutions using ML with Spark. First, you will learn how to install Spark, with all the new features from the latest Spark 2.0 release. Moving on, you'll explore important concepts such as advanced feature engineering with RDDs and Datasets. After studying application development and deployment, you will see how to use external libraries with Spark.
In summary, you will be able to develop complete and personalized ML applications, from data collection and model building through tuning and scaling up to deployment on a cluster or in the cloud.
This book takes a practical approach where all the topics explained are demonstrated with the help of real-world use cases.
Page count: 526
Copyright © 2016 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: October 2016
Production reference: 1201016
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.
ISBN 978-1-78588-874-8
www.packtpub.com
Authors
Md. Rezaul Karim
Md. Mahedi Kaysar
Copy Editor
Safis Editing
Reviewer
Muthukumar Subramanian
Project Coordinator
Shweta H Birwatkar
Commissioning Editor
Akram Hussain
Proofreader
Safis Editing
Acquisition Editor
Lester Frias
Indexer
Aishwarya Gangawane
Content Development Editor
Amrita Noronha
Graphics
Disha Haria
Technical Editor
Akash Patel
Production Coordinator
Arvindkumar Gupta
Md. Rezaul Karim has more than 8 years of experience in research and development, with a solid knowledge of algorithms and data structures, focusing on C, C++, Java, R, and Python, as well as big data technologies such as Spark, Kafka, DC/OS, Docker, Mesos, Hadoop, and MapReduce.
He was first enchanted by machine learning back in 2010, while taking a postgraduate course on Advanced Artificial Intelligence in which he applied a combination of Hadoop-based MapReduce and machine learning to market basket analysis on large-scale, business-oriented transactional databases. Consequently, his research interests include machine learning, data mining, the Semantic Web, big data, and bioinformatics. He has published more than 30 well-cited research papers in renowned peer-reviewed international journals and conferences in the areas of data mining, machine learning, and bioinformatics.
He is a Software Engineer and Researcher currently working at the Insight Centre for Data Analytics, Ireland (the largest data analytics center in Ireland and the largest Semantic Web research institute in the world) as a PhD Researcher. He is also a PhD candidate at the National University of Ireland, Galway. He holds an ME (Master of Engineering) degree in Computer Engineering from Kyung Hee University, Korea, majoring in data mining and knowledge discovery, and a BS (Bachelor of Science) degree in Computer Science from the University of Dhaka, Bangladesh.
Before joining the Insight Center for Data Analytics, he had been working as a Lead Software Engineer with Samsung Electronics, where he worked with the distributed Samsung R&D centers across the world, including Korea, India, Vietnam, Turkey, UAE, Brazil, and Bangladesh. Before that, he worked as a Graduate Research Assistant in the Database Lab at Kyung Hee University, Korea, while working towards his Master's degree. He also worked as an R&D Engineer with BMTech21 Worldwide, Korea. Even before that, he worked as a Software Engineer with i2SoftTechnology, Dhaka, Bangladesh.
This book could not have been written without the support of my family and friends. In particular, my wife Saroar and my son Shadman deserve many thanks for their patience and encouragement throughout the past year.
I would like to give special thanks to Md. Mahedi Kaysar for co-authoring this book; without his contributions, the writing would have been impossible. I would also like to thank Jaynal Abedin for his valuable suggestions on different statistical and machine learning algorithms. Overall, I would like to dedicate this book to my respected teacher and research guru, Prof. Dr. Chowdhury Farhan Ahmed (Dept. of Computer Science & Engineering, University of Dhaka, Bangladesh), for his endless contributions to my life.
Further, I would like to thank the acquisition, content development, and technical editors of Packt Publishing (and everyone else involved with this title) for their sincere cooperation and coordination. Additionally, without the work of numerous researchers who shared their expertise in publications, lectures, and source code, this book might not exist at all! Finally, I appreciate the efforts of the Apache Spark community and all those who have contributed to the Spark APIs, whose work ultimately brought machine learning to the masses.
Md. Mahedi Kaysar is a Software Engineer and Researcher at the Insight Centre for Data Analytics (the largest data analytics center in Ireland and the largest Semantic Web research institute in the world), Dublin City University (DCU), Ireland. Before joining the Insight Centre at DCU, he worked as a Software Engineer at the Insight Centre for Data Analytics, National University of Ireland, Galway, and at Samsung Electronics, Bangladesh.
He has more than 5 years of experience in research and development with a strong background in algorithms and data structures concentrating on C, Java, Scala, and Python. He has lots of experience in enterprise application development and big data analytics.
He obtained a BSc in Computer Science and Engineering from the Chittagong University of Engineering and Technology, Bangladesh. He has now started his postgraduate research in Distributed and Parallel Computing at Dublin City University, Ireland.
His research interests include Distributed Computing, Semantic Web, Linked Data, big data, Internet of Everything, and machine learning. Moreover, he was involved in a research project in collaboration with CISCO Systems Inc. in the area of Internet of Everything and Semantic Web Technologies. His duties were to develop an IoT-enabled meeting management system, a scalable system for stream processing, designing, and showcasing the use cases of a project.
This book could not have been written without the support of my family and friends. In particular, my beloved wife Irin Akhtar deserves many thanks for her patience and encouragement throughout the past year.
I would like to give special thanks to Md. Rezaul Karim for co-authoring this book; without his contributions, the writing would have been impossible. I would also like to thank Packt Publishing and the team members who provided us with a great deal of support during the writing, which helped us complete the book on time.
Muthukumar Subramanian (https://www.linkedin.com/in/muthu4all) is an iSMAC (IoT, Social, Mobility, Analytics, Cloud) technologist, passionate about creating solutions that yield better results. He is a new technology advisor and startup incubator, implementing innovative approaches, products, and solutions by embracing the latest technologies, such as Spark Streaming, Spark graph processing, the Lambda Architecture in the big data ecosystem, NoSQL, Apache Flink, and Amazon Web Services. He has vast training and consulting experience with Apache Spark and MLlib across various project implementations for professionals and organizations from many domains and verticals.
For support files and downloads related to your book, please visit www.PacktPub.com.
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.
https://www.packtpub.com/mapt
Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.
Machine learning, at its core, is concerned with algorithms that transform raw data into information and then into actionable intelligence. This makes machine learning well suited to the predictive analytics of big data; without machine learning, it would be nearly impossible to keep up with these massive streams of information. Spark, a relatively new and emerging technology, provides big data engineers and data scientists with a powerful, unified engine that is both fast and easy to use.
This allows practitioners from numerous areas to solve their machine learning problems interactively and at much greater scale. The book is designed to enable data scientists, engineers, and researchers to develop and deploy their machine learning applications at scale, so that they can learn how to handle large data clusters in data-intensive environments and build powerful machine learning models.
The contents of the book follow a bottom-up approach: from Spark and ML basics, through exploring data with feature engineering, building scalable ML pipelines, and tuning and adapting them for new data and problem types, to model building and deployment. To clarify, we have outlined the chapters so that a new reader with minimal knowledge of machine learning and Spark programming can follow the examples and progress to some real-life machine learning problems and their solutions.
Chapter 1, Introduction to Data Analytics with Spark, covers an overview of Spark, its computing paradigm, and its installation, and helps you get started with Spark. It briefly describes the main components of Spark and focuses on its new computing advancements with the Resilient Distributed Datasets (RDD) and Dataset. It then focuses on Spark's ecosystem of machine learning libraries. Installing, configuring, and packaging a simple machine learning application with Spark and Maven will be demonstrated before scaling up on Amazon EC2.
Chapter 2, Machine Learning Best Practices, provides a conceptual introduction to statistical machine learning (ML) techniques, aiming to take a newcomer from a minimal knowledge of machine learning all the way to being a knowledgeable practitioner in a few steps. The second part of the chapter focuses on providing some recommendations for choosing the right machine learning algorithm depending on the application type and requirements. It will then go through some best practices for applying large-scale machine learning pipelines.
Chapter 3, Understanding the Problem by Understanding the Data, covers in detail the Dataset and Resilient Distributed Dataset (RDD) APIs for working with structured data, aiming to provide a basic understanding of machine learning problems with the available data. By the end, you will be able to handle basic and complex data manipulation with ease. Comparisons will be made between RDD-based and Dataset-based data manipulation using Spark's basic abstractions, to show gains in terms of both programming and performance. In addition, we will guide you on the right track so that you will be able to use Spark to persist an RDD or data object in memory, allowing it to be reused efficiently across parallel operations at a later stage.
Chapter 4, Extracting Knowledge through Feature Engineering, explains that knowing the features that should be used to create a predictive model is not only vital but also a difficult question that may require deep knowledge of the problem domain to be examined. It is possible to automatically select those features in data that are most useful or most relevant for the problem someone is working on. Considering these questions, this chapter covers feature engineering in detail, explaining the reasons to apply it along with some best practices in feature engineering.
In addition to this, theoretical descriptions and examples of feature extraction, transformation, and selection applied to large-scale machine learning techniques using both the Spark MLlib and Spark ML APIs will be discussed.
Chapter 5, Supervised and Unsupervised Learning by Examples, will provide the practical knowledge surrounding how to apply supervised and unsupervised techniques on the available data to new problems quickly and powerfully through some widely used examples based on the previous chapters. These examples will be demonstrated from the Spark perspective.
Chapter 6, Building Scalable Machine Learning Pipelines, explains that the ultimate goal of machine learning is to make a machine that can automatically build models from data without requiring tedious and time-consuming human involvement and interaction. Therefore, this chapter guides readers through creating some practical and widely used machine learning pipelines and applications using Spark MLlib and Spark ML. Both APIs will be described in detail, and a baseline use case will be covered for each. Then we will focus on scaling up the ML application so that it can cope with increasing data loads.
Chapter 7, Tuning Machine Learning Models, shows that tuning an algorithm or machine learning application can be simply thought of as a process by which one goes through and optimizes the parameters that impact the model in order to enable the algorithm to perform to its best. This chapter aims at guiding the reader through model tuning. It will cover the main techniques used to optimize an ML algorithm’s performance. Techniques will be explained both from the MLlib and Spark ML perspective. We will also show how to improve the performance of the ML models by tuning several parameters, such as hyperparameters, grid search parameters with MLlib and Spark ML, hypothesis testing, Random search parameter tuning, and Cross-validation.
Chapter 8, Adapting Your Machine Learning Models, covers advanced machine learning techniques that will make algorithms adaptable to new data and problem types. It will mainly focus on batch/streaming architectures and on online learning algorithms using Spark streaming. The ultimate target is to bring dynamism to static machine learning models. Readers will also see how the machine learning algorithms learn incrementally from the data, that is, the models are updated each time the algorithm sees a new training instance.
Chapter 9, Advanced Machine Learning with Streaming and Graph Data, explains how to apply machine learning techniques, with the help of Spark MLlib and Spark ML, to streaming and graph data, for example, in topic modeling. Readers will be able to use the available APIs to build real-time and predictive applications from streaming data sources such as Twitter. Through Twitter data analysis, we will show how to perform large-scale social sentiment analysis. We will also show how to develop a large-scale movie recommendation system using Spark MLlib, an implicit part of social network analysis.
Chapter 10, Configuring and Working with External Libraries, guides the reader on using external libraries to expand their data analysis. Examples will be given for deploying third-party packages or libraries for machine learning applications with Spark core and ML/MLlib. We will also discuss how to compile and use external libraries with the core libraries of Spark for time series. As promised, we will also discuss how to configure SparkR to improve exploratory data manipulation and operations.
Software requirements:
The following software is required for Chapters 1 to 8 and 10: Spark 2.0.0 (or higher), Hadoop 2.7 (or higher), Java (JDK and JRE) 1.7+/1.8+, Scala 2.11.x (or higher), Python 2.6+/3.4+, R 3.1+, and RStudio 0.99.879 (or higher). Eclipse Mars or Luna (the latest version) can be used. Moreover, the Maven Eclipse plugin (2.9 or higher), the Maven compiler plugin for Eclipse (2.3.2 or higher), and the Maven assembly plugin for Eclipse (2.4.1 or higher) are required. Most importantly, reuse the pom.xml file provided with Packt's supplements, change the previously mentioned versions and APIs accordingly, and everything should work.
Chapter 9, Advanced Machine Learning with Streaming and Graph Data, requires almost all of the software mentioned previously, except for the Twitter data collection example, which will be shown in Spark 1.6.1. For that example, Spark 1.6.1 or 1.6.2 is therefore required, along with the Maven-friendly pom.xml file.
Operating system requirements:
Spark can run on a number of operating systems, including Windows, Mac OS, and Linux. However, Linux distributions are preferable (including Debian, Ubuntu, Fedora, RHEL, CentOS, and so on). To be more specific, for Ubuntu it is recommended to have a 14.04/15.04 (LTS) 64-bit complete installation, or VMWare Player 12 or VirtualBox. For Windows, XP/7/8/10 is recommended, and for Mac OS X, 10.4.7 or later.
Hardware requirements:
To work with Spark smoothly, a machine with at least a Core i3 or Core i5 processor is recommended. However, to get the best results, a Core i7 processor will achieve faster data processing and scalability, with at least 8 GB RAM (recommended) for standalone mode and at least 32 GB RAM for a single VM, or more for a cluster. You will also need enough storage to run heavy jobs (depending on the data size you will be handling), preferably at least 50 GB of free disk space (for standalone mode and for the SQL warehouse).
Python and R are two popular languages for data scientists due to the large number of modules or packages that are readily available to help them solve their data analytics problems. However, traditional uses of these tools are often limiting, as they process data either on a single machine or with main-memory-based approaches, where the movement of data becomes time-consuming, the analysis requires sampling, and moving from development to production environments requires extensive re-engineering. To address these issues, Spark provides data engineers and data scientists with a powerful, unified engine that is both fast and easy to use. This allows them to solve their machine learning problems interactively and at much greater scale.
Therefore, if you are an academic, researcher, data science engineer, or even a big data engineer working with large and complex data sets, and you want to scale up your data processing pipelines and machine learning applications, this book will be a suitable companion for the journey. Moreover, Spark provides many language choices, including Scala, Java, and Python. This will definitely help you to lift your machine learning applications on top of Spark and reshape them using any of these programming languages.
You should be familiar with at least the basics of machine learning concepts. Knowledge of open source tools and frameworks such as Apache Spark and Hadoop-based MapReduce would be good, but is not essential. A solid background in statistics and computational mathematics is expected. In addition, knowledge of Scala, Python, and Java is advisable, and intermediate programming experience will help you follow the discussions and examples demonstrated in this book.
Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or may have disliked. Reader feedback is important for us to develop titles that you really get the most out of.
To send us general feedback, simply send an e-mail to [email protected], and mention the book title via the subject of your message.
If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide on https://www.packtpub.com/books/info/packt/authors.
Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.
You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
You can download the code files by following these steps:
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Large-Scale-Machine-Learning-with-Spark. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you would report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the errata submission form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded on our website, or added to any list of existing errata, under the Errata section of that title. Any existing errata can be viewed by selecting your title from http://www.packtpub.com/support.
Piracy of copyright material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works, in any form, on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.
Please contact us at [email protected] with a link to the suspected pirated material.
We appreciate your help in protecting our authors, and our ability to bring you valuable content.
You can contact us at [email protected] if you are having a problem with any aspect of the book, and we will do our best to address it.
This chapter provides an overview of Apache Spark, its computing paradigm, and its installation, to get you started. It will briefly describe the main components of Spark and focus on its new computing advancements. A description of the Resilient Distributed Datasets (RDD) and Dataset will be provided as base knowledge for the rest of this book. It will then focus on the Spark machine learning libraries. Installing and packaging a simple machine learning application with Spark and Maven will then be demonstrated before we get on board. In a nutshell, the following topics will be covered in this chapter:
This section describes the basics of Spark (https://spark.apache.org/), followed by the problems with traditional parallel and distributed computing, how Spark evolved, and how it brought a new computing paradigm to big data processing and analytics. In addition, we also present some exciting features of Spark that easily attract big data engineers, data scientists, and researchers, including:
Before praising Spark and its many virtues, a short overview is in order. Apache Spark is a fast, in-memory, general-purpose big data processing and cluster computing framework with a set of sophisticated APIs for advanced data analytics. Unlike Hadoop-based MapReduce, which is suited only to batch jobs, Spark can be considered a general execution engine, suitable for applying advanced analytics on both static (batch) and real-time data:
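As a minimal sketch of this general-purpose engine (the application name and the local master URL below are arbitrary illustrative choices, not taken from the book), Spark 2.0 exposes a single entry point, SparkSession, for batch-style work:

```scala
import org.apache.spark.sql.SparkSession

object SparkOverviewSketch {
  def main(args: Array[String]): Unit = {
    // SparkSession is the unified entry point introduced in Spark 2.0
    val spark = SparkSession.builder()
      .appName("OverviewSketch")  // hypothetical application name
      .master("local[*]")         // local mode; use a cluster URL in production
      .getOrCreate()

    import spark.implicits._      // brings in the encoders needed for Datasets

    // A tiny batch computation on a Dataset
    val ds = spark.createDataset(Seq(1, 2, 3, 4, 5))
    println(ds.reduce(_ + _))     // prints 15

    spark.stop()
  }
}
```

The same session also serves SQL queries and, through related APIs, streaming jobs, which is what makes the engine general purpose rather than batch-only.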
In Spark 2.0.0, several high-level libraries (implementing the most widely used data analysis algorithms) are included:
Since its first stable release, Spark has experienced dramatic and rapid development, as well as wide adoption, through active initiatives from a wide range of IT solution providers, open source communities, and researchers around the world. Recently, it has emerged as one of the most active and largest open source projects in the area of big data processing and cluster computing, not only in terms of adoption, but also in deployments and surveys by IT professionals, data scientists, and big data engineers worldwide. As Matei Zaharia, the creator of Spark and CTO of Databricks, is quoted on the Big Data Analytics News website at http://bigdataanalyticsnews.com/apache-spark-3-real-world-use-cases/:
It's an interesting thing. There hasn't been as much noise about it commercially, but the actual developer community votes with its feet and people are actually getting things done and working with the project.
Many tech giants, such as Yahoo, Baidu, Conviva, ClearStory, Hortonworks, Gartner, and Tencent, are already using Spark in production, while IBM, DataStax, Cloudera, and BlueData provide commercialized Spark distributions for the enterprise. These companies have enthusiastically deployed Spark applications at massive scale, collectively processing multiple petabytes of data on clusters of up to 8,000 nodes, the largest known Spark cluster size.
Are you planning to develop a machine learning (ML) application? If so, you probably already have some data on which to perform preprocessing before you train a model, and finally you will use the trained model to make predictions on new data to see how well it adapts. Is that all you need? We guess not, since you have to consider other parameters as well. Obviously, you will want your ML models to work well in terms of accuracy, execution time, memory usage, throughput, tuning, and adaptability. Wait! Still not done: what happens if you would like your application to handle large training and new datasets at scale? Or, as a data scientist, what if you could build your ML models as a multi-step journey from data incorporation through trial and error to production, by running the same machine learning code on a big cluster and on a personal computer without it breaking down? You can simply rely on Spark and close your eyes.
Spark has several advantages over other big data technologies such as MapReduce (you can refer to https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html for MapReduce tutorials, and the research paper MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean et al., in proc. of OSDI, 2004, to learn more) and Storm, a free and open source distributed real-time computation system (please refer to http://storm.apache.org/ for more on Storm-based distributed computing). First of all, Spark gives a comprehensive, unified engine for managing big data processing requirements across a variety of datasets (from text and tabular data to graph data) and data sources (batch as well as real-time streaming data) that are diverse in nature. As a user (data science engineer, academic, or developer), you will likely benefit from Spark's rapid application development through simple and easy-to-understand APIs across batch, interactive, and real-time streaming applications.
Working and programming with Spark is easy and simple. Let us show you an example. Yahoo, one of the contributors to and an early adopter of Spark, implemented an ML algorithm in 120 lines of Scala code. With just 30 minutes of training on a large dataset with a hundred million records, the Scala ML algorithm was ready for business. Surprisingly, the same algorithm had previously been written in 15,000 lines of C++ code (please refer to https://www.datanami.com/2014/03/06/apache_spark_3_real-world_use_cases/ for more). You can develop your applications using Java, Scala, R, or Python, with a built-in set of over 100 high-level operators (mostly supported since Spark release 1.6.1) for transforming datasets, and become familiar with the DataFrame APIs for manipulating semi-structured, structured, and streaming data. In addition to the Map and Reduce operations, Spark supports SQL queries, streaming data, machine learning, and graph data processing. Moreover, Spark provides interactive shells in Scala and Python for executing your code interactively (in an SQL- or R-like style).
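To give a flavor of this conciseness, here is a minimal, illustrative sketch (not an example from the book) of the classic word count in the Scala shell (spark-shell), where the SparkContext `sc` is predefined; `data.txt` is a placeholder input file name:

```scala
// Inside spark-shell, `sc` (a SparkContext) is already available.
// "data.txt" is a hypothetical input file.
val lines  = sc.textFile("data.txt")
val counts = lines
  .flatMap(_.split("\\s+"))    // split each line into words
  .map(word => (word, 1))      // pair each word with an initial count of 1
  .reduceByKey(_ + _)          // sum the counts per word across the cluster
counts.take(5).foreach(println)
```

The same pipeline of high-level operators runs unchanged whether `sc` points at a local machine or a large cluster.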
The main reason Spark has been adopted so quickly comes down to two factors: speed and sophistication. Spark provides order-of-magnitude performance gains for many applications using coarse-grained, immutable, distributed collections of data called Resilient Distributed Datasets (RDDs), which are spread across the cluster and can be stored in memory or on disk. An RDD offers fault tolerance: it is resilient in the sense that it cannot be changed once created, and if it is lost in the middle of a computation, Spark can recreate it from its lineage. Furthermore, an RDD is distributed automatically across the cluster by means of partitions that hold your data. You can also keep your data in memory via Spark's caching mechanism, which enables applications that would otherwise run on Hadoop-based MapReduce clusters to execute up to 100 times faster for iterative in-memory workloads, and up to 10 times faster for disk-based operations.
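The caching behavior and lineage described above can be sketched as follows (again inside spark-shell with `sc` predefined; the numeric range is made up for illustration):

```scala
// Build an RDD whose lineage is: parallelize -> map.
// If a cached partition is lost, Spark recomputes it from this lineage.
val numbers = sc.parallelize(1 to 100000)
val squares = numbers.map(n => n.toLong * n).cache() // mark for in-memory reuse

// The first action materializes the RDD and populates the cache ...
val total = squares.sum()
// ... later actions reuse the in-memory partitions instead of recomputing.
val howMany = squares.count() // 100000
```

Note that `cache()` only marks the RDD for storage; nothing is computed until the first action (here, `sum()`) is invoked.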
Let's look at a surprising statistic about Spark and its computational power. Spark overtook Hadoop-based MapReduce by winning the 2014 Gray Sort Benchmark in the 100 TB category, an industry benchmark measuring how fast a system can sort 100 TB of data (1 trillion records) (please refer to http://spark.apache.org/news/spark-wins-daytona-gray-sort-100tb-benchmark.html and http://sortbenchmark.org/). It subsequently became the open source engine for sorting at petabyte scale (please refer to https://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html for more information). In comparison, the previous world record set by Hadoop MapReduce used 2,100 machines and took 72 minutes, which means that Spark sorted the same data three times faster using ten times fewer machines. Moreover, you can combine multiple libraries seamlessly to develop large-scale machine learning and data analytics pipelines, and execute jobs on various cluster managers, such as Hadoop YARN or Mesos, or in the cloud, accessing data storage and sources such as HDFS, Cassandra, HBase, Amazon S3, or even RDBMSs. Jobs can also be executed in standalone mode on a local PC or cluster, or even on AWS EC2. Therefore, deploying a Spark application on a cluster is easy (we will show how to deploy a Spark application on a cluster later in this chapter).
The other beauties of Spark are that it is open source and platform independent. These are also among its greatest advantages: it is free to use, distribute, and modify, and you can develop applications with it on any platform. An open source project is also more secure, as the code is accessible to everyone and anyone can fix bugs as they are found. Consequently, Spark has evolved so rapidly that it has become the largest open source project in big data, with 750+ contributors from 200+ organizations.
In this section, we will present a chronology of Spark that shows how it evolved and emerged as a revolution in big data processing and cluster computing. We will also briefly describe the Spark ecosystem, to explain Spark's features and facilities in more detail.
The traditional data processing paradigm is commonly referred to as the client-server model, in which data is moved to the code. The database server (or simply the server) was mainly responsible for performing data operations and returning the results to the client program. However, as the number of tasks to be computed grew, the variety of operations and client devices also started to increase exponentially, and with them an increasingly complex array of computing endpoints behind the servers. To sustain this computing model, the application (client) servers and database servers must be scaled up in balance to store and process the growing number of operations. Consequently, data propagation between nodes, and data transfer back and forth across the network, also increase drastically, so the network itself becomes a performance bottleneck. As a result, performance (in terms of both scalability and throughput) in this kind of computing paradigm inevitably degrades, as shown in the following diagram:
Figure 1: Traditional distributed processing in action.
After the successful completion of the Human Genome Project in the life sciences, real-time IoT data, sensor data, data from mobile devices, and data from the Web began creating a data deluge, contributing to big data and largely driving the evolution of data-intensive computing. Data-intensive computing is now growing at an accelerating pace, and it requires an integrated infrastructure or computing paradigm, so that computational resources and data can be brought onto a common platform with analytics performed on top of it. The reasons are diverse: big data is genuinely huge in terms of complexity (volume, variety, and velocity) and, from an operational perspective, in terms of the four Ms (that is, move, manage, merge, and munge).
In addition, since we will be talking about large-scale machine learning applications in this book, we also need to consider some additional, critical assessment parameters, such as validity, veracity, value, and visibility, to grow the business. Visibility is important because, if you have a big dataset of, say, 1 PB but no visibility into it, everything is a black hole. We will explain more about big data value in upcoming chapters.
It may not be feasible to store and process these large-scale, complex datasets on a single system; therefore, they need to be partitioned and stored across multiple physical machines. But even once big datasets are partitioned or distributed, rigorously processing and analyzing them means that both the database servers and the application servers may need to be scaled up to intensify the processing power at large scale. At that point, the same performance bottleneck reappears, and at its worst in multiple dimensions at once, which calls for a new and more data-intensive big data processing and computing paradigm.
To overcome the issues mentioned previously, a new computing paradigm is desperately needed in which, instead of moving data to the code/application, we move the code or application to the data and perform the data manipulation, processing, and associated computation at home (that is, where the data is stored). Given this motivation and objective, the reversed programming model can be called move code to data and do parallel processing on a distributed system, which is visualized in the following diagram:
Figure 2: New computing (move code to data and do parallel processing on distributed system).
To understand the workflow illustrated in Figure 2, we can envisage a new programming model, described as follows:
It's worth noticing that by moving the code to the data, the computing structure changes drastically. Most interestingly, the volume of data transferred across the network is significantly reduced. The justification is that you transfer only a small chunk of software code to the computing nodes and receive a small subset of the original data as results in return. This was the most important paradigm shift for big data processing, and Spark brought it to us with the concept of RDDs, Datasets, DataFrames, and other valuable features that amount to a great revolution in the history of big data engineering and cluster computing. However, for brevity, in the next section we will discuss only the concept of RDDs; the other computing features will be discussed in upcoming sections.
To understand the new computing paradigm, we need to understand the concept of Resilient Distributed Datasets (RDDs), and how Spark has implemented the data reference concept through them. This is what has enabled Spark to shift data processing to scale so easily. The key point about an RDD is that it helps you treat your input datasets almost like any other data object. In other words, it brings a diversity of input data types that you sorely missed in the Hadoop-based MapReduce framework.
An RDD provides fault tolerance in a resilient way: it cannot be changed once created, and the Spark engine will retry the operation if it fails. It is distributed because, once created and partitioned, an RDD is automatically distributed across the cluster by means of its partitions. RDDs let you do more with your input datasets, since they can be transformed into other forms rapidly and robustly. At the same time, RDDs can be materialized through an action and shared across applications that are logically co-related or computationally homogeneous. This is achievable because an RDD is part of Spark's general-purpose execution engine, designed for massive parallelism, so it can be applied to virtually any type of dataset.
However, to create an RDD and perform related operations on your inputs, the Spark engine requires you to draw a clear line between the data pointer (that is, the reference) and the input data itself. Basically, your driver program holds not the data but only a reference to it, while the data itself is located on remote computing nodes in the cluster.
To make data processing faster and easier, Spark supports two types of operations that can be performed on RDDs: transformations and actions (please refer to Figure 3). A transformation operation basically creates a new dataset from an existing one. An action, on the other hand, materializes a value to the driver program after a successful computation on the input datasets on the remote servers (the computing nodes, to be more exact).
The execution initiated by the driver program builds up a Directed Acyclic Graph (DAG), where nodes represent RDDs and edges represent transformation operations. However, execution on the computing nodes in a Spark cluster does not actually start until an action operation is performed. Before starting the operation, the driver program sends the execution graph (which defines the data computation pipeline or workflow) and the code block (as a domain-specific script or file) to the cluster, and each worker/computing node receives a copy from the cluster manager node:
Figure 3: RDD in action (transformation and action operation).
Before proceeding to the next section, we urge you to learn about action and transformation operations in more detail, although we will also discuss these two operations in detail in Chapter 3, Understanding the Problem by Understanding the Data. There are two types of transformation operations currently supported by Spark. The first is the narrow transformation, where no data shuffling is necessary. Typical Spark narrow transformations are performed using the filter(), sample(), map(), flatMap(), mapPartitions(), and other methods. A wide transformation, on the other hand, makes a wider change to your input datasets, bringing data together on a common node from multiple partitions through data shuffling. Wide transformation operations include groupByKey(), reduceByKey(), union(), intersection(), join(), and so on.
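To make the narrow/wide distinction concrete without needing a Spark cluster, here is a hedged, plain-Scala analogue using only the standard collections (names and data are illustrative): element-wise operations such as map() and filter() correspond to narrow transformations, while the key-based regrouping in the word count corresponds to a wide transformation such as reduceByKey(), which in Spark would require shuffling data between partitions:

```scala
// Narrow-style, element-wise operations: each output element depends
// on exactly one input element, so no regrouping of data is needed.
val lines   = List("spark makes big data simple", "big data with spark")
val words   = lines.flatMap(_.split(" "))   // analogous to RDD.flatMap()
val longish = words.filter(_.length > 3)    // analogous to RDD.filter()

// Wide-style operation: regrouping by key, analogous to reduceByKey().
// In Spark, this step would shuffle data across partitions so that all
// occurrences of a key land on the same node.
val counts = longish.groupMapReduce(identity)(_ => 1)(_ + _)
```

Here counts maps each surviving word to its frequency, for example "spark" to 2; in a real Spark job the same pipeline would be written against an RDD, with the final step as reduceByKey(_ + _).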
An action operation returns the final result of the RDD computations to the driver program by triggering execution of the transformations as a DAG. The materialized results are actually written to storage, including the intermediate transformation results for the data objects, before the final results are returned. Common actions include first(), take(), reduce(), collect(), count(), saveAsTextFile(), saveAsSequenceFile(), and so on. At this point, we believe that you have grasped the basic operations on top of RDDs, so we can now define an RDD and related programs in a natural way. A typical RDD programming model that Spark provides can be described as follows:
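The lazy-transformation/eager-action split can be mimicked in plain Scala with views (this is a conceptual analogue only; in a real Spark program you would build the same pipeline on an RDD via sc.parallelize, map, filter, and collect):

```scala
val data = (1 to 10).toList

// "Transformations": building the pipeline computes nothing yet,
// just as Spark only records the lineage of an RDD at this point.
val pipeline = data.view.map(_ * 2).filter(_ > 10)

// "Actions": forcing the view triggers the actual computation,
// much like collect() or reduce() on an RDD.
val result = pipeline.toList
val total  = result.sum
```

Until the view is forced, no element is doubled or filtered; Spark behaves the same way, deferring all work until an action arrives so it can optimize the whole DAG at once.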
Wait! So far we have moved smoothly. We assumed that you would ship your application code to the computing nodes in the cluster, but you will still have to upload or send the input datasets to the cluster to be distributed among the computing nodes, and even during a bulk upload you have to transfer the data across the network. We have also argued that the size of the application code and of the results is negligible or trivial. Another obstacle is that if we want Spark to process data at scale, data objects might need to be merged from multiple partitions first. That means we will need to shuffle data among the worker/computing nodes, which is usually done by transformation operations such as partitionBy(), intersection(), and join().
So, frankly speaking, data transfer has not been eliminated completely. As we now understand, the overheads come mainly from the bulk upload/download in these operations, and their consequences are illustrated as follows:
Figure 4: RDD in action (the caching mechanism).
Well, it's true that we have to live with these burdens. However, the situation can be tackled, or its cost significantly reduced, by using Spark's caching mechanism. Imagine you are going to perform an action multiple times on the same RDD, which has a long lineage; this causes an increase in execution time as well as data movement inside a computing node. You can remove (or at least reduce) this redundancy with Spark's caching mechanism (Figure 4), which stores the computed result of the RDD in memory and thereby eliminates the recurrent computation each time. When you cache an RDD, its partitions are loaded into the main memory of the nodes that hold it instead of onto disk (however, if there is not enough space in memory, the disk is used instead). This technique enables big data applications on Spark clusters to outperform MapReduce significantly in each round of parallel processing. We will discuss Spark data manipulation and other techniques in detail in Chapter 3, Understanding the Problem by Understanding the Data.
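The effect of caching can be sketched in plain Scala with a lazy val, which, like a cached RDD, is computed once on first use and then reused (a conceptual analogue only; in Spark you would call rdd.cache() or rdd.persist(), and the names below are illustrative):

```scala
var evaluations = 0  // counts how often the "expensive" step really runs

// Stand-in for a long-lineage computation over a dataset.
def expensivePipeline(): List[Int] = {
  evaluations += 1
  (1 to 5).map(x => x * x).toList
}

// Without caching: every "action" recomputes the whole pipeline.
val a = expensivePipeline().sum
val b = expensivePipeline().sum   // evaluations is now 2

// With "caching": the lazy val is evaluated once, then reused.
lazy val cached = expensivePipeline()
val c = cached.sum
val d = cached.sum                // evaluations is now 3, not 4
```

Both uses of the cached result cost only one evaluation between them, which is precisely the saving a cached RDD gives you when several actions share a long lineage.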
To provide further enhancements and additional big data processing capabilities, Spark can be configured to run on top of existing Hadoop-based clusters. As already stated, Hadoop provides the Hadoop Distributed File System (HDFS) for efficient and cheap storage of large-scale data, but the computation MapReduce provides is fully disk-based. Another limitation of MapReduce is that only simple computations can be executed, with a high-latency batch model over static data, to be more specific. Spark's core APIs, on the other hand, are available in Java, Scala, Python, and R. Compared to MapReduce, beyond its more general and powerful programming model, Spark also provides several libraries that are part of the Spark ecosystem, delivering additional capabilities in the areas of big data analytics, processing, and machine learning. The Spark ecosystem consists of the following components, as shown in Figure 5:
Figure 5: The Spark ecosystem (as of Spark 1.6.1).
As we have already stated, it is entirely possible to combine these APIs seamlessly to develop large-scale machine learning and data analytics applications. Moreover, jobs can be executed on various cluster managers such as Hadoop YARN, Mesos, or standalone, or in the cloud, accessing data storage and sources such as HDFS, Cassandra, HBase, Amazon S3, or even RDBMSs.
Nevertheless, Spark keeps being enriched with other features and APIs. For example, Cisco recently announced a $150M investment in the ecosystem around its Cisco Spark Hybrid Services (http://www.cisco.com/c/en/us/solutions/collaboration/cloud-collaboration/index.html), whose open APIs could boost its popularity with developers (highly secure collaboration and connecting smartphone systems to the cloud); note, however, that Cisco Spark is a collaboration platform distinct from Apache Spark. Beyond this, Spark has been integrated with Tachyon (http://ampcamp.berkeley.edu/5/exercises/tachyon.html), a distributed in-memory storage system that further improves Spark's performance by sharing data reliably in memory.
Spark itself is written in Scala, a functional as well as object-oriented programming language that runs on top of the Java Virtual Machine (JVM). Moreover, as shown in Figure 5, Spark's ecosystem is built on top of the general core execution engine, which has extensible APIs implemented in different languages. The higher-level layers all use the Spark Core engine as their general job execution engine and provide their functionality on top of it: the high-level APIs (that is, Spark MLlib, SparkR, Spark SQL, Dataset, DataFrame, Spark Streaming, and GraphX) use the core at execution time.
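To illustrate what "functional as well as object-oriented" means in practice, here is a tiny plain-Scala sketch (the class and names are purely illustrative, not part of Spark) that mixes an immutable class with higher-order functions, the style Spark's own APIs encourage:

```scala
// Object-oriented side: an immutable case class with a method.
case class Point(x: Double, y: Double) {
  def +(other: Point): Point = Point(x + other.x, y + other.y)
}

// Functional side: immutable collections and higher-order functions.
val points      = List(Point(1, 2), Point(3, 4), Point(5, 6))
val shifted     = points.map(p => Point(p.x + 1, p.y))  // pure transformation
val centroidSum = points.reduce(_ + _)                  // fold using the OO method
```

The same blend carries directly into Spark code, where you pass functions such as `p => Point(p.x + 1, p.y)` to map() on an RDD instead of on a List.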
Spark has brought in-memory computing to great prominence. This concept enables the Spark core engine to deliver speed through a generalized execution model on which diverse applications can be developed.
The low-level implementations of general-purpose data computing and machine learning algorithms, exposed through Java, Scala, R, and Python, are easy to use for big data application development. The Spark framework is built on Scala, so developing ML applications in Scala gives you access to the latest features, which might not initially be available in the other Spark languages. However, this is not a big problem, as the open source community also takes care of the needs of developers around the globe. Therefore, if you need a particular machine learning algorithm and want to add it to the Spark library, you can contribute it to the Spark community. The source code of Spark is openly available on GitHub at https://github.com/apache/spark as the Apache Spark mirror. You can open a pull request, and the open source community will review your changes or algorithm before adding it to the master branch. For more information, please check the Spark Confluence wiki at https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark.
Python has long been a great arsenal for data scientists, and its role in Spark is no different: Python has some excellent libraries for data analysis and processing; however, it is comparatively slower than Scala. R, on the other hand, has a rich environment for data manipulation, data pre-processing, graphical analysis, machine learning, and statistical analysis, which can help increase developer productivity. Java is certainly a good choice for developers coming from a Java and Hadoop background. However, Java has a similar problem to Python, since it is also slower than Scala.
A recent survey presented on the Databricks website at http://go.databricks.com/2015-spark-survey