Apache Spark 2.x Machine Learning Cookbook

Siamak Amirghodsi

Description

Simplify machine learning model implementations with Spark

About This Book

  • Solve the day-to-day problems of data science with Spark
  • This unique cookbook consists of exciting and intuitive numerical recipes
  • Optimize your work by acquiring, cleaning, analyzing, predicting, and visualizing your data

Who This Book Is For

This book is for Scala developers with a fairly good exposure to and understanding of machine learning techniques, but who lack practical implementations with Spark. A solid knowledge of machine learning algorithms is assumed, as well as hands-on experience of implementing ML algorithms with Scala. However, you do not need to be acquainted with the Spark ML libraries and ecosystem.

What You Will Learn

  • Get to know how Scala and Spark go hand-in-hand for developers when developing ML systems with Spark
  • Build a recommendation engine that scales with Spark
  • Find out how to build unsupervised clustering systems to classify data in Spark
  • Build machine learning systems with the Decision Tree and Ensemble models in Spark
  • Deal with the curse of high-dimensionality in big data using Spark
  • Implement text analytics for search engines in Spark
  • Implement streaming machine learning systems using Spark

In Detail

Machine learning aims to extract knowledge from data, relying on fundamental concepts in computer science, statistics, probability, and optimization. Learning about algorithms enables a wide range of applications, from everyday tasks such as product recommendations and spam filtering to cutting edge applications such as self-driving cars and personalized medicine. You will gain hands-on experience of applying these principles using Apache Spark, a resilient cluster computing system well suited for large-scale machine learning tasks.

This book begins with a quick overview of setting up the necessary IDEs to facilitate the execution of the code examples covered in the various chapters. It also highlights some key issues developers face while working with machine learning algorithms on the Spark platform. We then progress through the various Spark APIs and the implementation of ML algorithms, developing classification systems, recommendation engines, text analytics, clustering, and learning systems. Toward the final chapters, we focus on building high-end applications and explain the various unsupervised methodologies and challenges to tackle when implementing ML systems with big data.

Style and approach

This book is packed with intuitive recipes supported by line-by-line explanations to help you understand how to optimize your workflow and resolve problems when working with complex data modeling tasks and predictive algorithms. It is a valuable resource for data scientists and those working on large-scale data projects.




Apache Spark 2.x Machine Learning Cookbook

Over 100 recipes to simplify machine learning model implementations with Spark

Siamak Amirghodsi
Meenakshi Rajendran
Broderick Hall
Shuen Mei

BIRMINGHAM - MUMBAI

Apache Spark 2.x Machine Learning Cookbook

Copyright © 2017 Packt Publishing

 

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

 

First published: September 2017

 

Production reference: 1200917

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.

ISBN 978-1-78355-160-6

 

www.packtpub.com

Credits

Authors

Siamak Amirghodsi

Meenakshi Rajendran

Broderick Hall

Shuen Mei

Copy Editor

Safis Editing

Reviewers

Sumit Pal

Mohammad Guller

Project Coordinator

Sheejal Shah

Commissioning Editor

Ashwin Nair

Proofreader

Safis Editing

Acquisition Editor

Vinay Argekar

Indexer

Rekha Nair

Content Development Editor

Nikhil Borkar

Graphics

Kirk D'Penha

Technical Editor

Madhunikita Sunil Chindarkar

Production Coordinator

Melwyn Dsa

About the Authors

Siamak Amirghodsi (Sammy) is a world-class senior technology executive with an entrepreneurial track record of overseeing big data strategies, cloud transformation, quantitative risk management, advanced analytics, large-scale regulatory data platforming, enterprise architecture, technology road mapping, multi-project execution, and organizational streamlining in Fortune 20 environments in a global setting.

Siamak is a hands-on big data, cloud, machine learning, and AI expert, and is currently overseeing the large-scale cloud data platforming and advanced risk analytics build out for a tier-1 financial institution in the United States. Siamak's interests include building advanced technical teams, executive management, Spark, Hadoop, big data analytics, AI, deep learning nets, TensorFlow, cognitive models, swarm algorithms, real-time streaming systems, quantum computing, financial risk management, trading signal discovery, econometrics, long-term financial cycles, IoT, blockchain, probabilistic graphical models, cryptography, and NLP.

Siamak is fully certified on Cloudera's big data platform and follows Apache Spark, TensorFlow, Hadoop, Hive, Pig, Zookeeper, Amazon AWS, Cassandra, HBase, Neo4j, MongoDB, and GPU architecture, while being fully grounded in the traditional IBM/Oracle/Microsoft technology stack for business continuity and integration.

Siamak has a PMP designation. He holds an advanced degree in computer science and an MBA from the University of Chicago (Chicago Booth), with an emphasis on strategic management, quantitative finance, and econometrics.

 

 

Meenakshi Rajendran is a hands-on big data analytics and data governance manager with expertise in large-scale data platforming and machine learning program execution on a global scale. She is experienced in the end-to-end delivery of data analytics and data science products for leading financial institutions. Meenakshi holds a master's degree in business administration and is a certified PMP with over 13 years of experience in global software delivery environments. She not only understands the underpinnings of big data and data science technology but also has a solid understanding of the human side of the equation.

Meenakshi’s favorite languages are Python, R, Julia, and Scala. Her areas of research and interest are Apache Spark, cloud, regulatory data governance, machine learning, Cassandra, and managing global data teams at scale. In her free time, she dabbles in software engineering management literature, cognitive psychology, and chess for relaxation.

Broderick Hall is a hands-on big data analytics expert who holds a master's degree in computer science and has 20 years of experience in designing and developing complex enterprise-wide software applications with real-time and regulatory requirements on a global scale. He has extensive experience in designing and building real-time financial applications for some of the largest financial institutions and exchanges in the USA. He is a deep learning early adopter and is currently working on a large-scale cloud-based data platform with deep learning net augmentation.

Broderick has extensive experience working in healthcare, travel, real estate, and data center management. Broderick also enjoys his role as an adjunct professor, instructing courses in Java programming and object-oriented programming. He is currently focused on delivering real-time big data mission-critical analytics applications in the financial services industry.

Broderick has been actively involved with Hadoop, Spark, Cassandra, TensorFlow, and deep learning since the early days, while actively pursuing machine learning, cloud architecture, data platforms, data science, and practical applications in cognitive sciences. He enjoys programming in Scala, Python, R, Java, and Julia.

 

 

 

Shuen Mei is a big data analytics platform expert with 15+ years of experience in the financial services industry. He is experienced in designing, building, and executing large-scale, enterprise-distributed financial systems with mission-critical low-latency requirements. He is certified in Apache Spark and the Cloudera Big Data platform, including the Developer, Admin, and HBase certifications.

Shuen is also a certified AWS solutions architect with an emphasis on petabyte-range real-time data platform systems. Shuen is a skilled software engineer with extensive experience in delivering infrastructure, code, data architecture, and performance tuning solutions in trading and finance for Fortune 100 companies.

Shuen holds a master's degree in MIS from the University of Illinois. He actively follows Spark, TensorFlow, Hadoop, cloud architecture, Apache Flink, Hive, HBase, Cassandra, and related systems. He is passionate about Scala, Python, Java, Julia, cloud computing, machine learning algorithms, and deep learning at scale.

About the Reviewer

Sumit Pal, who has authored SQL on Big Data - Technology, Architecture, and Innovations by Apress, has more than 22 years of experience in the software industry in various roles, spanning companies from startups to enterprises. 

Sumit is an independent consultant working with big data, data visualization, and data science, and he is a software architect building end-to-end data-driven analytic systems.

Sumit has worked for Microsoft (SQL server development team), Oracle (OLAP development team), and Verizon (big data analytics team) in a career spanning 22 years. Currently, he works for multiple clients advising them on their data architectures and big data solutions, and does hands-on coding with Spark, Scala, Java, and Python.

Sumit has spoken at the following Big Data Conferences:

Data Summit NY, May 2017

Big Data Symposium Boston, May 2017

Apache Big Data (Linux Foundation), May 2016, Vancouver, Canada

Data Center World, March 2016, Las Vegas

Chicago, Nov 2015

Global Big Data Conference, Boston, Aug 2015

Sumit has also developed a Big Data Analyst Training course for Experfy, more details of which can be found at https://www.experfy.com/training/courses/big-data-analyst.

Sumit has extensive experience in building scalable systems across the stack, from the middle tier and data tier to visualization for analytics applications, using big data and NoSQL databases. He has deep expertise in database internals, data warehouses, dimensional modeling, data science with Java and Python, and SQL.

Sumit started his career as a part of the SQL Server Development Team at Microsoft in 1996-97 and then as a core server engineer for Oracle Corporation at their OLAP Development team in Burlington, MA.

Sumit has also worked at Verizon as an Associate Director for big data architecture, where he strategized, managed, architected, and developed platforms and solutions for analytics and machine learning applications. He has also served as a chief architect at ModelN/LeapfrogRX (2006-2013), where he architected the middle-tier core analytics platform with open source OLAP engine (Mondrian) on J2EE and solved some complex Dimensional ETL, Modeling, and performance optimization problems.

Sumit has an MS and a BS in computer science. He hiked to the Mt. Everest base camp in October 2016.

 

 

www.PacktPub.com

For support files and downloads related to your book, please visit www.PacktPub.com.

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and, as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

https://www.packtpub.com/mapt

Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.

Why subscribe?

Fully searchable across every book published by Packt

Copy and paste, print, and bookmark content

On demand and accessible via a web browser

Customer Feedback

Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial process. To help us improve, please leave us an honest review.

If you'd like to join our team of regular reviewers, you can email us at [email protected]. We award our regular reviewers with free eBooks and videos in exchange for their valuable feedback. Help us be relentless in improving our products!

Table of Contents

Preface

What this book covers

What you need for this book

Who this book is for

Sections

Getting ready

How to do it…

How it works…

There's more…

See also

Conventions

Reader feedback

Customer support

Downloading the example code

Errata

Piracy

Questions

Practical Machine Learning with Spark Using Scala

Introduction

Apache Spark

Machine learning

Scala

Software versions and libraries used in this book

Downloading and installing the JDK

Getting ready

How to do it...

Downloading and installing IntelliJ

Getting ready

How to do it...

Downloading and installing Spark

Getting ready

How to do it...

Configuring IntelliJ to work with Spark and run Spark ML sample codes

Getting ready

How to do it...

There's more...

See also

Running a sample ML code from Spark

Getting ready

How to do it...

Identifying data sources for practical machine learning

Getting ready

How to do it...

See also

Running your first program using Apache Spark 2.0 with the IntelliJ IDE

How to do it...

How it works...

There's more...

See also

How to add graphics to your Spark program

How to do it...

How it works...

There's more...

See also

Just Enough Linear Algebra for Machine Learning with Spark

Introduction

Package imports and initial setup for vectors and matrices

How to do it...

There's more...

See also

Creating DenseVector and setup with Spark 2.0

How to do it...

How it works...

There's more...

See also

Creating SparseVector and setup with Spark

How to do it...

How it works...

There's more...

See also

Creating dense matrix and setup with Spark 2.0

Getting ready

How to do it...

How it works...

There's more...

See also

Using sparse local matrices with Spark 2.0

How to do it...

How it works...

There's more...

See also

Performing vector arithmetic using Spark 2.0

How to do it...

How it works...

There's more...

See also

Performing matrix arithmetic using Spark 2.0

How to do it...

How it works...

Exploring RowMatrix in Spark 2.0

How to do it...

How it works...

There's more...

See also

Exploring Distributed IndexedRowMatrix in Spark 2.0

How to do it...

How it works...

See also 

Exploring distributed CoordinateMatrix in Spark 2.0

How to do it...

How it works...

See also 

Exploring distributed BlockMatrix in Spark 2.0

How to do it...

How it works...

See also 

Spark's Three Data Musketeers for Machine Learning - Perfect Together

Introduction

RDDs - what started it all...

DataFrame - a natural evolution to unite API and SQL via a high-level API

Dataset - a high-level unifying Data API

Creating RDDs with Spark 2.0 using internal data sources

How to do it...

How it works...

Creating RDDs with Spark 2.0 using external data sources

How to do it...

How it works...

There's more...

See also

Transforming RDDs with Spark 2.0 using the filter() API

How to do it...

How it works...

There's more...

See also

Transforming RDDs with the super useful flatMap() API

How to do it...

How it works...

There's more...

See also

Transforming RDDs with set operation APIs

How to do it...

How it works...

See also

RDD transformation/aggregation with groupBy() and reduceByKey()

How to do it...

How it works...

There's more...

See also

Transforming RDDs with the zip() API

How to do it...

How it works...

See also

Join transformation with paired key-value RDDs

How to do it...

How it works...

There's more...

Reduce and grouping transformation with paired key-value RDDs

How to do it...

How it works...

See also

Creating DataFrames from Scala data structures

How to do it...

How it works...

There's more...

See also

Operating on DataFrames programmatically without SQL

How to do it...

How it works...

There's more...

See also

Loading DataFrames and setup from an external source

How to do it...

How it works...

There's more...

See also

Using DataFrames with standard SQL language - SparkSQL

How to do it...

How it works...

There's more...

See also

Working with the Dataset API using a Scala Sequence

How to do it...

How it works...

There's more...

See also

Creating and using Datasets from RDDs and back again

How to do it...

How it works...

There's more...

See also

Working with JSON using the Dataset API and SQL together

How to do it...

How it works...

There's more...

See also

Functional programming with the Dataset API using domain objects

How to do it...

How it works...

There's more...

See also

Common Recipes for Implementing a Robust Machine Learning System

Introduction

Spark's basic statistical API to help you build your own algorithms

How to do it...

How it works...

There's more...

See also

ML pipelines for real-life machine learning applications

How to do it...

How it works...

There's more...

See also

Normalizing data with Spark

How to do it...

How it works...

There's more...

See also

Splitting data for training and testing

How to do it...

How it works...

There's more...

See also

Common operations with the new Dataset API

How to do it...

How it works...

There's more...

See also

Creating and using RDD versus DataFrame versus Dataset from a text file in Spark 2.0

How to do it...

How it works...

There's more...

See also

LabeledPoint data structure for Spark ML

How to do it...

How it works...

There's more...

See also

Getting access to Spark cluster in Spark 2.0

How to do it...

How it works...

There's more...

See also

Getting access to Spark cluster pre-Spark 2.0

How to do it...

How it works...

There's more...

See also

Getting access to SparkContext vis-a-vis SparkSession object in Spark 2.0

How to do it...

How it works...

There's more...

See also

New model export and PMML markup in Spark 2.0

How to do it...

How it works...

There's more...

See also

Regression model evaluation using Spark 2.0

How to do it...

How it works...

There's more...

See also

Binary classification model evaluation using Spark 2.0

How to do it...

How it works...

There's more...

See also

Multiclass classification model evaluation using Spark 2.0

How to do it...

How it works...

There's more...

See also

Multilabel classification model evaluation using Spark 2.0

How to do it...

How it works...

There's more...

See also

Using the Scala Breeze library to do graphics in Spark 2.0

How to do it...

How it works...

There's more...

See also

Practical Machine Learning with Regression and Classification in Spark 2.0 - Part I

Introduction

Fitting a linear regression line to data the old fashioned way

How to do it...

How it works...

There's more...

See also

Generalized linear regression in Spark 2.0

How to do it...

How it works...

There's more...

See also

Linear regression API with Lasso and L-BFGS in Spark 2.0

How to do it...

How it works...

There's more...

See also

Linear regression API with Lasso and 'auto' optimization selection in Spark 2.0

How to do it...

How it works...

There's more...

See also

Linear regression API with ridge regression and 'auto' optimization selection in Spark 2.0

How to do it...

How it works...

There's more...

See also

Isotonic regression in Apache Spark 2.0

How to do it...

How it works...

There's more...

See also

Multilayer perceptron classifier in Apache Spark 2.0

How to do it...

How it works...

There's more...

See also

One-vs-Rest classifier (One-vs-All) in Apache Spark 2.0

How to do it...

How it works...

There's more...

See also

Survival regression – parametric AFT model in Apache Spark 2.0

How to do it...

How it works...

There's more...

See also

Practical Machine Learning with Regression and Classification in Spark 2.0 - Part II

Introduction

Linear regression with SGD optimization in Spark 2.0

How to do it...

How it works...

There's more...

See also

Logistic regression with SGD optimization in Spark 2.0

How to do it...

How it works...

There's more...

See also

Ridge regression with SGD optimization in Spark 2.0

How to do it...

How it works...

There's more...

See also

Lasso regression with SGD optimization in Spark 2.0

How to do it...

How it works...

There's more...

See also

Logistic regression with L-BFGS optimization in Spark 2.0

How to do it...

How it works...

There's more...

See also

Support Vector Machine (SVM) with Spark 2.0

How to do it...

How it works...

There's more...

See also

Naive Bayes machine learning with Spark 2.0 MLlib

How to do it...

How it works...

There's more...

See also

Exploring ML pipelines and DataFrames using logistic regression in Spark 2.0

Getting ready

How to do it...

How it works...

There's more...

PipeLine

Vectors

See also

Recommendation Engine that Scales with Spark

Introduction

Content filtering

Collaborative filtering

Neighborhood method

Latent factor models techniques

Setting up the required data for a scalable recommendation engine in Spark 2.0

How to do it...

How it works...

There's more...

See also

Exploring the movies data details for the recommendation system in Spark 2.0

How to do it...

How it works...

There's more...

See also

Exploring the ratings data details for the recommendation system in Spark 2.0

How to do it...

How it works...

There's more...

See also

Building a scalable recommendation engine using collaborative filtering in Spark 2.0

How to do it...

How it works...

There's more...

See also

Dealing with implicit input for training

Unsupervised Clustering with Apache Spark 2.0

Introduction

Building a KMeans classifying system in Spark 2.0

How to do it...

How it works...

KMeans (Lloyd Algorithm)

KMeans++ (Arthur's algorithm)

KMeans|| (pronounced as KMeans Parallel)

There's more...

See also

Bisecting KMeans, the new kid on the block in Spark 2.0

How to do it...

How it works...

There's more...

See also

Using Gaussian Mixture and Expectation Maximization (EM) in Spark to classify data

How to do it...

How it works...

New GaussianMixture()

There's more...

See also

Classifying the vertices of a graph using Power Iteration Clustering (PIC) in Spark 2.0

How to do it...

How it works...

There's more...

See also

Latent Dirichlet Allocation (LDA) to classify documents and text into topics

How to do it...

How it works...

There's more...

See also

Streaming KMeans to classify data in near real-time

How to do it...

How it works...

There's more...

See also

Optimization - Going Down the Hill with Gradient Descent

Introduction

How do machines learn using an error-based system?

Optimizing a quadratic cost function and finding the minima using just math to gain insight

How to do it...

How it works...

There's more...

See also

Coding a quadratic cost function optimization using Gradient Descent (GD) from scratch

How to do it...

How it works...

There's more...

See also

Coding Gradient Descent optimization to solve Linear Regression from scratch

How to do it...

How it works...

There's more...

See also

Normal equations as an alternative for solving Linear Regression in Spark 2.0

How to do it...

How it works...

There's more...

See also

Building Machine Learning Systems with Decision Tree and Ensemble Models

Introduction

Ensemble models

Measures of impurity

Getting and preparing real-world medical data for exploring Decision Trees and Ensemble models in Spark 2.0

How to do it...

There's more...

Building a classification system with Decision Trees in Spark 2.0

How to do it

How it works...

There's more...

See also

Solving Regression problems with Decision Trees in Spark 2.0

How to do it...

How it works...

See also

Building a classification system with Random Forest Trees in Spark 2.0

How to do it...

How it works...

See also

Solving regression problems with Random Forest Trees in Spark 2.0

How to do it...

How it works...

See also

Building a classification system with Gradient Boosted Trees (GBT) in Spark 2.0

How to do it...

How it works....

There's more...

See also

Solving regression problems with Gradient Boosted Trees (GBT) in Spark 2.0

How to do it...

How it works...

There's more...

See also

Curse of High-Dimensionality in Big Data

Introduction

Feature selection versus feature extraction

Two methods of ingesting and preparing a CSV file for processing in Spark

How to do it...

How it works...

There's more...

See also

Singular Value Decomposition (SVD) to reduce high-dimensionality in Spark

How to do it...

How it works...

There's more...

See also

Principal Component Analysis (PCA) to pick the most effective latent factor for machine learning in Spark

How to do it...

How it works...

There's more...

See also

Implementing Text Analytics with Spark 2.0 ML Library

Introduction

Doing term frequency with Spark - everything that counts

How to do it...

How it works...

There's more...

See also

Displaying similar words with Spark using Word2Vec

How to do it...

How it works...

There's more...

See also

Downloading a complete dump of Wikipedia for a real-life Spark ML project

How to do it...

There's more...

See also

Using Latent Semantic Analysis for text analytics with Spark 2.0

How to do it...

How it works...

There's more...

See also

Topic modeling with Latent Dirichlet allocation in Spark 2.0

How to do it...

How it works...

There's more...

See also

Spark Streaming and Machine Learning Library

Introduction

Structured streaming for near real-time machine learning

How to do it...

How it works...

There's more...

See also

Streaming DataFrames for real-time machine learning

How to do it...

How it works...

There's more...

See also

Streaming Datasets for real-time machine learning

How to do it...

How it works...

There's more...

See also

Streaming data and debugging with queueStream

How to do it...

How it works...

See also

Downloading and understanding the famous Iris data for unsupervised classification

How to do it...

How it works...

There's more...

See also

Streaming KMeans for a real-time on-line classifier

How to do it...

How it works...

There's more...

See also

Downloading wine quality data for streaming regression

How to do it...

How it works...

There's more...

Streaming linear regression for a real-time regression

How to do it...

How it works...

There's more...

See also

Downloading Pima Diabetes data for supervised classification

How to do it...

How it works...

There's more...

See also

Streaming logistic regression for an on-line classifier

How to do it...

How it works...

There's more...

See also

Preface

 

Education is not the learning of facts,
but the training of the mind to think.
- Albert Einstein

Data is the new silicon of our age, and machine learning, coupled with biologically inspired cognitive systems, serves as the core foundation to not only enable but also accelerate the birth of the fourth industrial revolution. This book is dedicated to our parents, who through extreme hardship and sacrifice, made our education possible and taught us to always practice kindness.

The Apache Spark 2.x Machine Learning Cookbook is crafted by four friends with diverse backgrounds, who bring vast experience across multiple industries and academic disciplines. The team has immense experience in the subject matter at hand. The book is as much about friendship as it is about the science underpinning Spark and machine learning. We wanted to put our thoughts together and write a book for the community that not only combines Spark's ML code and real-world data sets but also provides context-relevant explanations, references, and readings for a deeper understanding and to promote further research. This book is a reflection of what our team would have wished to have when we got started with Apache Spark.

My own interest in machine learning and artificial intelligence started in the mid-eighties, when I had the opportunity to read two significant artifacts that happened to be listed back to back in Artificial Intelligence, An International Journal, Volume 28, Number 1, February 1986. While it has been a long journey for engineers and scientists of my generation, fortunately, the advancements in resilient distributed computing, cloud computing, GPUs, cognitive computing, optimization, and advanced machine learning have made a decades-long dream come true. All these advancements have become accessible to the current generation of ML enthusiasts and data scientists alike.

We live in one of the rarest periods in history--a time when multiple technological and sociological trends have merged at the same point in time. The elasticity of cloud computing, with built-in access to ML and deep learning nets, will provide a whole new set of opportunities to create and capture new markets. The emergence of Apache Spark as the lingua franca, or common language, of near real-time resilient distributed computing and data virtualization has given smart companies the opportunity to employ ML techniques at scale without heavy investment in specialized data centers or hardware.

The Apache Spark 2.x Machine Learning Cookbook is one of the most comprehensive treatments of the Apache Spark machine learning API, with selected subcomponents of Spark to give you the foundation you need before you can master a high-end career in machine learning and Apache Spark. The book is written with the goal of providing clarity and accessibility, and it reflects our own experience (including reading the source code) and learning curve with Apache Spark, which started with Spark 1.0.

The Apache Spark 2.x Machine Learning Cookbook lives at the intersection of Apache Spark, machine learning, and Scala. It is written through a practitioner's lens for developers and data scientists who must understand not only the code but also the details, theory, and inner workings of a given Spark ML algorithm or API in order to establish a successful career in the new economy.

The book takes the cookbook format to a whole new level by blending downloadable, ready-to-run Apache Spark ML code recipes with background, actionable theory, references, research, and real-life data sets to help the reader understand the what, how, and why behind the extensive facilities offered by Spark's machine learning library. The book starts by laying the foundations needed to succeed and then rapidly evolves to cover all the meaningful ML algorithms available in Apache Spark.

What this book covers

Chapter 1, Practical Machine Learning with Spark Using Scala, covers installing and configuring a real-life development environment with machine learning and programming with Apache Spark. Using screenshots, it walks you through downloading, installing, and configuring Apache Spark and IntelliJ IDEA along with the necessary libraries that would reflect a developer’s desktop in a real-world setting. It then proceeds to identify and list over 40 data repositories with real-world data sets that can help the reader in experimenting and advancing even further with the code recipes. In the final step, we run our first ML program on Spark and then provide directions on how to add graphics to your machine learning programs, which are used in the subsequent chapters.

Chapter 2, Just Enough Linear Algebra for Machine Learning with Spark, covers the use of linear algebra (vectors and matrices), which is the foundation of some of the most monumental works in machine learning. It provides a comprehensive treatment of the DenseVector, SparseVector, and matrix facilities available in Apache Spark through the recipes in the chapter. It provides recipes for both local and distributed matrices, including RowMatrix, IndexedRowMatrix, CoordinateMatrix, and BlockMatrix, to give a detailed explanation of this topic. We included this chapter because mastery of Spark's ML/MLlib was only possible by reading most of the source code line by line and understanding how matrix decomposition and vector/matrix arithmetic work underneath the more coarse-grained algorithms in Spark.

Chapter 3, Spark's Three Data Musketeers for Machine Learning - Perfect Together, provides an end-to-end treatment of the three pillars of resilient distributed data manipulation and wrangling in Apache Spark. The chapter comprises detailed recipes covering RDDs, DataFrame, and Dataset facilities from a practitioner's point of view. Through an exhaustive list of 17 recipes, examples, references, and explanations, it lays out the foundation to build a successful career in machine learning sciences. The chapter provides both functional (code) as well as non-functional (SQL interface) programming approaches to solidify the knowledge base, reflecting the real demands of a successful Spark ML engineer at tier-1 companies.

Chapter 4, Common Recipes for Implementing a Robust Machine Learning System, covers and factors out the tasks that are common in most machine learning systems through 16 short but to-the-point code recipes that the reader can use in their own real-world systems. It covers a gamut of techniques, ranging from normalizing data to evaluating the model output, using best practice metrics via Spark’s ML/MLlib facilities that might not be readily visible to the reader. It is a combination of recipes that we use in our day-to-day jobs in most situations but are listed separately to save on space and complexity of other recipes.

Chapter 5, Practical Machine Learning with Regression and Classification in Spark 2.0 - Part I, is the first of two chapters exploring classification and regression in Apache Spark. This chapter starts with generalized linear regression (GLM), extending it to Lasso and Ridge with the different types of optimization available in Spark. The chapter then proceeds to cover isotonic regression, survival regression, the multilayer perceptron (a neural network), and the One-vs-Rest classifier.

Chapter 6, Practical Machine Learning with Regression and Classification in Spark 2.0 - Part II, is the second of the two regression and classification chapters. This chapter covers RDD-based regression systems, ranging from linear, logistic, and ridge to lasso, using Stochastic Gradient Descent (SGD) and L-BFGS optimization in Spark. The last three recipes cover Support Vector Machine (SVM) and Naive Bayes, ending with a detailed recipe for ML pipelines, which are gaining a prominent position in the Spark ML ecosystem.

Chapter 7, Recommendation Engine that Scales with Spark, covers how to explore your data set and build a movie recommendation engine using Spark’s ML library facilities. It uses a large dataset and some recipes in addition to figures and write-ups to explore the various methods of recommenders before going deep into collaborative filtering techniques in Spark.

Chapter 8, Unsupervised Clustering with Apache Spark 2.0, covers the techniques used in unsupervised learning, such as KMeans, Gaussian Mixture with Expectation Maximization (EM), Power Iteration Clustering (PIC), and Latent Dirichlet Allocation (LDA), while also covering the why and how to help the reader understand the core concepts. Using Spark Streaming, the chapter concludes with a real-time KMeans clustering recipe that classifies the input stream into labeled classes via unsupervised means.

Chapter 9, Optimization - Going Down the Hill with Gradient Descent, is a unique chapter that walks you through optimization as it applies to machine learning. It starts from a closed-form formula and quadratic cost function optimization, and then moves to using Gradient Descent (GD) to solve a regression problem from scratch. The chapter helps you look under the hood by developing your skill set with Scala code while providing an in-depth explanation of how to code and understand Gradient Descent (GD) from scratch. The chapter concludes with one of Spark's ML APIs, achieving the same results as the concepts we code from scratch.

Chapter 10, Building Machine Learning Systems with Decision Tree and Ensemble Models, covers the Tree and Ensemble models for classification and regression in depth using Spark’s machine library. We use three real-world data sets to explore the classification and regression problems using Decision Tree, Random Forest Tree, and Gradient Boosted Tree. The chapter provides an in-depth explanation of these methods in addition to plug-and-play code recipes that explore Apache Spark’s machine library step by step.

Chapter 11, The Curse of High-Dimensionality in Big Data, demystifies the art and science of dimensionality reduction and provides complete coverage of Spark's ML/MLlib facilities that support this important concept in machine learning at scale. The chapter provides sufficient and in-depth coverage of the theory (the what and why) and then proceeds to cover the two fundamental techniques available in Spark (the how) for the reader to use. The chapter covers Singular Value Decomposition (SVD), which relates well to the second chapter, and then proceeds to examine Principal Component Analysis (PCA) in depth with code and write-ups.

Chapter 12, Implementing Text Analytics with Spark 2.0 ML Library, covers the various techniques available in Spark for implementing text analytics at scale. It provides a comprehensive treatment by starting from the basics, such as Term Frequency (TF) and similarity techniques, such as Word2Vec, and moves on to analyzing a complete dump of Wikipedia for a real-life Spark ML project. The chapter concludes with an in-depth discussion and code for implementing Latent Semantic Analysis (LSA) and Topic Modeling with Latent Dirichlet Allocation (LDA) in Spark.

Chapter 13, Spark Streaming and Machine Learning Library, starts by providing an introduction to and the future direction of Spark streaming, and then proceeds to provide recipes for both RDD-based (DStream) and structured streaming to establish a baseline. The chapter then proceeds to cover all the available ML streaming algorithms in Spark at the time of writing this book. The chapter provides code and shows how to implement streaming DataFrame and streaming data sets, and then proceeds to cover queueStream for debugging before it goes into Streaming KMeans (unsupervised learning) and streaming linear models such as Linear and Logistic regression using real-world datasets.

What you need for this book

Please use the details from the software list document.

To execute the recipes in this book, you need a system running Windows 7 or above, or macOS 10.x, with the following software installed:

Apache Spark 2.x

Oracle JDK SE 1.8.x

JetBrains IntelliJ IDEA Community Edition 2016.2.x or a later version

Scala plug-in for IntelliJ 2016.2.x

Jfreechart 1.0.19

breeze-core 0.12

Cloud9 1.5.0 JAR

Bliki-core 3.0.19

hadoop-streaming 2.2.0

Jcommon 1.0.23

Lucene-analyzers-common 6.0.0

Lucene-core-6.0.0

Spark-streaming-flume-assembly 2.0.0

Spark-streaming-kafka-assembly 2.0.0

The hardware requirements for this software are mentioned in the software list provided with the code bundle of this book.

Who this book is for

This book is for Scala developers with a fairly good exposure to and understanding of machine learning techniques, but who lack practical implementations with Spark. A solid knowledge of machine learning algorithms is assumed, as well as some hands-on experience of implementing ML algorithms with Scala. However, you do not need to be acquainted with the Spark ML libraries and the ecosystem.

Sections

In this book, you will find several headings that appear frequently (Getting ready, How to do it…, How it works…, There's more…, and See also). To give clear instructions on how to complete a recipe, we use these sections as follows:

Getting ready

This section tells you what to expect in the recipe, and describes how to set up any software or any preliminary settings required for the recipe.

How to do it…

This section contains the steps required to follow the recipe.

How it works…

This section usually consists of a detailed explanation of what happened in the previous section.

There's more…

This section consists of additional information about the recipe in order to make the reader more knowledgeable about the recipe.

See also

This section provides helpful links to other useful information for the recipe.

Conventions

In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning. Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "Mac users note that we installed Spark 2.0 in the /Users/USERNAME/spark/spark-2.0.0-bin-hadoop2.7/ directory on a Mac machine."

A block of code is set as follows:

object HelloWorld extends App { println("Hello World!") }

Any command-line input or output is written as follows:

mysql -u root -p

New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: "Configure Global Libraries. Select Scala SDK as your global library."

Warnings or important notes appear like this.
Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book-what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of. To send us general feedback, simply e-mail [email protected], and mention the book's title in the subject of your message. If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you. You can download the code files by following these steps:

Log in or register to our website using your e-mail address and password.

Hover the mouse pointer on the SUPPORT tab at the top.

Click on Code Downloads & Errata.

Enter the name of the book in the Search box.

Select the book for which you're looking to download the code files.

Choose from the drop-down menu where you purchased this book from.

Click on Code Download.

You can also download the code files by clicking on the Code Files button on the book's webpage at the Packt Publishing website. This page can be accessed by entering the book's name in the Search box. Please note that you need to be logged in to your Packt account. Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

WinRAR / 7-Zip for Windows

Zipeg / iZip / UnRarX for Mac

7-Zip / PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Apache-Spark-2x-Machine-Learning-Cookbook. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books-maybe a mistake in the text or the code-we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title. To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy. Please contact us at [email protected] with a link to the suspected pirated material. We appreciate your help in protecting our authors and our ability to bring you valuable content.

Questions

If you have a problem with any aspect of this book, you can contact us at [email protected], and we will do our best to address the problem.

Practical Machine Learning with Spark Using Scala

In this chapter, we will cover:

Downloading and installing the JDK

Downloading and installing IntelliJ

Downloading and installing Spark

Configuring IntelliJ to work with Spark and run Spark ML sample codes

Running a sample ML code from Spark

Identifying data sources for practical machine learning

Running your first program using Apache Spark 2.0 with the IntelliJ IDE

How to add graphics to your Spark program

Introduction

With the recent advancements in cluster computing coupled with the rise of big data, the field of machine learning has been pushed to the forefront of computing. The need for an interactive platform that enables data science at scale has long been a dream that is now a reality.

The following three areas together have enabled and accelerated interactive data science at scale:

Apache Spark: A unified technology platform for data science that combines a fast compute engine and fault-tolerant data structures into a well-designed and integrated offering

Machine learning: A field of artificial intelligence that enables machines to mimic some of the tasks originally reserved exclusively for the human brain

Scala: A modern JVM-based language that builds on traditional languages, but unites functional and object-oriented concepts without the verboseness of other languages

First, we need to set up the development environment, which will consist of the following components:

Spark

IntelliJ community edition IDE

Scala

The recipes in this chapter will give you detailed instructions for installing and configuring the IntelliJ IDE, Scala plugin, and Spark. After the development environment is set up, we'll proceed to run one of the Spark ML sample codes to test the setup.

Apache Spark

Apache Spark is emerging as the de facto platform and trade language for big data analytics and as a complement to the Hadoop paradigm. Spark enables a data scientist to work in the manner that is most conducive to their workflow right out of the box. Spark's approach is to process the workload in a completely distributed manner without the need for MapReduce (MR) or repeated writing of the intermediate results to a disk.

Spark provides an easy-to-use distributed framework in a unified technology stack, which has made it the platform of choice for data science projects, which more often than not require an iterative algorithm that eventually converges toward a solution. These algorithms, due to their inner workings, generate a large amount of intermediate results that need to pass from one stage to the next. The need for an interactive tool with a robust, native, distributed machine learning library (MLlib) rules out a disk-based approach for most data science projects.
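As a minimal sketch of why keeping intermediate results in memory matters for iterative workloads (the dataset and the toy refinement loop below are made up purely for illustration), consider caching an RDD that is reused on every iteration instead of recomputing it or re-reading it from disk each time:

import org.apache.spark.sql.SparkSession

object IterativeCachingSketch extends App {
  val spark = SparkSession.builder
    .master("local[*]")
    .appName("IterativeCachingSketch")
    .getOrCreate()

  // A tiny synthetic dataset; a real job would load a large distributed dataset.
  val points = spark.sparkContext.parallelize(Seq(1.0, 2.0, 3.0, 4.0, 5.0))

  // cache() keeps the RDD in memory, so every pass of the loop below reuses it
  // instead of recomputing its lineage or spilling intermediate results to disk.
  points.cache()

  var estimate = 0.0
  for (_ <- 1 to 10) {               // a toy iterative refinement loop
    val mean = points.sum() / points.count()
    estimate = (estimate + mean) / 2.0
  }
  println(s"Final estimate: $estimate")

  spark.stop()
}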

Spark has a different approach toward cluster computing. It solves the problem as a technology stack rather than as an ecosystem. A large number of centrally managed libraries combined with a lightning-fast compute engine that can support fault-tolerant data structures has poised Spark to take over Hadoop as the preferred big data platform for analytics.

Spark has a modular approach, as depicted in the following diagram:

Machine learning

The aim of machine learning is to produce machines and devices that can mimic human intelligence and automate some of the tasks that have been traditionally reserved for a human brain. Machine learning algorithms are designed to go through very large data sets in a relatively short time and approximate answers that would have taken a human much longer to process.

The field of machine learning takes many forms; at a high level, it can be divided into supervised and unsupervised learning. Supervised learning algorithms are a class of ML algorithms that use a training set (that is, labeled data) to compute a probabilistic distribution or graphical model that in turn allows them to classify new data points without further human intervention. Unsupervised learning is a type of machine learning algorithm used to draw inferences from datasets consisting of input data without labeled responses.
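To make the distinction concrete with the DataFrame-based Spark ML API covered later in this book, the following minimal sketch (the four-row dataset is made up purely for illustration) fits a supervised logistic regression model on labeled points and an unsupervised KMeans model on the same features without the labels:

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object SupervisedVsUnsupervisedSketch extends App {
  val spark = SparkSession.builder
    .master("local[*]")
    .appName("SupervisedVsUnsupervisedSketch")
    .getOrCreate()

  // Tiny made-up dataset of (label, features) rows.
  val data = spark.createDataFrame(Seq(
    (0.0, Vectors.dense(0.1, 0.2)),
    (0.0, Vectors.dense(0.2, 0.1)),
    (1.0, Vectors.dense(0.9, 1.0)),
    (1.0, Vectors.dense(1.0, 0.8))
  )).toDF("label", "features")

  // Supervised: the labels guide the model during training.
  val lrModel = new LogisticRegression().fit(data)
  println(s"Logistic regression coefficients: ${lrModel.coefficients}")

  // Unsupervised: only the features are used; the algorithm discovers the structure.
  val kmModel = new KMeans().setK(2).fit(data.select("features"))
  kmModel.clusterCenters.foreach(center => println(s"Cluster center: $center"))

  spark.stop()
}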

Out of the box, Spark offers a rich set of ML algorithms that can be deployed on large datasets without any further coding. The following figure depicts Spark's MLlib algorithms as a mind map. Spark's MLlib is designed to take advantage of parallelism while having fault-tolerant distributed data structures. Spark refers to such data structures as Resilient Distributed Datasets or RDDs:

Scala

Scala is a modern programming language that is emerging as an alternative to traditional programming languages such as Java and C++. Scala is a JVM-based language that not only offers a concise syntax without the traditional boilerplate code, but also incorporates both object-oriented and functional programming into an extremely crisp and extraordinarily powerful type-safe language.

Scala takes a flexible and expressive approach, which makes it perfect for interacting with Spark's MLlib. The fact that Spark itself is written in Scala provides strong evidence that the Scala language is a full-service programming language that can be used to create sophisticated system code with heavy performance needs.

Scala builds on Java's tradition by addressing some of its shortcomings, while avoiding an all-or-nothing approach. Scala code compiles into Java bytecode, which in turn makes it possible to coexist with rich Java libraries interchangeably. The ability to use Java libraries with Scala and vice versa provides continuity and a rich environment for software engineers to build modern and complex machine learning systems without being fully disconnected from the Java tradition and code base.

Scala fully supports a feature-rich functional programming paradigm with standard support for lambdas, currying, type inference, immutability, lazy evaluation, and a pattern-matching paradigm reminiscent of Perl without the cryptic syntax. Scala is an excellent match for machine learning programming due to its support for algebra-friendly data types, anonymous functions, covariance, contravariance, and higher-order functions.
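The short, self-contained snippet below (illustrative only; the names are arbitrary) shows a few of these features in action: an anonymous function passed to a higher-order function, a curried function, and pattern matching:

object ScalaFeaturesSketch extends App {
  // Anonymous function (lambda) passed to the higher-order function map
  val squares = List(1, 2, 3, 4).map(x => x * x)

  // Currying: a function that takes its arguments in two parameter lists
  def scale(factor: Double)(x: Double): Double = factor * x
  val double = scale(2.0) _          // partial application yields a Double => Double

  // Pattern matching on a tuple
  val point = (0, 5)
  val description = point match {
    case (0, 0) => "origin"
    case (0, _) => "on the y-axis"
    case (_, 0) => "on the x-axis"
    case _      => "somewhere else"
  }

  println(squares)            // List(1, 4, 9, 16)
  println(double(21.0))       // 42.0
  println(description)        // on the y-axis
}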

Here's a hello world program in Scala:

object HelloWorld extends App { println("Hello World!") }

Compiling and running HelloWorld in Scala looks like this:
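Assuming the Scala binaries are on your PATH and the snippet above is saved as HelloWorld.scala, a typical command-line session is roughly as follows:

$ scalac HelloWorld.scala
$ scala HelloWorld
Hello World!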

The Apache Spark Machine Learning Cookbook takes a practical approach by offering a multi-disciplinary view with the developer in mind. This book focuses on the interactions and cohesiveness of machine learning, Apache Spark, and Scala. We also take an extra step and teach you how to set up and run a comprehensive development environment familiar to a developer, and we provide code snippets that you would otherwise have to run in an interactive shell without the modern facilities that an IDE provides.

Software versions and libraries used in this book

The following table provides a detailed list of software versions and libraries used in this book. If you follow the installation instructions covered in this chapter, it will include most of the items listed here. Any other JAR or library files that may be required for specific recipes are covered via additional installation instructions in the respective recipes:

Core systems        Version

Spark               2.0.0
Java                1.8
IntelliJ IDEA       2016.2.4
Scala-sdk           2.11.8

Miscellaneous JARs that will be required are as follows:

Miscellaneous JARs                    Version

bliki-core                            3.0.19
breeze-viz                            0.12
Cloud9                                1.5.0
Hadoop-streaming                      2.2.0
JCommon                               1.0.23
JFreeChart                            1.0.19
lucene-analyzers-common               6.0.0
Lucene-Core                           6.0.0
scopt                                 3.3.0
spark-streaming-flume-assembly        2.0.0
spark-streaming-kafka-0-8-assembly    2.0.0
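If you prefer to manage these libraries with a build tool rather than adding JAR files to the project by hand (the book itself wires the JARs through IntelliJ), a minimal build.sbt along the following lines pulls in the core Spark dependencies. The artifact coordinates shown are our assumption of the standard Maven coordinates, not something prescribed by the book:

name := "spark-ml-cookbook"

version := "1.0"

scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"      % "2.0.0",
  "org.apache.spark" %% "spark-sql"       % "2.0.0",
  "org.apache.spark" %% "spark-mllib"     % "2.0.0",
  "org.apache.spark" %% "spark-streaming" % "2.0.0",
  "org.jfree"         % "jfreechart"      % "1.0.19",
  "org.scalanlp"     %% "breeze-viz"      % "0.12"
)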

 

We have additionally tested all the recipes in this book on Spark 2.1.1 and found that the programs executed as expected. For learning purposes, it is recommended that you use the software versions and libraries listed in these tables.

To stay current with the rapidly changing Spark landscape and documentation, the API links to the Spark documentation mentioned throughout this book point to the latest version of Spark 2.x.x, but the API references in the recipes are explicitly for Spark 2.0.0.

All the Spark documentation links provided in this book will point to the latest documentation on Spark's website. If you prefer to look for documentation for a specific version of Spark (for example, Spark 2.0.0), look for relevant documentation on the Spark website using the following URL:

https://spark.apache.org/documentation.html

We've made the code as simple as possible for clarity purposes rather than demonstrating the advanced features of Scala.

Downloading and installing the JDK

The first step is to download the JDK development environment that is required for Scala/Spark development.

Getting ready

When you are ready to download and install the JDK, access the following link:

http://www.oracle.com/technetwork/java/javase/downloads/index.html

How to do it...

After successful download, follow the on-screen instructions to install the JDK.

Downloading and installing IntelliJ

IntelliJ IDEA Community Edition is a lightweight IDE for Java SE, Groovy, Scala, and Kotlin development. To complete the setup of your machine learning development environment with Spark, the IntelliJ IDE needs to be installed.

Getting ready

When you are ready to download and install IntelliJ, access the following link:

https://www.jetbrains.com/idea/download/

How to do it...

At the time of writing, we are using IntelliJ version 15.x or later (for example, version 2016.2.4) to test the examples in the book, but feel free to download the latest version. Once the installation file is downloaded, double-click on the downloaded file (.exe) and begin to install the IDE. Leave all the installation options at the default settings if you do not want to make any changes. Follow the on-screen instructions to complete the installation:

Downloading and installing Spark

We now proceed to download and install Spark.

Getting ready

When you are ready to download and install Spark, access the Apache website at this link:

http://spark.apache.org/downloads.html

How to do it...

Go to the Apache website and select the required download parameters, as shown in this screenshot:

Make sure to accept the default choices (click on Next) and proceed with the installation.

Configuring IntelliJ to work with Spark and run Spark ML sample codes

We need to run some configurations to ensure that the project settings are correct before being able to run the samples provided by Spark or any of the programs listed in this book.

Getting ready

We need to be particularly careful when configuring the project structure and global libraries. After we set everything up, we proceed to run the sample ML code provided by the Spark team to verify the setup. Sample code can be found under the Spark directory or can be obtained by downloading the Spark source code with samples.

How to do it...

The following are the steps for configuring IntelliJ to work with Spark MLlib and for running the sample ML code provided by Spark in the examples directory. The examples directory can be found in your home directory for Spark. Use the Scala samples to proceed:

Click on the Project Structure... option, as shown in the following screenshot, to configure the project settings:

Verify the settings:

Configure Global Libraries. Select Scala SDK as your global library:

Select the JARs for the new Scala SDK and let the download complete:

Select the project name:

Verify the settings and additional libraries:

Add dependency JARs. Select modules under the Project Settings in the left-hand pane and click on dependencies to choose the required JARs, as shown in the following screenshot:

Select the JAR files provided by Spark. Choose Spark's default installation directory and then select the lib directory:

We then select the JAR files for examples that are provided for Spark out of the box.

Add the required JARs by verifying that you selected and imported all the JARs listed under External Libraries in the left-hand pane:

Spark 2.0 uses Scala 2.11. Two new streaming JARs, Flume and Kafka, are needed to run the examples, and can be downloaded from the following URLs:

https://repo1.maven.org/maven2/org/apache/spark/spark-streaming-flume-assembly_2.11/2.0.0/spark-streaming-flume-assembly_2.11-2.0.0.jar

https://repo1.maven.org/maven2/org/apache/spark/spark-streaming-kafka-0-8-assembly_2.11/2.0.0/spark-streaming-kafka-0-8-assembly_2.11-2.0.0.jar

The next step is to download and install the Flume and Kafka JARs. For the purposes of this book, we have used the Maven repo:

Download and install the Kafka assembly:

Download and install the Flume assembly:

After the download is complete, move the downloaded JAR files to the lib directory of Spark. We used the C drive when we installed Spark:

Open your IDE and verify that all the JARs under the External Libraries folder on the left, as shown in the following screenshot, are present in your setup:

Build the example projects in Spark to verify the setup:

Verify that the build was successful:

There's more...

Prior to Spark 2.0, we needed another library from Google called Guava for facilitating I/O and for providing a set of rich methods of defining tables and then letting Spark broadcast them across the cluster. Due to dependency issues that were hard to work around, Spark 2.0 no longer uses the Guava library. Make sure you use the Guava library if you are using Spark versions prior to 2.0 (required in version 1.5.2). The Guava library can be accessed at the following URL:

https://github.com/google/guava/wiki

You may want to use Guava version 15.0, which can be found here:

https://mvnrepository.com/artifact/com.google.guava/guava/15.0

If you are using installation instructions from previous blogs, make sure to exclude the Guava library from the installation set.

See also

If there are other third-party libraries or JARs required for the completion of the Spark installation, you can find those in the following Maven repository:

https://repo1.maven.org/maven2/org/apache/spark/

Running a sample ML code from Spark

We can verify the setup by simply downloading the sample code from the Spark source tree and importing it into IntelliJ to make sure it runs.

Getting ready

We will first run the logistic regression code from the samples to verify installation. In the next section, we proceed to write our own version of the same program and examine the output in order to understand how it works.

How to do it...

Go to the source directory and pick one of the ML sample code files to run. We've selected the logistic regression example.

If you cannot find the source code in your directory, you can always download the Spark source, unzip, and then extract the examples directory accordingly.

After selecting the example, select Edit Configurations..., as shown in the following screenshot:

In the Configurations tab, define the following options:

VM options: The choice shown allows you to run a standalone Spark cluster

Program arguments: What we are supposed to pass into the program
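As a concrete illustration (the exact values shown in the book's screenshots may differ, so treat these as assumptions), a local run configuration typically sets the Spark master through a JVM system property and passes the input data path as a program argument, for example:

VM options:        -Dspark.master=local[*]
Program arguments: data/mllib/sample_libsvm_data.txt

Here, local[*] runs Spark locally with as many worker threads as there are logical cores, and the sample LIBSVM file ships under the data/mllib directory of the Spark distribution.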

Run the logistic regression by going to Run 'LogisticRegressionExample', as shown in the following screenshot:

Verify the exit code and make sure it is as shown in the following screenshot:

Identifying data sources for practical machine learning

Getting data for machine learning projects was a challenge in the past. However, now there is a rich set of public data sources specifically suitable for machine learning.

Getting ready

In addition to the university and government sources, there are many other open sources of data that can be used to learn and code your own examples and projects. We will list the data sources and show you how to best obtain and download data for each chapter.

How to do it...

The following is a list of open source data worth exploring if you would like to develop applications in this field:

UCI machine learning repository