
Combine the power of Apache Spark and Python to build effective big data applications


Key Features

Perform effective data processing, machine learning, and analytics using PySpark

Overcome challenges in developing and deploying Spark solutions using Python

Explore recipes for efficiently combining Python and Apache Spark to process data

Book Description


Apache Spark is an open source framework for efficient cluster computing with a strong interface for data parallelism and fault tolerance. The PySpark Cookbook presents effective and time-saving recipes for leveraging the power of Python and putting it to use in the Spark ecosystem.


You’ll start by learning the Apache Spark architecture and how to set up a Python environment for Spark. You’ll then get familiar with the modules available in PySpark and start using them effortlessly. In addition to this, you’ll discover how to abstract data with RDDs and DataFrames, and understand the streaming capabilities of PySpark. You’ll then move on to using ML and MLlib in order to solve any problems related to the machine learning capabilities of PySpark and use GraphFrames to solve graph-processing problems. Finally, you will explore how to deploy your applications to the cloud using the spark-submit command.


By the end of this book, you will be able to use the Python API for Apache Spark to solve any problems associated with building data-intensive applications.


What you will learn

Configure a local instance of PySpark in a virtual environment

Install and configure Jupyter in local and multi-node environments

Create DataFrames from JSON and a dictionary using pyspark.sql

Explore regression and clustering models available in the ML module

Use DataFrames to transform data used for modeling

Connect to PubNub and perform aggregations on streams

Who this book is for


The PySpark Cookbook is for you if you are a Python developer looking for hands-on recipes for using the Apache Spark 2.x ecosystem in the best possible way. A thorough understanding of Python (and some familiarity with Spark) will help you get the best out of the book.






PySpark Cookbook


Over 60 recipes for implementing big data processing and analytics using Apache Spark and Python


Denny Lee
Tomasz Drabas


BIRMINGHAM - MUMBAI

PySpark Cookbook

Copyright © 2018 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Commissioning Editor: Amey Varangaonkar
Acquisition Editor: Aman Singh
Content Development Editor: Mayur Pawanikar
Technical Editor: Dinesh Pawar
Copy Editor: Safis Editing
Project Coordinator: Nidhi Joshi
Proofreader: Safis Editing
Indexer: Mariammal Chettiyar
Graphics: Tania Dutta
Production Coordinator: Shantanu Zagade

First published: June 2018

Production reference: 1280618

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.

ISBN 978-1-78883-536-7

www.packtpub.com

mapt.io

Mapt is an online digital library that gives you full access to over 5,000 books and videos, as well as industry-leading tools to help you plan your personal development and advance your career. For more information, please visit our website.

Why subscribe?

Spend less time learning and more time coding with practical eBooks and Videos from over 4,000 industry professionals

Improve your learning with Skill Plans built especially for you

Get a free eBook or video every month

Mapt is fully searchable

Copy and paste, print, and bookmark content

PacktPub.com

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.

Contributors

About the authors

Denny Lee is a technology evangelist at Databricks. He is a hands-on data science engineer with 15+ years of experience. He focuses on solving complex large-scale data problems, providing not only architectural direction but also hands-on implementation of such systems. He has extensive experience in building greenfield teams as well as acting as a turnaround/change catalyst. Prior to joining Databricks, he was a senior director of data science engineering at Concur and was part of the incubation team that built Hadoop on Windows and Azure (currently known as HDInsight).


Tomasz Drabas is a data scientist specializing in data mining, deep learning, machine learning, choice modeling, natural language processing, and operations research. He is the author of Learning PySpark and Practical Data Analysis Cookbook. He has a PhD from the University of New South Wales, School of Aviation. His research areas are machine learning and choice modeling for airline revenue management.

About the reviewer

Sridhar Alla is a big data practitioner who helps companies solve complex problems in distributed computing and implement large-scale data science and analytics practices. He presents regularly at several prestigious conferences and provides training and consulting to companies. He loves writing code in Python, Scala, and Java, and has extensive hands-on knowledge of several Hadoop-based technologies, Spark, machine learning, deep learning, and blockchain.


Packt is searching for authors like you

If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.

Table of Contents

Title Page

Copyright and Credits

PySpark Cookbook

Packt Upsell

Why subscribe?

PacktPub.com

Contributors

About the authors

About the reviewer

Packt is searching for authors like you

Preface

Who this book is for

What this book covers

To get the most out of this book

Download the example code files

Download the color images

Conventions used

Sections

Getting ready

How to do it...

How it works...

There's more...

See also

Get in touch

Reviews

Installing and Configuring Spark

Introduction

Installing Spark requirements

Getting ready

How to do it...

How it works...

There's more...

Installing Java

Installing Python

Installing R

Installing Scala

Installing Maven

Updating PATH

Installing Spark from sources

Getting ready

How to do it...

How it works...

There's more...

See also

Installing Spark from binaries

Getting ready

How to do it...

How it works...

There's more...

Configuring a local instance of Spark

Getting ready

How to do it...

How it works...

See also

Configuring a multi-node instance of Spark

Getting ready

How to do it...

How it works...

See also

Installing Jupyter

Getting ready

How to do it...

How it works...

There's more...

See also

Configuring a session in Jupyter

Getting ready

How to do it...

How it works...

There's more...

See also

Working with Cloudera Spark images

Getting ready

How to do it...

How it works...

Abstracting Data with RDDs

Introduction

Creating RDDs

Getting ready 

How to do it...

How it works...

Spark context parallelize method

.take(...) method

Reading data from files

Getting ready 

How to do it...

How it works...

.textFile(...) method

.map(...) method

Partitions and performance

Overview of RDD transformations

Getting ready

How to do it...

.map(...) transformation

.filter(...) transformation

.flatMap(...) transformation

.distinct() transformation

.sample(...) transformation

.join(...) transformation

.repartition(...) transformation

.zipWithIndex() transformation

.reduceByKey(...) transformation

.sortByKey(...) transformation

.union(...) transformation

.mapPartitionsWithIndex(...) transformation

How it works...

Overview of RDD actions

Getting ready

How to do it...

.take(...) action

.collect() action

.reduce(...) action

.count() action

.saveAsTextFile(...) action

How it works...

Pitfalls of using RDDs

Getting ready

How to do it...

How it works...

Abstracting Data with DataFrames

Introduction

Creating DataFrames

Getting ready

How to do it...

How it works...

There's more...

From JSON

From CSV

See also

Accessing underlying RDDs

Getting ready

How to do it...

How it works...

Performance optimizations

Getting ready

How to do it...

How it works...

There's more...

See also

Inferring the schema using reflection

Getting ready

How to do it...

How it works...

See also

Specifying the schema programmatically

Getting ready

How to do it...

How it works...

See also

Creating a temporary table

Getting ready

How to do it...

How it works...

There's more...

Using SQL to interact with DataFrames

Getting ready

How to do it...

How it works...

There's more...

Overview of DataFrame transformations

Getting ready

How to do it...

The .select(...) transformation

The .filter(...) transformation

The .groupBy(...) transformation

The .orderBy(...) transformation

The .withColumn(...) transformation

The .join(...) transformation

The .unionAll(...) transformation

The .distinct(...) transformation

The .repartition(...) transformation

The .fillna(...) transformation

The .dropna(...) transformation

The .dropDuplicates(...) transformation

The .summary() and .describe() transformations

The .freqItems(...) transformation

See also

Overview of DataFrame actions

Getting ready

How to do it...

The .show(...) action

The .collect() action

The .take(...) action

The .toPandas() action

See also

Preparing Data for Modeling

Introduction

Handling duplicates

Getting ready

How to do it...

How it works...

There's more...

Only IDs differ

ID collisions

Handling missing observations

Getting ready

How to do it...

How it works...

Missing observations per row

Missing observations per column

There's more...

See also

Handling outliers

Getting ready

How to do it...

How it works...

See also

Exploring descriptive statistics

Getting ready

How to do it...

How it works...

There's more...

Descriptive statistics for aggregated columns

See also

Computing correlations

Getting ready

How to do it...

How it works...

There's more...

Drawing histograms

Getting ready

How to do it...

How it works...

There's more...

See also

Visualizing interactions between features

Getting ready

How to do it...

How it works...

There's more...

Machine Learning with MLlib

Loading the data

Getting ready

How to do it...

How it works...

There's more...

Exploring the data

Getting ready

How to do it...

How it works...

Numerical features

Categorical features

There's more...

See also

Testing the data

Getting ready

How to do it...

How it works...

See also...

Transforming the data

Getting ready

How to do it...

How it works...

There's more...

See also...

Standardizing the data

Getting ready

How to do it...

How it works...

Creating an RDD for training

Getting ready

How to do it...

Classification

Regression

How it works...

There's more...

See also

Predicting hours of work for census respondents

Getting ready

How to do it...

How it works...

Forecasting the income levels of census respondents

Getting ready

How to do it...

How it works...

There's more...

Building a clustering model

Getting ready

How to do it...

How it works...

There's more...

See also

Computing performance statistics

Getting ready

How to do it...

How it works...

Regression metrics

Classification metrics

See also

Machine Learning with the ML Module

Introducing Transformers

Getting ready

How to do it...

How it works...

There's more...

See also

Introducing Estimators

Getting ready

How to do it...

How it works...

There's more...

Introducing Pipelines

Getting ready

How to do it...

How it works...

See also

Selecting the most predictable features

Getting ready

How to do it...

How it works...

There's more...

See also

Predicting forest coverage types

Getting ready

How to do it...

How it works...

There's more...

Estimating forest elevation

Getting ready

How to do it...

How it works...

There's more...

Clustering forest cover types

Getting ready

How to do it...

How it works...

See also

Tuning hyperparameters

Getting ready

How to do it...

How it works...

There's more...

Extracting features from text

Getting ready

How to do it...

How it works...

There's more...

See also

Discretizing continuous variables

Getting ready

How to do it...

How it works...

Standardizing continuous variables

Getting ready

How to do it...

How it works...

Topic mining

Getting ready

How to do it...

How it works...

Structured Streaming with PySpark

Introduction

Understanding Spark Streaming

Understanding DStreams

Getting ready

How to do it...

Terminal 1 – Netcat window

Terminal 2 – Spark Streaming window

How it works...

There's more...

Understanding global aggregations

Getting ready

How to do it...

Terminal 1 – Netcat window

Terminal 2 – Spark Streaming window

How it works...

Continuous aggregation with structured streaming

Getting ready

How to do it...

Terminal 1 – Netcat window

Terminal 2 – Spark Streaming window

How it works...

GraphFrames – Graph Theory with PySpark

Introduction

Installing GraphFrames

Getting ready

How to do it...

How it works...

Preparing the data

Getting ready

How to do it...

How it works...

There's more...

Building the graph

How to do it...

How it works...

Running queries against the graph

Getting ready

How to do it...

How it works...

Understanding the graph

Getting ready

How to do it...

How it works...

Using PageRank to determine airport ranking

Getting ready

How to do it...

How it works...

Finding the fewest number of connections

Getting ready

How to do it...

How it works...

There's more...

See also

Visualizing the graph

Getting ready

How to do it...

How it works...

Preface

Apache Spark is an open source framework for efficient cluster computing with a strong interface for data parallelism and fault tolerance. This book presents effective and time-saving recipes for leveraging the power of Python and putting it to use in the Spark ecosystem.

You'll start by learning about the Apache Spark architecture and seeing how to set up a Python environment for Spark. You'll then get familiar with the modules available in PySpark and start using them effortlessly. In addition to this, you'll discover how to abstract data with RDDs and DataFrames, and understand the streaming capabilities of PySpark. You'll then move on to using ML and MLlib in order to solve any problems related to the machine learning capabilities of PySpark, and you'll use GraphFrames to solve graph-processing problems. Finally, you will explore how to deploy your applications to the cloud using the spark-submit command.

By the end of this book, you will be able to use the Python API for Apache Spark to solve any problems associated with building data-intensive applications.

Who this book is for

This book is for you if you are a Python developer looking for hands-on recipes for using the Apache Spark 2.x ecosystem in the best possible way. A thorough understanding of Python (and some familiarity with Spark) will help you get the best out of the book.

What this book covers

Chapter 1, Installing and Configuring Spark, shows us how to install and configure Spark, either as a local instance, as a multi-node cluster, or in a virtual environment.

Chapter 2, Abstracting Data with RDDs, covers how to work with Apache Spark Resilient Distributed Datasets (RDDs).

Chapter 3, Abstracting Data with DataFrames, explores the current fundamental data structure—DataFrames.

Chapter 4, Preparing Data for Modeling, covers how to clean up your data and prepare it for modeling.

Chapter 5, Machine Learning with MLlib, shows how to build machine learning models with PySpark's MLlib module.

Chapter 6, Machine Learning with the ML Module, moves on to the currently supported machine learning module of PySpark—the ML module.

Chapter 7, Structured Streaming with PySpark, covers how to work with Apache Spark structured streaming within PySpark.

Chapter 8, GraphFrames – Graph Theory with PySpark, shows how to work with GraphFrames for Apache Spark.

To get the most out of this book

You need the following to smoothly work through the chapters:

Apache Spark (downloadable from http://spark.apache.org/downloads.html)

Python

Download the example code files

You can download the example code files for this book from your account at www.packtpub.com. If you purchased this book elsewhere, you can visit www.packtpub.com/support and register to have the files emailed directly to you.

You can download the code files by following these steps:

1. Log in or register at www.packtpub.com.

2. Select the SUPPORT tab.

3. Click on Code Downloads & Errata.

4. Enter the name of the book in the Search box and follow the onscreen instructions.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

WinRAR/7-Zip for Windows

Zipeg/iZip/UnRarX for Mac

7-Zip/PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/PySpark-Cookbook. In case there's an update to the code, it will be updated on the existing GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Download the color images

We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: https://www.packtpub.com/sites/default/files/downloads/PySparkCookbook_ColorImages.pdf.

Sections

In this book, you will find several headings that appear frequently (Getting ready, How to do it..., How it works..., There's more..., and See also).

To give clear instructions on how to complete a recipe, these sections are used as follows:

Getting ready

This section tells you what to expect in the recipe and describes how to set up any software or any preliminary settings required for the recipe.

How to do it...

This section contains the steps required to follow the recipe.

How it works...

This section usually consists of a detailed explanation of what happened in the previous section.

There's more...

This section consists of additional information about the recipe in order to make you more knowledgeable about it.

See also

This section provides helpful links to other useful information for the recipe.

Get in touch

Feedback from our readers is always welcome.

General feedback: Email [email protected] and mention the book title in the subject of your message. If you have questions about any aspect of this book, please email us at [email protected].

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/submit-errata, select your book, click on the Errata Submission Form link, and enter the details.

Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Reviews

Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!

For more information about Packt, please visit packtpub.com.

Installing and Configuring Spark

In this chapter, we will cover how to install and configure Spark, either as a local instance, a multi-node cluster, or in a virtual environment. You will learn the following recipes:

Installing Spark requirements

Installing Spark from sources

Installing Spark from binaries

Configuring a local instance of Spark

Configuring a multi-node instance of Spark

Installing Jupyter

Configuring a session in Jupyter

Working with Cloudera Spark images

Introduction

We cannot begin a book on Spark (well, on PySpark) without first specifying what Spark is. Spark is a powerful, flexible, open source, data processing and querying engine. It is extremely easy to use and provides the means to solve a huge variety of problems, ranging from processing unstructured, semi-structured, or structured data, through streaming, up to machine learning. With over 1,000 contributors from over 250 organizations (not to mention over 3,000 Spark Meetup community members worldwide), Spark is now one of the largest open source projects in the portfolio of the Apache Software Foundation.

The origins of Spark can be found in 2012, when it was first released; Matei Zaharia developed the first versions of the Spark processing engine at UC Berkeley as part of his PhD thesis. Since then, Spark has become extremely popular, and its popularity stems from a number of factors:

It is fast: It is estimated that Spark is 100 times faster than Hadoop when working purely in memory, and around 10 times faster when reading or writing data to a disk.

It is flexible: You can leverage the power of Spark from a number of programming languages; Spark natively supports interfaces in Scala, Java, Python, and R.

It is extensible: As Spark is an open source package, you can easily extend it by introducing your own classes or extending the existing ones.

It is powerful: Many machine learning algorithms are already implemented in Spark, so you do not need to add more tools to your stack—most of the data engineering and data science tasks can be accomplished while working in a single environment.

It is familiar: Data scientists and data engineers who are accustomed to using Python's pandas, or R's data.frames or data.tables, should have a much gentler learning curve (although the differences between these data types exist). Moreover, if you know SQL, you can also use it to wrangle data in Spark!

It is scalable: Spark can run locally on your machine (with all the limitations such a solution entails). However, the same code that runs locally can be deployed to a cluster of thousands of machines with little-to-no changes.

For the remainder of this book, we will assume that you are working in a Unix-like environment such as Linux (throughout this book, we will use Ubuntu Server 16.04 LTS) or macOS (running macOS High Sierra); all the code provided has been tested in these two environments. For this chapter (and some other ones, too), an internet connection is also required as we will be downloading a bunch of binaries and sources from the internet. 

We will not be focusing on installing Spark in a Windows environment as it is not truly supported by the Spark developers. However, if you are inclined to try, you can follow some of the instructions you will find online, such as from the following link: http://bit.ly/2Ar75ld.

Knowing how to use the command line and how to set some environment variables on your system is useful, but not really required—we will guide you through the steps.

Installing Spark requirements

Spark requires a handful of environments to be present on your machine before you can install and use it. In this recipe, we will focus on getting your machine ready for Spark installation.

Getting ready

To execute this recipe, you will need a bash Terminal and an internet connection. 

Also, before we start any work, you should clone the GitHub repository for this book. The repository contains all the code (in the form of notebooks) and all the data you will need to follow the examples in this book. To clone the repository, go to http://bit.ly/2ArlBck, click on the Clone or download button, and copy the URL that appears by clicking on the icon next to it.

Next, go to your Terminal and issue the following command:

git clone [email protected]:drabastomek/PySparkCookbook.git

If your git environment is set up properly, the whole GitHub repository should clone to your disk. No other prerequisites are required.

How it works...

First, we specify all the required packages and their minimum versions; looking at the top of the checkRequirements.sh script, you can see that Spark 2.3.1 requires Java 1.8+ and Python 3.4 or higher (and we will always check for these two environments). Additionally, if you want to use R or Scala, the minimal requirements for these two packages are 3.1 and 2.11, respectively. Maven, as mentioned earlier, will be used to compile the Spark sources, and for that, Spark requires at least version 3.3.9 of Maven.
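The requirement declarations themselves are not reproduced in this excerpt, but a minimal sketch of how such minimums might be declared at the top of a bash script looks like the following. Only _java_required appears verbatim in the code shown later; the other variable names are assumptions made for illustration:

#!/bin/bash
# Minimum versions required to build and run Spark 2.3.1
# (values from the requirements discussed above; variable names other than
#  _java_required are illustrative, not taken from checkRequirements.sh)
_java_required="1.8"
_python_required="3.4"
_r_required="3.1"
_scala_required="2.11"
_mvn_required="3.3.9"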

You can check the Spark requirements here: https://spark.apache.org/docs/latest/index.html  You can check the requirements for building Spark here: https://spark.apache.org/docs/latest/building-spark.html.

Next, we parse the command-line arguments:

if [ "$_args_len" -ge 0 ]; then
    while [[ "$#" -gt 0 ]]
    do
        key="$1"

        case $key in
            -m|--Maven)
                _check_Maven_req="true"
                shift # past argument
                ;;
            -r|--R)
                _check_R_req="true"
                shift # past argument
                ;;
            -s|--Scala)
                _check_Scala_req="true"
                shift # past argument
                ;;
            *)
                shift # past argument
        esac
    done
fi

You, as a user, can specify whether you want to check additionally for R, Scala, and Maven dependencies. To do so, run the following code from your command line (the following code will check for all of them):

./checkRequirements.sh -s -m -r

The following is also a perfectly valid usage:

./checkRequirements.sh --Scala --Maven --R

Next, we call three functions: printHeader, checkJava, and checkPython. The printHeader function is nothing more than a simple way for the script to state what it does, and it is not that interesting, so we will skip it here; it is fairly self-explanatory, so you are welcome to peruse the relevant portion of the checkRequirements.sh script yourself.

Next, we will check whether Java is installed. First, we just print to the Terminal that we are performing checks on Java (this is common across all of our functions, so we will only mention it here):

function checkJava() {
    echo
    echo "##########################"
    echo
    echo "Checking Java"
    echo

Following this, we will check if the Java environment is installed on your machine:

if type -p java; then
    echo "Java executable found in PATH"
    _java=java
elif [[ -n "$JAVA_HOME" ]] && [[ -x "$JAVA_HOME/bin/java" ]]; then
    echo "Found Java executable in JAVA_HOME"
    _java="$JAVA_HOME/bin/java"
else
    echo "No Java found. Install Java version $_java_required or higher first or specify JAVA_HOME variable that will point to your Java binaries."
    exit
fi

First, we use the type command to check if the java command is available; the type -p command returns the location of the java binary if it exists. This also implies that the bin folder containing Java binaries has been added to the PATH.

If you are certain you have the binaries installed (be it Java, Python, R, Scala, or Maven), you can jump to the Updating PATH section in this recipe to see how to let your computer know where these binaries live.

If this fails, we will revert to checking if the JAVA_HOME environment variable is set, and if it is, we will try to see if it contains the required java binary: [[ -x "$JAVA_HOME/bin/java" ]]. Should this fail, the program will print the message that no Java environment could be found and will exit (without checking for other required packages, like Python).
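If you do need to set JAVA_HOME yourself, a typical setup looks like the following sketch; the path shown is only an example for Ubuntu's OpenJDK 8 package and will differ on your system:

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64   # example path; point this at your own JDK
export PATH="$JAVA_HOME/bin:$PATH"                    # make the java binary visible on PATH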

If, however, the Java binary is found, then we can check its version:

_java_version=$("$_java" -version 2>&1 | awk -F '"' '/version/ {print $2}')
echo "Java version: $_java_version (min.: $_java_required)"

if [[ "$_java_version" < "$_java_required" ]]; then
    echo "Java version required is $_java_required. Install the required version first."
    exit
fi
echo

We first execute the java -version command in the Terminal.
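The exact output depends on the Java distribution and build installed on your machine; with Java 8 it typically looks something like this (the version and build numbers below are only illustrative):

java version "1.8.0_181"
Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)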

We then pipe the previous output to awk to split (the -F switch) the rows at the quote '"' character (and will only use the first line of the output as we filter the rows down to those that contain /version/) and take the second (the $2) element as the version of the Java binaries installed on our machine. We will store it in the _java_version variable, which we also print to the screen using the echo command.

If you do not know what awk is or how to use it, we recommend this book from Packt: http://bit.ly/2BtTcBV.

Finally, we check if the _java_version we just obtained is lower than _java_required. If this evaluates to true, we will stop the execution, instead telling you to install the required version of Java. 
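One caveat worth noting: the < operator inside [[ ]] compares strings lexicographically. That works for the version strings above, but it can surprise you in edge cases, as the following small example shows:

# Lexicographic string comparison: "1.10" sorts before "1.8",
# even though version 1.10 is newer than 1.8
if [[ "1.10" < "1.8" ]]; then
    echo "1.10 is treated as lower than 1.8"
fi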

The logic implemented in the checkPython, checkR, checkScala, and checkMaven functions follows in a very similar way. The only differences are in what binary we call and in the way we check the versions:

For Python, we run "$_python" --version 2>&1 | awk -F ' ' '{print $2}', as checking the Python version (for the Anaconda distribution) would print out the following to the screen:

Python 3.5.2 :: Anaconda 2.4.1 (x86_64)

For R, we use "$_r" --version 2>&1 | awk -F ' ' '/R version/ {print $3}', as checking R's version would write (a lot) to the screen; we only use the line that starts with R version:

R version 3.4.2 (2017-09-28) -- "Short Summer"

For Scala, we utilize "$_scala" -version 2>&1 | awk -F ' ' '{print $5}', given that checking Scala's version prints the following:

Scala code runner version 2.11.8 -- Copyright 2002-2016, LAMP/EPFL

For Maven, we check "$_mvn" --version 2>&1 | awk -F ' ' '/Apache Maven/ {print $3}', as Maven prints out the following (and more!) when asked for its version:

Apache Maven 3.5.2 (138edd61fd100ec658bfa2d307c43b76940a5d7d; 2017-10-18T00:58:13-07:00)

If you want to learn more, you should now be able to read the other functions with ease.
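As an illustration of that pattern, here is a minimal sketch of what a checkPython function modeled on checkJava could look like. The messages and the _python_required variable are assumptions, not a verbatim excerpt from checkRequirements.sh:

# Sketch only: mirrors the structure of checkJava described above
function checkPython() {
    echo
    echo "##########################"
    echo
    echo "Checking Python"
    echo

    if type -p python; then
        echo "Python executable found in PATH"
        _python=python
    else
        echo "No Python found. Install Python version $_python_required or higher first."
        exit
    fi

    # Same version-extraction command as discussed above
    _python_version=$("$_python" --version 2>&1 | awk -F ' ' '{print $2}')
    echo "Python version: $_python_version (min.: $_python_required)"

    if [[ "$_python_version" < "$_python_required" ]]; then
        echo "Python version required is $_python_required. Install the required version first."
        exit
    fi
    echo
}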

There's more...

If any of your dependencies are not installed, you need to install them before continuing with the next recipe. It goes beyond the scope of this book to guide you step-by-step through the installation process of all of these, but here are some helpful links to show you how to do it.

Installing Java

Installing Java is pretty straightforward.

On macOS, go to https://www.java.com/en/download/mac_download.jsp and download the version appropriate for your system. Once downloaded, follow the instructions to install it on your machine. If you require more detailed instructions, check this link: http://bit.ly/2idEozX.

On Linux, check the following link for Java installation instructions: http://bit.ly/2jGwuz1.
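For instance, on Ubuntu (which we use throughout this book), installing OpenJDK 8 from the standard repositories is usually sufficient; this is a suggested shortcut, not the procedure from the link above:

# Install OpenJDK 8 on Ubuntu 16.04
sudo apt-get update
sudo apt-get install -y openjdk-8-jdk

# Verify the installation
java -version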

Installing Python

We have been using (and highly recommend) the Anaconda distribution of Python as it comes with the most commonly used packages included in the installer. It also ships with the conda package management tool, which makes installing other packages a breeze.

You can download Anaconda from http://www.continuum.io/downloads; select the appropriate version that will fulfill Spark's requirements. For macOS installation instructions, you can go to http://bit.ly/2zZPuUf, and for a Linux installation manual, you can go to http://bit.ly/2ASLUvg.
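For example, once Anaconda is installed, you can use conda to create an isolated environment for the recipes in this book; the environment name and Python version below are only examples:

# Create a dedicated environment with a Spark-compatible Python
conda create -n pyspark-cookbook python=3.5 jupyter

# Activate it and add any extra packages you need
source activate pyspark-cookbook
conda install numpy pandas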

Installing R

R is distributed via Comprehensive R Archive Network (CRAN). The macOS version can be downloaded from here, https://cran.r-project.org/bin/macosx/, whereas the Linux one is available here: https://cran.r-project.org/bin/linux/.