Combine the power of Apache Spark and Python to build effective big data applications
Key Features
Perform effective data processing, machine learning, and analytics using PySpark
Overcome challenges in developing and deploying Spark solutions using Python
Explore recipes for efficiently combining Python and Apache Spark to process data
Book Description
Apache Spark is an open source framework for efficient cluster computing with a strong interface for data parallelism and fault tolerance. The PySpark Cookbook presents effective and time-saving recipes for leveraging the power of Python and putting it to use in the Spark ecosystem.
You’ll start by learning the Apache Spark architecture and how to set up a Python environment for Spark. You’ll then get familiar with the modules available in PySpark and start using them effortlessly. In addition to this, you’ll discover how to abstract data with RDDs and DataFrames, and understand the streaming capabilities of PySpark. You’ll then move on to using ML and MLlib in order to solve any problems related to the machine learning capabilities of PySpark and use GraphFrames to solve graph-processing problems. Finally, you will explore how to deploy your applications to the cloud using the spark-submit command.
By the end of this book, you will be able to use the Python API for Apache Spark to solve any problems associated with building data-intensive applications.
What you will learn
Configure a local instance of PySpark in a virtual environment
Install and configure Jupyter in local and multi-node environments
Create DataFrames from JSON and a dictionary using pyspark.sql
Explore regression and clustering models available in the ML module
Use DataFrames to transform data used for modeling
Connect to PubNub and perform aggregations on streams
Who this book is for
The PySpark Cookbook is for you if you are a Python developer looking for hands-on recipes for using the Apache Spark 2.x ecosystem in the best possible way. A thorough understanding of Python (and some familiarity with Spark) will help you get the best out of the book.
Page count: 284
Year of publication: 2018
Copyright © 2018 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Commissioning Editor: Amey Varangaonkar
Acquisition Editor: Aman Singh
Content Development Editor: Mayur Pawanikar
Technical Editor: Dinesh Pawar
Copy Editor: Safis Editing
Project Coordinator: Nidhi Joshi
Proofreader: Safis Editing
Indexer: Mariammal Chettiyar
Graphics: Tania Dutta
Production Coordinator: Shantanu Zagade
First published: June 2018
Production reference: 1280618
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.
ISBN 978-1-78883-536-7
www.packtpub.com
Mapt is an online digital library that gives you full access to over 5,000 books and videos, as well as industry-leading tools to help you plan your personal development and advance your career. For more information, please visit our website.
Spend less time learning and more time coding with practical eBooks and Videos from over 4,000 industry professionals
Improve your learning with Skill Plans built especially for you
Get a free eBook or video every month
Mapt is fully searchable
Copy and paste, print, and bookmark content
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and, as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.
Denny Lee is a technology evangelist at Databricks. He is a hands-on data science engineer with 15+ years of experience. His key focuses are solving complex large-scale data problems—providing not only architectural direction but hands-on implementation of such systems. He has extensive experience of building greenfield teams as well as being a turnaround/change catalyst. Prior to joining Databricks, he was a senior director of data science engineering at Concur and was part of the incubation team that built Hadoop on Windows and Azure (currently known as HDInsight).
Tomasz Drabas is a data scientist specializing in data mining, deep learning, machine learning, choice modeling, natural language processing, and operations research. He is the author of Learning PySpark and Practical Data Analysis Cookbook. He has a PhD from University of New South Wales, School of Aviation. His research areas are machine learning and choice modeling for airline revenue management.
Sridhar Alla is a big data practitioner who helps companies solve complex problems in distributed computing and implement large-scale data science and analytics practices. He presents regularly at several prestigious conferences and provides training and consulting to companies. He loves writing code in Python, Scala, and Java. He has extensive hands-on knowledge of several Hadoop-based technologies, Spark, machine learning, deep learning, and blockchain.
If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.
Title Page
Copyright and Credits
PySpark Cookbook
Packt Upsell
Why subscribe?
PacktPub.com
Contributors
About the authors
About the reviewer
Packt is searching for authors like you
Preface
Who this book is for
What this book covers
To get the most out of this book
Download the example code files
Download the color images
Conventions used
Sections
Getting ready
How to do it...
How it works...
There's more...
See also
Get in touch
Reviews
Installing and Configuring Spark
Introduction
Installing Spark requirements
Getting ready
How to do it...
How it works...
There's more...
Installing Java
Installing Python
Installing R
Installing Scala
Installing Maven
Updating PATH
Installing Spark from sources
Getting ready
How to do it...
How it works...
There's more...
See also
Installing Spark from binaries
Getting ready
How to do it...
How it works...
There's more...
Configuring a local instance of Spark
Getting ready
How to do it...
How it works...
See also
Configuring a multi-node instance of Spark
Getting ready
How to do it...
How it works...
See also
Installing Jupyter
Getting ready
How to do it...
How it works...
There's more...
See also
Configuring a session in Jupyter
Getting ready
How to do it...
How it works...
There's more...
See also
Working with Cloudera Spark images
Getting ready
How to do it...
How it works...
Abstracting Data with RDDs
Introduction
Creating RDDs
Getting ready 
How to do it...
How it works...
Spark context parallelize method
.take(...) method
Reading data from files
Getting ready 
How to do it...
How it works...
.textFile(...) method
.map(...) method
Partitions and performance
Overview of RDD transformations
Getting ready
How to do it...
.map(...) transformation
.filter(...) transformation
.flatMap(...) transformation
.distinct() transformation
.sample(...) transformation
.join(...) transformation
.repartition(...) transformation
.zipWithIndex() transformation
.reduceByKey(...) transformation
.sortByKey(...) transformation
.union(...) transformation
.mapPartitionsWithIndex(...) transformation
How it works...
Overview of RDD actions
Getting ready
How to do it...
.take(...) action
.collect() action
.reduce(...) action
.count() action
.saveAsTextFile(...) action
How it works...
Pitfalls of using RDDs
Getting ready
How to do it...
How it works...
Abstracting Data with DataFrames
Introduction
Creating DataFrames
Getting ready
How to do it...
How it works...
There's more...
From JSON
From CSV
See also
Accessing underlying RDDs
Getting ready
How to do it...
How it works...
Performance optimizations
Getting ready
How to do it...
How it works...
There's more...
See also
Inferring the schema using reflection
Getting ready
How to do it...
How it works...
See also
Specifying the schema programmatically
Getting ready
How to do it...
How it works...
See also
Creating a temporary table
Getting ready
How to do it...
How it works...
There's more...
Using SQL to interact with DataFrames
Getting ready
How to do it...
How it works...
There's more...
Overview of DataFrame transformations
Getting ready
How to do it...
The .select(...) transformation
The .filter(...) transformation
The .groupBy(...) transformation
The .orderBy(...) transformation
The .withColumn(...) transformation
The .join(...) transformation
The .unionAll(...) transformation
The .distinct(...) transformation
The .repartition(...) transformation
The .fillna(...) transformation
The .dropna(...) transformation
The .dropDuplicates(...) transformation
The .summary() and .describe() transformations
The .freqItems(...) transformation
See also
Overview of DataFrame actions
Getting ready
How to do it...
The .show(...) action
The .collect() action
The .take(...) action
The .toPandas() action
See also
Preparing Data for Modeling
Introduction
Handling duplicates
Getting ready
How to do it...
How it works...
There's more...
Only IDs differ
ID collisions
Handling missing observations
Getting ready
How to do it...
How it works...
Missing observations per row
Missing observations per column
There's more...
See also
Handling outliers
Getting ready
How to do it...
How it works...
See also
Exploring descriptive statistics
Getting ready
How to do it...
How it works...
There's more...
Descriptive statistics for aggregated columns
See also
Computing correlations
Getting ready
How to do it...
How it works...
There's more...
Drawing histograms
Getting ready
How to do it...
How it works...
There's more...
See also
Visualizing interactions between features
Getting ready
How to do it...
How it works...
There's more...
Machine Learning with MLlib
Loading the data
Getting ready
How to do it...
How it works...
There's more...
Exploring the data
Getting ready
How to do it...
How it works...
Numerical features
Categorical features
There's more...
See also
Testing the data
Getting ready
How to do it...
How it works...
See also
Transforming the data
Getting ready
How to do it...
How it works...
There's more...
See also
Standardizing the data
Getting ready
How to do it...
How it works...
Creating an RDD for training
Getting ready
How to do it...
Classification
Regression
How it works...
There's more...
See also
Predicting hours of work for census respondents
Getting ready
How to do it...
How it works...
Forecasting the income levels of census respondents
Getting ready
How to do it...
How it works...
There's more...
Building a clustering model
Getting ready
How to do it...
How it works...
There's more...
See also
Computing performance statistics
Getting ready
How to do it...
How it works...
Regression metrics
Classification metrics
See also
Machine Learning with the ML Module
Introducing Transformers
Getting ready
How to do it...
How it works...
There's more...
See also
Introducing Estimators
Getting ready
How to do it...
How it works...
There's more...
Introducing Pipelines
Getting ready
How to do it...
How it works...
See also
Selecting the most predictable features
Getting ready
How to do it...
How it works...
There's more...
See also
Predicting forest coverage types
Getting ready
How to do it...
How it works...
There's more...
Estimating forest elevation
Getting ready
How to do it...
How it works...
There's more...
Clustering forest cover types
Getting ready
How to do it...
How it works...
See also
Tuning hyperparameters
Getting ready
How to do it...
How it works...
There's more...
Extracting features from text
Getting ready
How to do it...
How it works...
There's more...
See also
Discretizing continuous variables
Getting ready
How to do it...
How it works...
Standardizing continuous variables
Getting ready
How to do it...
How it works...
Topic mining
Getting ready
How to do it...
How it works...
Structured Streaming with PySpark
Introduction
Understanding Spark Streaming
Understanding DStreams
Getting ready
How to do it...
Terminal 1 – Netcat window
Terminal 2 – Spark Streaming window
How it works...
There's more...
Understanding global aggregations
Getting ready
How to do it...
Terminal 1 – Netcat window
Terminal 2 – Spark Streaming window
How it works...
Continuous aggregation with structured streaming
Getting ready
How to do it...
Terminal 1 – Netcat window
Terminal 2 – Spark Streaming window
How it works...
GraphFrames – Graph Theory with PySpark
Introduction
Installing GraphFrames
Getting ready
How to do it...
How it works...
Preparing the data
Getting ready
How to do it...
How it works...
There's more...
Building the graph
How to do it...
How it works...
Running queries against the graph
Getting ready
How to do it...
How it works...
Understanding the graph
Getting ready
How to do it...
How it works...
Using PageRank to determine airport ranking
Getting ready
How to do it...
How it works...
Finding the fewest number of connections
Getting ready
How to do it...
How it works...
There's more...
See also
Visualizing the graph
Getting ready
How to do it...
How it works...
Apache Spark is an open source framework for efficient cluster computing with a strong interface for data parallelism and fault tolerance. This book presents effective and time-saving recipes for leveraging the power of Python and putting it to use in the Spark ecosystem.
You'll start by learning about the Apache Spark architecture and seeing how to set up a Python environment for Spark. You'll then get familiar with the modules available in PySpark and start using them effortlessly. In addition to this, you'll discover how to abstract data with RDDs and DataFrames, and understand the streaming capabilities of PySpark. You'll then move on to using ML and MLlib in order to solve any problems related to the machine learning capabilities of PySpark, and you'll use GraphFrames to solve graph-processing problems. Finally, you will explore how to deploy your applications to the cloud using the spark-submit command.
By the end of this book, you will be able to use the Python API for Apache Spark to solve any problems associated with building data-intensive applications.
This book is for you if you are a Python developer looking for hands-on recipes for using the Apache Spark 2.x ecosystem in the best possible way. A thorough understanding of Python (and some familiarity with Spark) will help you get the best out of the book.
Chapter 1, Installing and Configuring Spark, shows us how to install and configure Spark, either as a local instance, as a multi-node cluster, or in a virtual environment.
Chapter 2, Abstracting Data with RDDs, covers how to work with Apache Spark Resilient Distributed Datasets (RDDs).
Chapter 3, Abstracting Data with DataFrames, explores the current fundamental data structure—DataFrames.
Chapter 4, Preparing Data for Modeling, covers how to clean up your data and prepare it for modeling.
Chapter 5, Machine Learning with MLlib, shows how to build machine learning models with PySpark's MLlib module.
Chapter 6, Machine Learning with the ML Module, moves on to the currently supported machine learning module of PySpark—the ML module.
Chapter 7, Structured Streaming with PySpark, covers how to work with Apache Spark structured streaming within PySpark.
Chapter 8, GraphFrames – Graph Theory with PySpark, shows how to work with GraphFrames for Apache Spark.
You need the following to smoothly work through the chapters:
Apache Spark (downloadable from http://spark.apache.org/downloads.html)
Python
You can download the example code files for this book from your account at www.packtpub.com. If you purchased this book elsewhere, you can visit www.packtpub.com/support and register to have the files emailed directly to you.
You can download the code files by following these steps:
Log in or register at www.packtpub.com.
Select the SUPPORT tab.
Click on Code Downloads & Errata.
Enter the name of the book in the Search box and follow the onscreen instructions.
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
WinRAR/7-Zip for Windows
Zipeg/iZip/UnRarX for Mac
7-Zip/PeaZip for Linux
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/PySpark-Cookbook. In case there's an update to the code, it will be updated on the existing GitHub repository.
We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: https://www.packtpub.com/sites/default/files/downloads/PySparkCookbook_ColorImages.pdf.
In this book, you will find several headings that appear frequently (Getting ready, How to do it..., How it works..., There's more..., and See also).
To give clear instructions on how to complete a recipe, use these sections as follows:
This section tells you what to expect in the recipe and describes how to set up any software or any preliminary settings required for the recipe.
This section contains the steps required to follow the recipe.
This section usually consists of a detailed explanation of what happened in the previous section.
This section consists of additional information about the recipe in order to make you more knowledgeable about the recipe.
This section provides helpful links to other useful information for the recipe.
Feedback from our readers is always welcome.
General feedback: Email [email protected] and mention the book title in the subject of your message. If you have questions about any aspect of this book, please email us at [email protected].
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/submit-errata, select your book, click on the Errata Submission Form link, and enter the details.
Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.
Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!
For more information about Packt, please visit packtpub.com.
In this chapter, we will cover how to install and configure Spark, either as a local instance, a multi-node cluster, or in a virtual environment. You will learn the following recipes:
Installing Spark requirements
Installing Spark from sources
Installing Spark from binaries
Configuring a local instance of Spark
Configuring a multi-node instance of Spark
Installing Jupyter
Configuring a session in Jupyter
Working with Cloudera Spark images
We cannot begin a book on Spark (well, on PySpark) without first specifying what Spark is. Spark is a powerful, flexible, open source data processing and querying engine. It is extremely easy to use and provides the means to solve a huge variety of problems, ranging from processing unstructured, semi-structured, and structured data, through streaming, to machine learning. With over 1,000 contributors from over 250 organizations (not to mention over 3,000 Spark Meetup community members worldwide), Spark is now one of the largest open source projects in the portfolio of the Apache Software Foundation.
The origins of Spark date back to 2009, when Matei Zaharia developed the first versions of the Spark processing engine at UC Berkeley as part of his PhD thesis; it was open sourced in 2010. Since then, Spark has become extremely popular, and its popularity stems from a number of reasons:
It is fast: It is estimated that Spark is 100 times faster than Hadoop when working purely in memory, and around 10 times faster when reading or writing data to a disk.
It is flexible: You can leverage the power of Spark from a number of programming languages; Spark natively supports interfaces in Scala, Java, Python, and R.
It is extendible: As Spark is an open source package, you can easily extend it by introducing your own classes or extending the existing ones.
It is powerful: Many machine learning algorithms are already implemented in Spark so you do not need to add more tools to your stack—most of the data engineering and data science tasks can be accomplished while working in a single environment.
It is familiar: Data scientists and data engineers, who are accustomed to using Python's pandas, or R's data.frames or data.tables, should have a much gentler learning curve (although the differences between these data types exist). Moreover, if you know SQL, you can also use it to wrangle data in Spark!
It is scalable: Spark can run locally on your machine (with all the limitations such a solution entails). However, the same code that runs locally can be deployed to a cluster of thousands of machines with little-to-no changes.
For the remainder of this book, we will assume that you are working in a Unix-like environment such as Linux (throughout this book, we will use Ubuntu Server 16.04 LTS) or macOS (running macOS High Sierra); all the code provided has been tested in these two environments. For this chapter (and some other ones, too), an internet connection is also required as we will be downloading a bunch of binaries and sources from the internet.
Knowing how to use the command line and how to set some environment variables on your system is useful, but not really required—we will guide you through the steps.
Spark requires a handful of environments to be present on your machine before you can install and use it. In this recipe, we will focus on getting your machine ready for Spark installation.
To execute this recipe, you will need a bash Terminal and an internet connection.
Also, before we start any work, you should clone the GitHub repository for this book. The repository contains all the code (in the form of notebooks) and all the data you will need to follow the examples in this book. To clone the repository, go to http://bit.ly/2ArlBck, click on the Clone or download button, and copy the URL that shows up by clicking on the icon next to it.
Next, go to your Terminal and issue the following command:
git clone git@github.com:drabastomek/PySparkCookbook.git
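The preceding command clones over SSH and assumes you have SSH keys configured for your GitHub account. If you do not, cloning the same repository over HTTPS should work equally well; for example:

# alternative clone over HTTPS (no SSH keys required)
git clone https://github.com/drabastomek/PySparkCookbook.git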
If your git environment is set up properly, the whole GitHub repository should clone to your disk. No other prerequisites are required.
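Before we walk through the checkRequirements.sh script, it helps to keep in mind the minimum versions it checks against. The following is only a minimal sketch of the variables holding those values; only _java_required appears verbatim in the excerpts below, so the remaining names are assumptions made by analogy:

# minimum package versions required by Spark 2.3.1 (sketch; names assumed)
_java_required=1.8
_python_required=3.4
_r_required=3.1
_scala_required=2.11
_mvn_required=3.3.9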
First, we specify all the required packages and their minimum versions; looking at the preceding code, you can see that Spark 2.3.1 requires Java 1.8+ and Python 3.4 or higher (and we will always check for these two environments). Additionally, if you want to use R or Scala, the minimal requirements for these two packages are 3.1 and 2.11, respectively. Maven, as mentioned earlier, will be used to compile the Spark sources, and for that, Spark requires at least version 3.3.9 of Maven.
Next, we parse the command-line arguments:
if [ "$_args_len" -ge 0 ]; then while [[ "$#" -gt 0 ]] do key="$1" case $key in -m|--Maven) _check_Maven_req="true" shift # past argument ;; -r|--R) _check_R_req="true" shift # past argument ;; -s|--Scala) _check_Scala_req="true" shift # past argument ;; *) shift # past argument esac donefi
You, as a user, can specify whether you want to check additionally for R, Scala, and Maven dependencies. To do so, run the following code from your command line (the following code will check for all of them):
./checkRequirements.sh -s -m -r
The following is also a perfectly valid usage:
./checkRequirements.sh --Scala --Maven --R
Next, we call three functions: printHeader, checkJava, and checkPython. The printHeader function is nothing more than a simple way for the script to state what it does, so we will skip it here; it is fairly self-explanatory, and you are welcome to peruse the relevant portion of the checkRequirements.sh script yourself.
Next, we will check whether Java is installed. First, we just print to the Terminal that we are performing checks on Java (this is common across all of our functions, so we will only mention it here):
function checkJava() {
  echo
  echo "##########################"
  echo
  echo "Checking Java"
  echo
Following this, we will check if the Java environment is installed on your machine:
if type -p java; then
    echo "Java executable found in PATH"
    _java=java
elif [[ -n "$JAVA_HOME" ]] && [[ -x "$JAVA_HOME/bin/java" ]]; then
    echo "Found Java executable in JAVA_HOME"
    _java="$JAVA_HOME/bin/java"
else
    echo "No Java found. Install Java version $_java_required or higher first or specify JAVA_HOME variable that will point to your Java binaries."
    exit
fi
First, we use the type command to check if the java command is available; the type -p command returns the location of the java binary if it exists. This also implies that the bin folder containing Java binaries has been added to the PATH.
If this fails, we will revert to checking if the JAVA_HOME environment variable is set, and if it is, we will try to see if it contains the required java binary: [[ -x "$JAVA_HOME/bin/java" ]]. Should this fail, the program will print the message that no Java environment could be found and will exit (without checking for other required packages, like Python).
If, however, the Java binary is found, then we can check its version:
_java_version=$("$_java" -version 2>&1 | awk -F '"' '/version/ {print $2}')
echo "Java version: $_java_version (min.: $_java_required)"

if [[ "$_java_version" < "$_java_required" ]]; then
    echo "Java version required is $_java_required. Install the required version first."
    exit
fi
echo
We first execute the java -version command in the Terminal, which would normally produce output similar to the following:
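The exact text depends on which Java distribution is installed; on a machine with Oracle Java 8, for example, it looks roughly like this (the build numbers below are only illustrative):

java version "1.8.0_144"
Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)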
We then pipe that output to awk, which splits the rows at the quote (") character (the -F switch), keeps only the rows that match /version/ (effectively the first line of the output), and prints the second element ($2) as the version of the Java binaries installed on our machine. We store it in the _java_version variable and also print it to the screen using the echo command.
Finally, we check whether the _java_version we just obtained is lower than _java_required. If it is, we stop the execution and tell you to install the required version of Java.
The logic implemented in the checkPython, checkR, checkScala, and checkMaven functions follows a very similar pattern. The only differences are the binary we call and the way we extract the version:
For Python, we run "$_python" --version 2>&1 | awk -F ' ' '{print $2}', as checking the Python version (for the Anaconda distribution) would print out the following to the screen:
Python 3.5.2 :: Anaconda 2.4.1 (x86_64)
For R, we use "$_r" --version 2>&1 | awk -F ' ' '/R version/ {print $3}', as checking R's version would write (a lot) to the screen; we only use the line that starts with R version:
R version 3.4.2 (2017-09-28) -- "Short Summer"
For Scala, we utilize "$_scala" -version 2>&1 | awk -F ' ' '{print $5}', given that checking Scala's version prints the following:
Scala code runner version 2.11.8 -- Copyright 2002-2016, LAMP/EPFL
For Maven, we check "$_mvn" --version 2>&1 | awk -F ' ' '/Apache Maven/ {print $3}', as Maven prints out the following (and more!) when asked for its version:
Apache Maven 3.5.2 (138edd61fd100ec658bfa2d307c43b76940a5d7d; 2017-10-18T00:58:13-07:00)
If you want to learn more, you should now be able to read the other functions with ease.
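To make this concrete, a function such as checkPython could be pieced together from the same building blocks as checkJava. The following is only a rough sketch based on the patterns shown above, not the exact code from the repository:

function checkPython() {
  echo
  echo "##########################"
  echo
  echo "Checking Python"
  echo

  # locate the Python binary on the PATH (the real script may check other locations too)
  if type -p python; then
    echo "Python executable found in PATH"
    _python=python
  else
    echo "No Python found. Install Python version $_python_required or higher first."
    exit
  fi

  # extract the version number using the awk one-liner described above
  _python_version=$("$_python" --version 2>&1 | awk -F ' ' '{print $2}')
  echo "Python version: $_python_version (min.: $_python_required)"

  if [[ "$_python_version" < "$_python_required" ]]; then
    echo "Python version required is $_python_required. Install the required version first."
    exit
  fi
  echo
}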
If any of your dependencies are not installed, you need to install them before continuing with the next recipe. It goes beyond the scope of this book to guide you step-by-step through the installation process of all of these, but here are some helpful links to show you how to do it.
Installing Java is pretty straightforward.
On macOS, go to https://www.java.com/en/download/mac_download.jsp and download the version appropriate for your system. Once downloaded, follow the instructions to install it on your machine. If you require more detailed instructions, check this link: http://bit.ly/2idEozX.
On Linux, check the following link for Java installation instructions: http://bit.ly/2jGwuz1.
We have been using (and highly recommend) the Anaconda version of Python as it comes with the most commonly used packages included with the installer. It also comes built-in with the conda package management tool that makes installing other packages a breeze.
You can download Anaconda from http://www.continuum.io/downloads; select the appropriate version that will fulfill Spark's requirements. For macOS installation instructions, you can go to http://bit.ly/2zZPuUf, and for a Linux installation manual, you can go to http://bit.ly/2ASLUvg.
R is distributed via the Comprehensive R Archive Network (CRAN). The macOS version can be downloaded from https://cran.r-project.org/bin/macosx/, whereas the Linux one is available at https://cran.r-project.org/bin/linux/.
