Frank Kane's Taming Big Data with Apache Spark and Python

Frank Kane
Description

Frank Kane’s Taming Big Data with Apache Spark and Python is your companion to learning Apache Spark in a hands-on manner. Frank will start you off by teaching you how to set up Spark on a single system or on a cluster, and you’ll soon move on to analyzing large data sets using Spark RDD, and developing and running effective Spark jobs quickly using Python.

Apache Spark has emerged as the next big thing in the Big Data domain – quickly rising from an ascending technology to an established superstar in just a matter of years. Spark allows you to quickly extract actionable insights from large amounts of data, on a real-time basis, making it an essential tool in many modern businesses.

Frank has packed this book with over 15 interactive, fun-filled examples relevant to the real world, and he will empower you to understand the Spark ecosystem and implement production-grade real-time Spark projects with ease.




Frank Kane's Taming Big Data with Apache Spark and Python

Real-world examples to help you analyze large datasets with Apache Spark

Frank Kane

BIRMINGHAM - MUMBAI


Frank Kane's Taming Big Data with Apache Spark and Python

 

 

Copyright © 2017 Packt Publishing

 

 

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, nor its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

 

First published: June 2017

 

Production reference: 1290617

 

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham 
B3 2PB, UK.

 

ISBN 978-1-78728-794-5

www.packtpub.com

Credits

Author

Frank Kane

Project Coordinator

Suzanne Coutinho

Commissioning Editor

Ben Renow-Clarke

Proofreader

Safis Editing

Acquisition Editor

Ben Renow-Clarke

Indexer

Aishwarya Gangawane

Content Development Editor

Monika Sangwan

Graphics

Kirk D'Penha

Technical Editor

Nidhisha Shetty

Production Coordinator

Arvindkumar Gupta

Copy Editor

Tom Jacob

About the Author

My name is Frank Kane. I spent nine years at amazon.com and imdb.com, wrangling millions of customer ratings and customer transactions to produce things such as personalized recommendations for movies and products and "people who bought this also bought." I tell you, I wish we had Apache Spark back then, when I spent years trying to solve these problems there. I hold 17 issued patents in the fields of distributed computing, data mining, and machine learning. In 2012, I left to start my own successful company, Sundog Software, which focuses on virtual reality environment technology, and teaching others about big data analysis.

www.PacktPub.com

For support files and downloads related to your book, please visit www.PacktPub.com.

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

https://www.packtpub.com/mapt

Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.

Why subscribe?

Fully searchable across every book published by Packt

Copy and paste, print, and bookmark content

On demand and accessible via a web browser

Customer Feedback

Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial process. To help us improve, please leave us an honest review on this book's Amazon page at https://www.amazon.com/dp/1787287947.

If you'd like to join our team of regular reviewers, you can e-mail us at [email protected]. We award our regular reviewers with free eBooks and videos in exchange for their valuable feedback. Help us be relentless in improving our products!

Table of Contents

Preface

What this book covers

What you need for this book

Who this book is for

Conventions

Reader feedback

Customer support

Downloading the example code

Downloading the color images of this book

Errata

Piracy

Questions

Getting Started with Spark

Getting set up - installing Python, a JDK, and Spark and its dependencies

Installing Enthought Canopy

Installing the Java Development Kit

Installing Spark

Running Spark code

Installing the MovieLens movie rating dataset

Run your first Spark program - the ratings histogram example

Examining the ratings counter script

Running the ratings counter script

Summary

Spark Basics and Spark Examples

What is Spark?

Spark is scalable

Spark is fast

Spark is hot

Spark is not that hard

Components of Spark

Using Python with Spark

The Resilient Distributed Dataset (RDD)

What is the RDD?

The SparkContext object

Creating RDDs

Transforming RDDs

Map example

RDD actions

Ratings histogram walk-through

Understanding the code

Setting up the SparkContext object

Loading the data

Extract (MAP) the data we care about

Perform an action - count by value

Sort and display the results

Looking at the ratings-counter script in Canopy

Key/value RDDs and the average friends by age example

Key/value concepts - RDDs can hold key/value pairs

Creating a key/value RDD

What can Spark do with key/value data?

Mapping the values of a key/value RDD

The friends by age example

Parsing (mapping) the input data

Counting up the sum of friends and number of entries per age

Compute averages

Collect and display the results

Running the average friends by age example

Examining the script

Running the code

Filtering RDDs and the minimum temperature by location example

What is filter()

The source data for the minimum temperature by location example

Parse (map) the input data

Filter out all but the TMIN entries

Create (station ID, temperature) key/value pairs

Find minimum temperature by station ID

Collect and print results

Running the minimum temperature example and modifying it for maximums

Examining the min-temperatures script

Running the script

Running the maximum temperature by location example

Counting word occurrences using flatMap()

map() versus flatMap()

map()

flatMap()

Code sample - count the words in a book

Improving the word-count script with regular expressions

Text normalization

Examining the use of regular expressions in the word-count script

Running the code

Sorting the word count results

Step 1 - Implement countByValue() the hard way to create a new RDD

Step 2 - Sort the new RDD

Examining the script

Running the code

Find the total amount spent by customer

Introducing the problem

Strategy for solving the problem

Useful snippets of code

Check your results and sort them by the total amount spent

Check your sorted implementation and results against mine

Summary

Advanced Examples of Spark Programs

Finding the most popular movie

Examining the popular-movies script

Getting results

Using broadcast variables to display movie names instead of ID numbers

Introducing broadcast variables

Examining the popular-movies-nicer.py script

Getting results

Finding the most popular superhero in a social graph

Superhero social networks

Input data format

Strategy

Running the script - discover who the most popular superhero is

Mapping input data to (hero ID, number of co-occurrences) per line

Adding up co-occurrence by hero ID

Flipping the (map) RDD to (number, hero ID)

Using max() and looking up the name of the winner

Getting results

Superhero degrees of separation - introducing the breadth-first search algorithm

Degrees of separation

How does the breadth-first search algorithm work?

The initial condition of our social graph

First pass through the graph

Second pass through the graph

Third pass through the graph

Final pass through the graph

Accumulators and implementing BFS in Spark

Convert the input file into structured data

Writing code to convert Marvel-Graph.txt to BFS nodes

Iteratively process the RDD

Using a mapper and a reducer

How do we know when we're done?

Superhero degrees of separation - review the code and run it

Setting up an accumulator and using the convert to BFS function

Calling flatMap()

Calling an action

Calling reduceByKey

Getting results

Item-based collaborative filtering in Spark, cache(), and persist()

How does item-based collaborative filtering work?

Making item-based collaborative filtering a Spark problem

It's getting real

Caching RDDs

Running the similar-movies script using Spark's cluster manager

Examining the script

Getting results

Improving the quality of the similar movies example

Summary

Running Spark on a Cluster

Introducing Elastic MapReduce

Why use Elastic MapReduce?

Warning - Spark on EMR is not cheap

Setting up our Amazon Web Services / Elastic MapReduce account and PuTTY

Partitioning

Using .partitionBy()

Choosing a partition size

Creating similar movies from one million ratings - part 1

Changes to the script

Creating similar movies from one million ratings - part 2

Our strategy

Specifying memory per executor

Specifying a cluster manager

Running on a cluster

Setting up to run the movie-similarities-1m.py script on a cluster

Preparing the script

Creating a cluster

Connecting to the master node using SSH

Running the code

Creating similar movies from one million ratings – part 3

Assessing the results

Terminating the cluster

Troubleshooting Spark on a cluster

More troubleshooting and managing dependencies

Troubleshooting

Managing dependencies

Summary

SparkSQL, DataFrames, and DataSets

Introducing SparkSQL

Using SparkSQL in Python

More things you can do with DataFrames

Differences between DataFrames and DataSets

Shell access in SparkSQL

User-defined functions (UDFs)

Executing SQL commands and SQL-style functions on a DataFrame

Using SQL-style functions instead of queries

Using DataFrames instead of RDDs

Summary

Other Spark Technologies and Libraries

Introducing MLlib

MLlib capabilities

Special MLlib data types

For more information on machine learning

Making movie recommendations

Using MLlib to produce movie recommendations

Examining the movie-recommendations-als.py script

Analyzing the ALS recommendations results

Why did we get bad results?

Using DataFrames with MLlib

Examining the spark-linear-regression.py script

Getting results

Spark Streaming and GraphX

What is Spark Streaming?

GraphX

Summary

Where to Go From Here? – Learning More About Spark and Data Science

Preface

We will do some really quick housekeeping here, just so you know where to put all the stuff for this book. First, I want you to go to your hard drive, create a new folder called SparkCourse, and put it somewhere you'll remember:

For me, I put that in my C drive in a folder called SparkCourse. This is where you're going to put everything for this book. As you go through the individual sections of this book, you'll see that there are resources provided for each one. There can be different kinds of resources, files, and downloads. When you download them, make sure you put them in this folder that you have created. This is the ultimate destination of everything you're going to download for this book, as you can see in my SparkCourse folder, shown in the following screenshot; you'll just accumulate all this stuff over time as you work your way through it:

So, remember where you put it all; you might need to refer to these files by their path, in this case, C:\SparkCourse. Just make sure you download them to a consistent place and you should be good to go. Also, be cognizant of the differences in file paths between operating systems. If you're on Mac or Linux, you're not going to have a C drive; you'll just have a slash and the full path name, and capitalization matters in those paths, while it doesn't on Windows. Using forward slashes instead of backslashes in paths is another difference between other operating systems and Windows. So if you are using something other than Windows, just remember these differences and don't let them trip you up. If you see a path to a file in a script, make sure you adjust it according to where you put these files and what your operating system is.
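If you'd like to sanity-check the path from Python later on, a minimal sketch looks like this (the folder names are just the ones I suggested above; adjust them to wherever you actually put things):

import os

# Hypothetical locations for the course folder; adjust to your own setup.
windows_path = os.path.join("C:\\", "SparkCourse")   # C:\SparkCourse on Windows
unix_path = os.path.expanduser("~/SparkCourse")      # e.g. /home/you/SparkCourse on Mac/Linux

# os.path.exists() tells you whether the folder is really where you think it is.
print(os.path.exists(windows_path) or os.path.exists(unix_path))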

What this book covers

Chapter 1, Getting Started with Spark, covers basic installation instructions for Spark and its related software. This chapter illustrates a simple example of data analysis of real movie ratings data provided by different sets of people.

Chapter 2, Spark Basics and Spark Examples, provides a brief overview of what Spark is all about, who uses it, how it helps in analyzing big data, and why it is so popular.

Chapter 3, Advanced Examples of Spark Programs, illustrates some more advanced and complex examples with Spark.

Chapter 4, Running Spark on a Cluster, talks about Spark Core, covering the things you can do with Spark, such as running Spark on a cluster in the cloud and analyzing real data on that cluster.

Chapter 5, SparkSQL, DataFrames, and DataSets, introduces SparkSQL, which is an important concept of Spark, and explains how to deal with structured data formats using it.

Chapter 6, Other Spark Technologies and Libraries, talks about MLlib (Machine Learning library), which is very helpful if you want to work on data mining or machine learning-related jobs with Spark. This chapter also covers Spark Streaming and GraphX, technologies built on top of Spark.

Chapter 7, Where to Go From Here? - Learning More About Spark and Data Science, points to further books and resources on Spark for readers who want to learn more about the topic.

What you need for this book

For this book, you'll need a Python development environment (Python 3.5 or newer), the Enthought Canopy installer, a Java Development Kit, and, of course, Spark itself (Spark 2.0 or newer).

We'll show you how to install this software in the first chapter of the book.

This book is based on the Windows operating system, so installation instructions are given for it. If you are on Mac or Linux, you can follow the URL http://media.sundog-soft.com/spark-python-install.pdf, which contains written instructions on getting everything set up on macOS and Linux.

Who this book is for

I wrote this book for people who have at least some programming or scripting experience in their background. We're going to be using the Python programming language throughout this book, which is very easy to pick up, and I'm going to give you over 15 real hands-on examples of Spark Python scripts that you can run yourself, mess around with, and learn from. So, by the end of this book, you should have the skills needed to actually turn business problems into Spark problems, code up that Spark code on your own, and actually run it in the cluster on your own. 

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book, what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of. To send us general feedback, simply e-mail [email protected], and mention the book's title in the subject of your message. If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you get the most from your purchase.

Downloading the example code

You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you. You can download the code files by following these steps:

1. Log in or register to our website using your e-mail address and password.

2. Hover the mouse pointer on the SUPPORT tab at the top.

3. Click on Code Downloads & Errata.

4. Enter the name of the book in the Search box.

5. Select the book for which you're looking to download the code files.

6. Choose from the drop-down menu where you purchased this book from.

7. Click on Code Download.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

WinRAR / 7-Zip for Windows

Zipeg / iZip / UnRarX for Mac

7-Zip / PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Frank-Kanes-Taming-Big-Data-with-Apache-Spark-and-Python. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Downloading the color images of this book

We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from https://www.packtpub.com/sites/default/files/downloads/FrankKanesTamingBigDatawithApacheSparkandPython_ColorImages.pdf.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books (maybe a mistake in the text or the code), we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title. To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy. Please contact us at [email protected] with a link to the suspected pirated material. We appreciate your help in protecting our authors and our ability to bring you valuable content.

Questions

If you have a problem with any aspect of this book, you can contact us at [email protected], and we will do our best to address the problem.

Getting Started with Spark

Spark is one of the hottest technologies in big data analysis right now, and with good reason. If you work for, or you hope to work for, a company that has massive amounts of data to analyze, Spark offers a very fast and very easy way to analyze that data across an entire cluster of computers and spread that processing out. This is a very valuable skill to have right now.

My approach in this book is to start with some simple examples and work our way up to more complex ones. We'll have some fun along the way too. We will use movie ratings data and play around with similar movies and movie recommendations. I also found a social network of superheroes, if you can believe it; we can use this data to do things such as figure out who's the most popular superhero in the fictional superhero universe. Have you heard of the Kevin Bacon number, where everyone in Hollywood is supposedly connected to Kevin Bacon within some number of steps? We can do the same thing with our superhero data and figure out the degrees of separation between any two superheroes in their fictional universe too. So, we'll have some fun along the way and use some real examples here and turn them into Spark problems. Using Apache Spark is easier than you might think and, with all the exercises and activities in this book, you'll get plenty of practice as we go along. I'll guide you through every line of code and every concept you need along the way. So let's get started and learn Apache Spark.

Getting set up - installing Python, a JDK, and Spark and its dependencies

Let's get you started. There is a lot of software we need to set up. Running Spark on Windows involves a lot of moving pieces, so make sure you follow along carefully, or else you'll have some trouble. I'll try to walk you through it as easily as I can. Now, this chapter is written for Windows users. This doesn't mean that you're out of luck if you're on Mac or Linux though. If you open up the download package for the book or go to this URL, http://media.sundog-soft.com/spark-python-install.pdf, you will find written instructions on getting everything set up on Windows, macOS, and Linux. So, again, Windows users can read through this chapter as written, and I will call out the things that are specific to Windows, so you'll find it useful on other platforms as well; either refer to that spark-python-install.pdf file or just follow the instructions here on Windows. Let's dive in and get it done.

Installing Enthought Canopy

This book uses Python as its programming language, so the first thing you need is a Python development environment installed on your PC. If you don't have one already, just open up a web browser and head on to https://www.enthought.com/, and we'll install Enthought Canopy:

Enthought Canopy is just my development environment of choice; if you already have a different one, that's probably okay. As long as it's a Python 3 or newer environment, you should be covered, but if you need to install a new Python environment or you just want to minimize confusion, I'd recommend that you install Canopy. So, head to the big friendly Download Canopy button and select your operating system and architecture:

For me, the operating system is going to be Windows (64-bit). Make sure you choose Python 3.5 or a newer version of the package. I can't guarantee that the scripts in this book will work with Python 2.7; they are built for Python 3, so select Python 3.5 for your OS and download the installer:

There's nothing special about it; it's just your standard Windows Installer, or whatever platform you're on. We'll just accept the defaults, go through it, and allow it to become our default Python environment. Then, when we launch it for the first time, it will spend a couple of minutes setting itself up and all the Python packages that we need. You might want to read the license agreement before you accept it; that's up to you. We'll go ahead, start the installation, and let it run.

Once the Canopy installer has finished, we should have a nice little Enthought Canopy icon sitting on our desktop. Now, if you're on Windows, I want you to right-click on the Enthought Canopy icon, go to Properties and then to Compatibility (this is on Windows 10), and make sure Run this program as an administrator is checked:

This will make sure that we have all the permissions we need to run our scripts successfully. You can now double-click on the file to open it up:

The next thing we need is a Java Development Kit, because Spark runs on top of Scala, and Scala runs on top of the Java Runtime Environment.

Installing the Java Development Kit

To install the Java Development Kit, go back to the browser, open a new tab, and just search for jdk (short for Java Development Kit). This will bring you to the Oracle site, from where you can download Java:

On the Oracle website, click on JDK DOWNLOAD. Now, click on Accept License Agreement and then you can select the download option for your operating system:

For me, that's going to be Windows 64-bit and a wait for 198 MB of goodness to download:

Once the download is finished, we can't just accept the default settings in the installer on Windows here. So, this is a Windows-specific workaround, but as of the writing of this book, the current version of Spark is 2.1.1. It turns out there's an issue with Spark 2.1.1 with Java on Windows. The issue is that if you've installed Java to a path that has a space in it, it doesn't work, so we need to make sure that Java is installed to a path that does not have a space in it. This means that you can't skip this step even if you have Java installed already, so let me show you how to do that. On the installer, click on Next, and you will see, as in the following screen, that it wants to install by default to the C:\Program Files\Java\jdk path, whatever the version is:

The space in the Program Files path is going to cause trouble, so let's click on the Change... button and install to c:\jdk, a nice simple path, easy to remember, and with no spaces in it:

Now, it also wants to install the Java Runtime environment; so, just to be safe, I'm also going to install that to a path with no spaces.

At the second step of the JDK installation, we should have this showing on our screen:

I will change that destination folder as well, and we will make a new folder called C:\jre for that:

Alright; successfully installed. Woohoo!

Now, you'll need to remember the path that we installed the JDK into, which, in our case was C:\jdk. We still have a few more steps to go here. So far, we've installed Python and Java, and next we need to install Spark itself.

Installing Spark

Let's get back to a new browser tab here; head to spark.apache.org, and click on the Download Spark button:

Now, we have used Spark 2.1.1 in this book. So, you know, if given the choice, anything beyond 2.0 should work just fine, but that's where we are today.

Make sure you get a pre-built version, and select a Direct Download option so all these defaults are perfectly fine. Go ahead and click on the link next to instruction number 4 to download that package.

Now, it downloads a TGZ (Tar in GZip) file, so, again, Windows is kind of an afterthought with Spark quite honestly because on Windows, you're not going to have a built-in utility for actually decompressing TGZ files. This means that you might need to install one, if you don't have one already. The one I use is called WinRAR, and you can pick that up from www.rarlab.com. Go to the Downloads page if you need it, and download the installer for WinRAR 32-bit or 64-bit, depending on your operating system. Install WinRAR as normal, and that will allow you to actually decompress TGZ files on Windows:

So, let's go ahead and decompress the TGZ files. I'm going to open up my Downloads folder to find the Spark archive that we downloaded, and let's go ahead and right-click on that archive and extract it to a folder of my choosing; just going to put it in my Downloads folder for now. Again, WinRAR is doing this for me at this point:

So I should now have a folder in my Downloads folder associated with that package. Let's open that up and there is Spark itself. So, you need to install that in some place where you will remember it:

You don't want to leave it in your Downloads folder obviously, so let's go ahead and open up a new file explorer window here. I go to my C drive and create a new folder, and let's just call it spark. So, my Spark installation is going to live in C:\spark. Again, nice and easy to remember. Open that folder. Now, I go back to my downloaded spark folder and use Ctrl + A to select everything in the Spark distribution, Ctrl + C to copy it, and then go back to C:\spark, where I want to put it, and Ctrl + V to paste it in:

Remembering to paste the contents of the spark folder, not the spark folder itself, is very important. So what I should have now is my C drive with a spark folder that contains all of the files and folders from the Spark distribution.

Well, there are still a few things we need to configure. So, while we're in C:\spark, let's open up the conf folder, and in order to make sure that we don't get spammed to death by log messages, we're going to change the logging level setting here. To do that, right-click on the log4j.properties.template file and select Rename:

Delete the .template part of the filename to make it an actual log4j.properties file. Spark will use this to configure its logging:

Now, open this file in a text editor of some sort. On Windows, you might need to right-click there and select Open with and then WordPad:

In the file, locate log4j.rootCategory=INFO. Let's change this to log4j.rootCategory=ERROR and this will just remove the clutter of all the log spam that gets printed out when we run stuff. Save the file, and exit your editor.
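For reference, this is a one-line edit. In the Spark 2.1 template, the relevant line looks like the following (the exact wording may vary a bit between Spark versions):

# Before, in log4j.properties:
log4j.rootCategory=INFO, console

# After:
log4j.rootCategory=ERROR, console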

So far, we've installed Python, Java, and Spark. The next thing we need to do is to install something that will trick your PC into thinking that Hadoop exists, and again, this step is only necessary on Windows. So, you can skip this step if you're on Mac or Linux.

Let's go to http://media.sundog-soft.com/winutils.exe. Downloading winutils.exe will give you a little executable, which can be used to trick Spark into thinking that you actually have Hadoop:

Now, since we're going to be running our scripts locally on our desktop, it's not a big deal, and we don't need to have Hadoop installed for real. This just gets around another quirk of running Spark on Windows. So, now that we have that, let's find it in the Downloads folder, press Ctrl + C to copy it, and let's go to our C drive and create a place for it to live:

So, I create a new folder again, and we will call it winutils:

Now let's open this winutils folder and create a bin folder in it:

Now in this bin folder, I want you to paste the winutils.exe file we downloaded. So you should have C:\winutils\bin and then winutils.exe:

This next step is only required on some systems, but just to be safe, open Command Prompt on Windows. You can do that by going to your Start menu and going down to Windows System, and then clicking on Command Prompt. Here, I want you to type cd c:\winutils\bin, which is where we stuck our winutils.exe file. Now if you type dir, you should see that file there. Now type winutils.exe chmod 777 \tmp\hive. This just makes sure that all the file permissions you need to actually run Spark successfully are in place without any errors. You can close Command Prompt now that you're done with that step. Wow, we're almost done, believe it or not.
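Collected in one place, that Command Prompt session looks like this (these are exactly the commands described above; the dir is just to confirm winutils.exe is where it should be):

cd c:\winutils\bin
dir
winutils.exe chmod 777 \tmp\hive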

Now we need to set some environment variables for things to work. I'll show you how to do that on Windows. On Windows 10, you'll need to open up the Start menu and go to Windows System | Control Panel to open up Control Panel:

In Control Panel, click on System and Security:

Then, click on System:

Then click on Advanced system settings from the list on the left-hand side:

From here, click on Environment Variables...:

We will get these options:

Now, this is a very Windows-specific way of setting environment variables. On other operating systems, you'll use different processes, so you'll have to look at how to install Spark on them. Here, we're going to set up some new user variables. Click on the New... button for a new user variable and call it SPARK_HOME, as shown in the following screenshot, all uppercase. This is going to point to where we installed Spark, which for us is c:\spark, so type that in as the Variable value and click on OK:

We also need to set up JAVA_HOME, so click on New... again and type in JAVA_HOME as Variable name. We need to point that to where we installed Java, which for us is c:\jdk:

We also need to set up HADOOP_HOME, and that's where we installed the winutils package, so we'll point that to c:\winutils:

So far, so good. The last thing we need to do is to modify our path. You should have a PATH environment variable here:

Click on the PATH environment variable, then on Edit..., and add a new path. This is going to be %SPARK_HOME%\bin, and I'm going to add another one, %JAVA_HOME%\bin:

Basically, this makes all the binary executables of Spark available to Windows, wherever you're running it from. Click on OK on this menu and on the previous two menus. We finally have everything set up. So, let's go ahead and try it all out in our next step.
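Once you've opened a fresh session (or logged out and back in), a quick way to confirm that the variables took effect is a few lines of Python; this is just a sanity-check sketch, not something the setup requires:

import os

# Each of these should print the path we configured above; None means the
# variable wasn't picked up, so log out and back in and try again.
for name in ("SPARK_HOME", "JAVA_HOME", "HADOOP_HOME"):
    print(name, "=", os.environ.get(name))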

Running Spark code

Let's go ahead and start up Enthought Canopy. Once you get to the Welcome screen, go to the Tools menu and then to Canopy Command Prompt. This will give you a little Command Prompt you can use; it has all the right permissions and environment variables you need to actually run Python.

So type in cd c:\spark, as shown here, which is where we installed Spark in our previous steps:

We'll make sure that we have Spark in there, so you should see all the contents of the Spark distribution pre-built. Let's look at what's in here by typing dir and hitting Enter:

Now, depending on the distribution that you downloaded, there might be a README.md file or a CHANGES.txt file, so pick one or the other; whatever you see there, that's what we're going to use.

We will set up a simple little Spark program here that just counts the number of lines in that file, so let's type in pyspark to kick off the Python version of the Spark interpreter:

If everything is set up properly, you should see something like this:

If you're not seeing this and you're seeing some weird Windows error about not being able to find pyspark, go back and double-check all those environment variables. The odds are that there's something wrong with your path or with your SPARK_HOME environment variables. Sometimes you need to log out of Windows and log back in, in order to get environment variable changes to get picked up by the system; so, if all else fails, try this. Also, if you got cute and installed things to a different path than I recommended in the setup sections, make sure that your environment variables reflect those changes. If you put it in a folder that has spaces in the name, that can cause problems as well. You might run into trouble if your path is too long or if you have too much stuff in your path, so have a look at that if you're encountering problems at this stage. Another possibility is that you're running on a managed PC that doesn't actually allow you to change environment variables, so you might have thought you did it, but there might be some administrative policy preventing you from doing so. If so, try running the set up steps again under a new account that's an administrator if possible. However, assuming you've gotten this far, let's have some fun.
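As a first taste, here is that line-count program typed straight into the pyspark shell (a minimal sketch; I'm assuming your dir listing showed README.md, so substitute CHANGES.txt if that's what your distribution has):

>>> lines = sc.textFile("README.md")
>>> lines.count()

The pyspark shell creates the SparkContext for you as sc; textFile() loads the file as an RDD with one entry per line, and count() is an action that returns how many lines that RDD holds. We'll unpack all of those terms in the next chapter.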