Mastering Spark for Data Science

Andrew Morgan

Description

Data science seeks to transform the world using data, and this is typically achieved through disrupting and changing real processes in real industries. In order to operate at this level you need to build data science solutions of substance: solutions that solve real problems. Spark has emerged as the big data platform of choice for data scientists due to its speed, scalability, and easy-to-use APIs.

This book deep dives into using Spark to deliver production-grade data science solutions. This process is demonstrated by exploring the construction of a sophisticated global news analysis service that uses Spark to generate continuous geopolitical and current affairs insights. You will learn all about the core Spark APIs and take a comprehensive tour of advanced libraries, including Spark SQL, Spark Streaming, MLlib, and more.

You will be introduced to advanced techniques and methods that will help you to construct commercial-grade data products. Focusing on a sequence of tutorials that deliver a working news intelligence service, you will learn about advanced Spark architectures, how to work with geographic data in Spark, and how to tune Spark algorithms so they scale linearly.




Table of Contents

Mastering Spark for Data Science
Credits
Foreword
About the Authors
About the Reviewer
www.PacktPub.com
Why subscribe?
Customer Feedback
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Downloading the color images of this book
Errata
Piracy
Questions
1. The Big Data Science Ecosystem
Introducing the Big Data ecosystem
Data management
Data management responsibilities
The right tool for the job
Overall architecture
Data Ingestion
Data Lake
Reliable storage
Scalable data processing capability
Data science platform
Data Access
Data technologies
The role of Apache Spark
Companion tools
Apache HDFS
Advantages
Disadvantages
Installation
Amazon S3
Advantages
Disadvantages
Installation
Apache Kafka
Advantages
Disadvantages
Installation
Apache Parquet
Advantages
Disadvantages
Installation
Apache Avro
Advantages
Disadvantages
Installation
Apache NiFi
Advantages
Disadvantages
Installation
Apache YARN
Advantages
Disadvantages
Installation
Apache Lucene
Advantages
Disadvantages
Installation
Kibana
Advantages
Disadvantages
Installation
Elasticsearch
Advantages
Disadvantages
Installation
Accumulo
Advantages
Disadvantages
Installation
Summary
2. Data Acquisition
Data pipelines
Universal ingestion framework
Introducing the GDELT news stream
Discovering GDELT in real-time
Our first GDELT feed
Improving with publish and subscribe
Content registry
Choices and more choices
Going with the flow
Metadata model
Kibana dashboard
Quality assurance
Example 1 - Basic quality checking, no contending users
Example 2 - Advanced quality checking, no contending users
Example 3 - Basic quality checking, 50% utility due to contending users
Summary
3. Input Formats and Schema
A structured life is a good life
GDELT dimensional modeling
GDELT model
First look at the data
Core global knowledge graph model
Hidden complexity
Denormalized models
Challenges with flattened data
Issue 1 - Loss of contextual information
Issue 2 - Re-establishing dimensions
Issue 3 - Including reference data
Loading your data
Schema agility
Reality check
GKG ELT
Position matters
Avro
Spark-Avro method
Pedagogical method
When to perform Avro transformation
Parquet
Summary
4. Exploratory Data Analysis
The problem, principles and planning
Understanding the EDA problem
Design principles
General plan of exploration
Preparation
Introducing mask based data profiling
Introducing character class masks
Building a mask based profiler
Setting up Apache Zeppelin
Constructing a reusable notebook
Exploring GDELT
GDELT GKG datasets
The files
Special collections
Reference data
Exploring the GKG v2.1
The Translingual files
A configurable GCAM time series EDA
Plot.ly charting on Apache Zeppelin
Exploring translation sourced GCAM sentiment with plot.ly
Concluding remarks
A configurable GCAM Spatio-Temporal EDA
Introducing GeoGCAM
Does our spatial pivot work?
Summary
5. Spark for Geographic Analysis
GDELT and oil
GDELT events
GDELT GKG
Formulating a plan of action
GeoMesa
Installing
GDELT Ingest
GeoMesa Ingest
MapReduce to Spark
Geohash
GeoServer
Map layers
CQL
Gauging oil prices
Using the GeoMesa query API
Data preparation
Machine learning
Naive Bayes
Results
Analysis
Summary
6. Scraping Link-Based External Data
Building a web scale news scanner
Accessing the web content
The Goose library
Integration with Spark
Scala compatibility
Serialization issues
Creating a scalable, production-ready library
Build once, read many
Exception handling
Performance tuning
Named entity recognition
Scala libraries
NLP walkthrough
Extracting entities
Abstracting methods
Building a scalable code
Build once, read many
Scalability is also a state of mind
Performance tuning
GIS lookup
GeoNames dataset
Building an efficient join
Offline strategy - Bloom filtering
Online strategy - Hash partitioning
Content deduplication
Context learning
Location scoring
Names de-duplication
Functional programming with Scalaz
Our de-duplication strategy
Using the mappend operator
Simple clean
DoubleMetaphone
News index dashboard
Summary
7. Building Communities
Building a graph of persons
Contact chaining
Extracting data from Elasticsearch
Using the Accumulo database
Setup Accumulo
Cell security
Iterators
Elasticsearch to Accumulo
A graph data model in Accumulo
Hadoop input and output formats
Reading from Accumulo
AccumuloGraphxInputFormat and EdgeWritable
Building a graph
Community detection algorithm
Louvain algorithm
Weighted Community Clustering (WCC)
Description
Preprocessing stage
Initial communities
Message passing
Community back propagation
WCC iteration
Gathering community statistics
WCC Computation
WCC iteration
GDELT dataset
The Bowie effect
Smaller communities
Using Accumulo cell level security
Summary
8. Building a Recommendation System
Different approaches
Collaborative filtering
Content-based filtering
Custom approach
Uninformed data
Processing bytes
Creating a scalable code
From time to frequency domain
Fast Fourier transform
Sampling by time window
Extracting audio signatures
Building a song analyzer
Selling data science is all about selling cupcakes
Using Cassandra
Using the Play framework
Building a recommender
The PageRank algorithm
Building a Graph of Frequency Co-occurrence
Running PageRank
Building personalized playlists
Expanding our cupcake factory
Building a playlist service
Leveraging the Spark job server
User interface
Summary
9. News Dictionary and Real-Time Tagging System
The mechanical Turk
Human intelligence tasks
Bootstrapping a classification model
Learning from Stack Exchange
Building text features
Training a Naive Bayes model
Laziness, impatience, and hubris
Designing a Spark Streaming application
A tale of two architectures
The CAP theorem
The Greeks are here to help
Importance of the Lambda architecture
Importance of the Kappa architecture
Consuming data streams
Creating a GDELT data stream
Creating a Kafka topic
Publishing content to a Kafka topic
Consuming Kafka from Spark Streaming
Creating a Twitter data stream
Processing Twitter data
Extracting URLs and hashtags
Keeping popular hashtags
Expanding shortened URLs
Fetching HTML content
Using Elasticsearch as a caching layer
Classifying data
Training a Naive Bayes model
Thread safety
Predict the GDELT data
Our Twitter mechanical Turk
Summary
10. Story De-duplication and Mutation
Detecting near duplicates
First steps with hashing
Standing on the shoulders of the Internet giants
Simhashing
The Hamming weight
Detecting near duplicates in GDELT
Indexing the GDELT database
Persisting our RDDs
Building a REST API
Area of improvement
Building stories
Building term frequency vectors
The curse of dimensionality, the data science plague
Optimizing KMeans
Story mutation
The Equilibrium state
Tracking stories over time
Building a streaming application
Streaming KMeans
Visualization
Building story connections
Summary
11. Anomaly Detection on Sentiment Analysis
Following the US elections on Twitter
Acquiring data in stream
Acquiring data in batch
The search API
Rate limit
Analysing sentiment
Massaging Twitter data
Using the Stanford NLP
Building the Pipeline
Using Timely as a time series database
Storing data
Using Grafana to visualize sentiment
Number of processed tweets
Give me my Twitter account back
Identifying the swing states
Twitter and the Godwin point
Learning context
Visualizing our model
Word2Graph and Godwin point
Building a Word2Graph
Random walks
A Small Step into sarcasm detection
Building features
#LoveTrumpsHates
Scoring Emojis
Training a KMeans model
Detecting anomalies
Summary
12. TrendCalculus
Studying trends
The TrendCalculus algorithm
Trend windows
Simple trend
User Defined Aggregate Functions
Simple trend calculation
Reversal rule
Introducing the FHLS bar structure
Visualize the data
FHLS with reversals
Edge cases
Zero values
Completing the gaps
Stackable processing
Practical applications
Algorithm characteristics
Advantages
Disadvantages
Possible use cases
Chart annotation
Co-trending
Data reduction
Indexing
Fractal dimension
Streaming proxy for piecewise linear regression
Summary
13. Secure Data
Data security
The problem
The basics
Authentication and authorization
Access control lists (ACL)
Role-based access control (RBAC)
Access
Encryption
Data at rest
Java KeyStore
S3 encryption
Data in transit
Obfuscation/Anonymizing
Masking
Tokenization
Using a Hybrid approach
Data disposal
Kerberos authentication
Use case 1: Apache Spark accessing data in secure HDFS
Use case 2: extending to automated authentication
Use case 3: connecting to secure databases from Spark
Security ecosystem
Apache Sentry
RecordService
Apache Ranger
Apache Knox
Your Secure Responsibility
Summary
14. Scalable Algorithms
General principles
Spark architecture
History of Spark
Moving parts
Driver
SparkSession
Resilient distributed datasets (RDDs)
Executor
Shuffle operation
Cluster Manager
Task
DAG
DAG scheduler
Transformations
Stages
Actions
Task scheduler
Challenges
Algorithmic complexity
Numerical anomalies
Shuffle
Data schemes
Plotting your course
Be iterative
Data preparation
Scale up slowly
Estimate performance
Step through carefully
Tune your analytic
Design patterns and techniques
Spark APIs
Problem
Solution
Example
Summary pattern
Problem
Solution
Example
Expand and Conquer Pattern
Problem
Solution
Lightweight Shuffle
Problem
Solution
Wide Table pattern
Problem
Solution
Example
Broadcast variables pattern
Problem
Solution
Creating a broadcast variable
Accessing a broadcast variable
Removing a broadcast variable
Example
Combiner pattern
Problem
Solution
Example
Optimized cluster
Problem
Solution
Redistribution pattern
Problem
Solution
Example
Salting key pattern
Problem
Solution
Secondary sort pattern
Problem
Solution
Example
Filter overkill pattern
Problem
Solution
Probabilistic algorithms
Problem
Solution
Example
Selective caching
Problem
Solution
Garbage collection
Problem
Solution
Graph traversal
Problem
Solution
Example
Summary

Mastering Spark for Data Science

Mastering Spark for Data Science

Copyright © 2017 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: March 2017

Production reference: 1240317

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham 

B3 2PB, UK.

ISBN 978-1-78588-214-2

www.packtpub.com

Credits

Authors

Andrew Morgan

Antoine Amend

David George

Matthew Hallett

Copy Editor

Safis Editing

Reviewer

Sumit Pal

Project Coordinator

Shweta H Birwatkar 

Commissioning Editor

Akram Hussain

Proofreader

Safis Editing

Acquisition Editor

Vinay Argekar

Indexer

Pratik Shirodkar

Content Development Editor

Amrita Noronha

Graphics

Tania Dutta 

Technical Editor

Sneha Hanchate

Production Coordinator

Arvindkumar Gupta

Foreword

The impact of Spark on the world of data science has been startling. It is less than 3 years since Spark 1.0 was released and yet Spark is already accepted as the omni-competent kernel of any big data architecture. We adopted Spark as our core technology at Barclays around this time and this was considered a bold (read ‘rash’) move. Now it is taken as a given that Spark is your starting point for any big data science project.

As data science has developed both as an activity and as an accepted term, there has been much talk about the unicorn data scientist. This is the unlikely character who can do both the maths and the coding. They are apparently hard to find, and harder to keep. My team likes to think more in terms of three data science competencies: pattern recognition, distributed computation, and automation. If data science is about exploiting insights from data in production, then you need to be able to develop applications with these three competencies in mind from the start. There is no point using a machine learning methodology that won’t scale with your data, or building an analytical kernel that needs to be re-coded to be production quality. And so you need either a unicorn or a unicorn-team (my preference) to do the work.

Spark is your unicorn technology. No other technology expresses analytical concepts so elegantly, moves so effortlessly from small scale to big data, and facilitates production-ready code so naturally as Spark (with the Scala API). We chose Spark because we could compose a model in a few lines, run the same code on the cluster that we had tried out on the laptop, and build robust unit-tested JVM applications that we could be confident would run in business-critical use cases. The combination of functional programming in Scala with the Spark abstractions is uniquely powerful, and choosing it has been a significant cause of the success of the team over the last 3 years.

So here's the conundrum. Why are there no books which present Spark in this way, recognizing that one of the best reasons to work in Spark is its application to production data science? If you scan the bookshelves (or look at tutorials online) all you will find is toy models and a review of the Spark APIs and libs. You will find little or nothing about how Spark fits into the wider architecture, or about how to manage data ETL in a sustainable way.

I think you will find that the practical approach taken by the authors in this book is different. Each chapter takes on a new challenge, and each reads as a voyage of discovery where the outcome was not necessarily known in advance of the exploration. And the value of doing data science properly is set out clearly from the start. This is one of the first books on Spark for grown-ups who want to do real data science that will make an impact on their organisation. I hope you enjoy it.

Harry Powell

Head of Advanced Analytics, Barclays

About the Authors

Andrew Morgan is a specialist in data strategy and its execution, and has deep experience in the supporting technologies, system architecture, and data science that bring it to life. With over 20 years of experience in the data industry, he has worked designing systems for some of its most prestigious players and their global clients – often on large, complex and international projects. In 2013, he founded ByteSumo Ltd, a data science and big data engineering consultancy, and he now works with clients in Europe and the USA. Andrew is an active data scientist, and the inventor of the TrendCalculus algorithm. It was developed as part of his ongoing research project investigating long-range predictions based on machine learning the patterns found in drifting cultural, geopolitical and economic trends. He also sits on the Hadoop Summit EU data science selection committee, and has spoken at many conferences on a variety of data topics. He also enjoys participating in the Data Science and Big Data communities where he lives in London.

This book is dedicated to my wife Steffy, to my children Alice and Adele, and to all my friends and colleagues who have been endlessly supportive. It is also dedicated to the memory of one of my earliest mentors, Professor Ferenc Csillag, under whom I studied at the University of Toronto. Back in 1994, Ferko inspired me with visions of a future where we could use planet-wide data collection and sophisticated algorithms to monitor and optimize the world around us. It was an idea that changed my life, and his dream of a world saved by Big Data Science is one I'm still chasing.

Antoine Amend is a data scientist passionate about big data engineering and scalable computing. The book's theme of torturing astronomical amounts of unstructured data to gain new insights mainly comes from his background in theoretical physics. Graduating in 2008 with an MSc in Astrophysics, he worked for a large consultancy business in Switzerland before discovering the concept of big data in the early days of Hadoop. He has embraced big data technologies ever since, and is now working as the Head of Data Science for cyber security at Barclays Bank. By combining a scientific approach with core IT skills, Antoine qualified two years running for the Big Data World Championships finals held in Austin, TX. He placed in the top 12 in both the 2014 and 2015 editions (out of more than 2,000 competitors), where he additionally won the Innovation Award using the methodologies and technologies explained in this book.

I would like to thank my wife for standing beside me; she has been my motivation for continuing to improve my knowledge and move my career forward. I thank my wonderful kids for always teaching me how to step back whenever it is necessary to clear my mind and get fresh new ideas.

I would like to extend my thanks to my co-workers, especially Dr. Samuel Assefa, Dr. Eirini Spyropoulou and Will Hardman, for their patience in listening to my crazy theories, and to everyone else I have had the pleasure of working with over the past few years. Finally, I want to address a special thanks to all my previous managers and mentors who helped me shape my career in data and analytics; thanks to Manu, Toby, Gary and Harry.

David George is a distinguished distributed computing expert with 15+ years of data systems experience, mainly with globally recognized IT consultancies and brands. Working with core Hadoop technologies since the early days, he has delivered implementations at the largest scale. David always takes a pragmatic approach to software design and values elegance in simplicity.

Today he continues to work as a lead engineer, designing scalable applications for financial sector customers with some of the toughest requirements. His latest projects focus on the adoption of advanced AI techniques for increasing levels of automation across knowledge-based industries.

For Ellie, Shannon, Pauline and Pumpkin – here’s to the sequel!

Matthew Hallett is a Software Engineer and Computer Scientist with over 15 years of industry experience. He is an expert object-oriented programmer and systems engineer with extensive knowledge of low-level programming paradigms and, for the last 8 years, has developed an expertise in Hadoop and distributed programming within mission-critical environments comprising multi-thousand-node data centres. With consultancy experience in distributed algorithms and the implementation of distributed computing architectures in a variety of languages, Matthew is currently a Consultant Data Engineer in the Data Science & Engineering team at a top four audit firm.

Lynnie, thanks for your understanding and sacrifices that afforded me the time during late nights, weekends and holidays to write this book. Nugget, you make it all worthwhile.

We would also like to thank Gary Richardson, Dr David Pryce, Dr Helen Ramsden, Dr Sima Reichenbach and Dr Fabio Petroni for their invaluable advice and guidance that has led to the completion of this huge project – without their help and contributions this book may never have been completed!

About the Reviewer

Sumit Pal is an author who has published SQL on Big Data - Technology, Architecture and Innovations with Apress. Sumit has more than 22 years of experience in the software industry in various roles spanning companies from startups to enterprises.

He is an independent consultant working with big data, data visualization, and data science and a software architect building end-to-end, data-driven analytic systems. 

Sumit has worked for Microsoft (SQL server development team), Oracle (OLAP development team), and Verizon (Big Data analytics team) in a career spanning 22 years. Currently, he works for multiple clients advising them on their data architectures and big data solutions and does hands-on coding with Spark, Scala, Java, and Python. 

Sumit has spoken at big data conferences in Boston, Chicago, Las Vegas, and Vancouver. His Apress book mentioned above was published in October 2016.

He has extensive experience in building scalable systems across the stack, from the middle tier and data tier to visualization for analytics applications, using big data and NoSQL databases. Sumit has deep expertise in database internals, data warehouses, dimensional modeling, and data science with Java, Python, and SQL.

Sumit started his career as part of the SQL Server development team at Microsoft in 1996-97, and then worked as a core server engineer on Oracle Corporation's OLAP development team in Burlington, MA.

Sumit has also worked at Verizon as an associate director for big data architecture, where he strategized, managed, architected, and developed platforms and solutions for analytics and machine learning applications.

Sumit has also served as a chief architect at ModelN/LeapfrogRX (2006-2013), where he architected the middle tier core analytics platform with open source olap engine (Mondrian) on J2EE and solved some complex dimensional ETL, modeling, and performance optimization problems.

Sumit holds an MS and a BS in computer science.

Sumit hiked to Mt. Everest Base Camp in October 2016.

www.PacktPub.com

For support files and downloads related to your book, please visit www.PacktPub.com.

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

https://www.packtpub.com/mapt

Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.

Why subscribe?

Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser

Customer Feedback

Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial process. To help us improve, please leave us an honest review on this book's Amazon page at https://www.amazon.in/Mastering-Spark-Science-Andrew-Morgan-ebook/dp/B01BWNXA82?_encoding=UTF8&keywords=mastering%20spark%20for%20data%20science&qid=1490239942&ref_=sr_1_1&sr=8-1.

If you'd like to join our team of regular reviewers, you can e-mail us at [email protected]. We award our regular reviewers with free eBooks and videos in exchange for their valuable feedback. Help us be relentless in improving our products!

Preface

The purpose of data science is to transform the world using data, and this goal is mainly achieved through disrupting and changing real processes in real industries. To operate at that level we need to be able to build data science solutions of substance; ones that solve real problems, and which can run reliably enough for people to trust and act upon.

This book explains how to use Spark to deliver production grade data science solutions that are innovative, disruptive, and reliable enough to be trusted. Whilst writing this book it was the authors’ intention to deliver a work that transcends the traditional cookbook style: providing not just examples of code, but developing the techniques and mind-set that are needed to explore content like a master; as they say, Content is King! Readers will notice that the book has a heavy emphasis on news analytics, and occasionally pulls in other datasets such as Tweets and financial data. This emphasis on news is not an accident; much effort has been spent on trying to focus on datasets that offer context at a global scale.

The implicit problem that this book is dedicated to is the lack of data offering proper context around how and why people make decisions. Often, directly accessible data sources are very focused on problem specifics and, as a consequence, can be very light on broader datasets offering the behavioral context needed to really understand what’s driving the decisions that people make.

Consider a simple example where website users' key information, such as age, gender, location, shopping behavior, purchases, and so on, is known; we might use this data to recommend products based on what others "like them" have been buying.

But to be exceptional, more context is required as to why people behave as they do. When news reports suggest that a massive Atlantic hurricane is approaching the Florida coastline and could reach the coast in, say, 36 hours, perhaps we should be recommending products people might need: items such as USB-enabled battery packs for keeping phones charged, candles, flashlights, water purifiers, and the like. By understanding the context in which decisions are being made, we can conduct better science.

Therefore, whilst this book certainly contains useful code and, in many cases, unique implementations, it further dives deep into the techniques and skills required to truly master data science; some of which are often overlooked or not considered at all. Drawing on many years of commercial experience, the authors have leveraged their extensive knowledge to bring the real, and exciting world of data science to life.

What this book covers

Chapter 1, The Big Data Science Ecosystem, this chapter is an introduction to an approach and accompanying ecosystem for achieving success with data at scale. It focuses on the data science tools and technologies that will be used in later chapters as well as introducing the environment and how to configure it appropriately. Additionally it explains some of the non-functional considerations relevant to the overall data architecture and long-term success.

Chapter 2, Data Acquisition, as a data scientist, one of the most important tasks is to accurately load data into a data science platform. Rather than having uncontrolled, ad hoc processes, this chapter explains how a general data ingestion pipeline in Spark can be constructed that serves as a reusable component across many feeds of input data.

Chapter 3, Input Formats and Schema, this chapter demonstrates how to load data from its raw format onto different schemas, therefore enabling a variety of different kinds of downstream analytics to be run over the same data. With this in mind, we will look at the traditionally well-understood area of data schemas. We will cover key areas of traditional database modeling and explain how some of these cornerstone principles are still applicable to Spark today. In addition, whilst honing our Spark skills we will analyze the GDELT data model and show how to store this large dataset in an efficient and scalable manner.

Chapter 4, Exploratory Data Analysis, a common misconception is that an EDA is only for discovering the statistical properties of a dataset and providing insights about how it can be exploited. In practice, this isn’t the full story. A full EDA will extend that idea, and include a detailed assessment of the “feasibility of using this Data Feed in production.” It requires us to also understand how we would specify a production grade data loading routine for this dataset, one that might potentially run in a “lights out mode” for many years. This chapter offers a rapid method for doing Data Quality assessment using a “data profiling” technique to accelerate the process.

Chapter 5, Spark for Geographic Analysis, geographic processing is a powerful new use case for Spark, and this chapter demonstrates how to get started. The aim of this chapter is to explain how Data Scientists can process geographic data, using Spark, to produce powerful map based views of very large datasets. We demonstrate how to process spatio-temporal datasets easily via Spark integrations with Geomesa, which helps turn Spark into a sophisticated geographic processing engine. The chapter later leverages this spatio-temporal data to apply machine learning with a view to predicting oil prices.

Chapter 6, Scraping Link-Based External Data, this chapter aims to explain a common pattern for enhancing local data with external content found at URLs or over APIs, such as GDELT and Twitter. We offer a tutorial using the GDELT news index service as a source of news URLS, demonstrating how to build a web scale News Scanner that scrapes global breaking news of interest from the internet. We further explain how to use the specialist web-scraping component in a way that overcomes the challenges of scale, followed by the summary of this chapter.

Chapter 7, Building Communities, this chapter aims to address a common use case in data science and big data. With more and more people interacting together, communicating, exchanging information, or simply sharing a common interest in different topics, the entire world can be represented as a Graph. A data scientist must be able to detect communities, find influencers / top contributors, and detect possible anomalies.

Chapter 8, Building a Recommendation System, if one were to choose an algorithm to showcase data science to the public, a recommendation system would certainly be in the frame. Today, recommendation systems are everywhere; the reason for their popularity is down to their versatility, usefulness and broad applicability. In this chapter, we will demonstrate how to recommend music content using raw audio signals.

Chapter 9, News Dictionary and Real-Time Tagging System, while a hierarchical data warehouse stores data in files and folders, a typical Hadoop based system relies on a flat architecture to store your data. Without proper data governance or a clear understanding of what your data is all about, there is an undeniable chance of turning data lakes into swamps, where an interesting dataset such as GDELT would be nothing more than a folder containing a vast amount of unstructured text files. In this chapter, we will describe an innovative way of labeling incoming GDELT data in an unsupervised way and in near real time.

Chapter 10, Story De-duplication and Mutation, in this chapter, we de-duplicate and index the GDELT database into stories, before tracking stories over time and understanding the links between them, how they may mutate and if they could lead to any subsequent events in the near future. Core to this chapter is the concept of Simhash to detect near duplicates and building vectors to reduce dimensionality using Random Indexing.

Chapter 11, Anomaly Detection and Sentiment Analysis, perhaps the most notable occurrence of the year 2016 was the tense US presidential election and its eventual outcome: the election of President Donald Trump, a campaign that will long be remembered; not least for its unprecedented use of social media and the stirring up of passion among its users, most of whom made their feelings known through the use of hashtags. In this chapter, instead of trying to predict the outcome itself, we will aim to detect abnormal tweets during the US election using a real-time Twitter feed.

Chapter 12, TrendCalculus, long before the concept of “what’s trending” became a popular topic of study by data scientists, there was an older one that is still not well served by data science; it is that of Trends. Presently, the analysis of trends, if it can be called that, is primarily carried out by people “eyeballing” time series charts and offering interpretations. But what is it that people’s eyes are doing? This chapter describes an implementation in Apache Spark of a new algorithm for studying trends numerically: TrendCalculus.

Chapter 13, Secure Data, throughout this book we visit many areas of data science, often straying into those that are not traditionally associated with a data scientist’s core working knowledge. In this chapter we will visit another of those often overlooked fields, Secure Data; more specifically, how to protect your data and analytic results at all stages of the data life cycle. Core to this chapter is the construction of a commercial grade encryption codec for Spark.

Chapter 14, Scalable Algorithms, in this chapter we learn about why sometimes even basic algorithms, despite working at small scale, will often fail in “big data”. We’ll see how to avoid issues when writing Spark jobs that run over massive Datasets and will learn about the structure of algorithms and how to write custom data science analytics that scale over petabytes of data. The chapter features areas such as: parallelization strategies, caching, shuffle strategies, garbage collection optimization and probabilistic models; explaining how these can help you to get the most out of the Spark paradigm.

What you need for this book

Spark 2.0 is used throughout the book, along with Scala 2.11, Maven, and Hadoop. This is the basic environment required; many other technologies are also used, and these are introduced in the relevant chapters.

Who this book is for

We presume that the data scientists reading this book are knowledgeable about data science, common machine learning methods, and popular data science tools, and have, in the course of their work, run proof-of-concept studies and built prototypes. We offer this audience a book that introduces advanced techniques and methods for building data science solutions, showing them how to construct commercial-grade data products.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book: what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of. To send us general feedback, simply e-mail [email protected], and mention the book's title in the subject of your message. If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit  http://www.packtpub.com/support and register to have the files e-mailed directly to you.

You can download the code files by following these steps:

1. Log in or register to our website using your e-mail address and password.
2. Hover the mouse pointer on the SUPPORT tab at the top.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box.
5. Select the book for which you're looking to download the code files.
6. Choose from the drop-down menu where you purchased this book from.
7. Click on Code Download.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

WinRAR / 7-Zip for Windows
Zipeg / iZip / UnRarX for Mac
7-Zip / PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Mastering-Spark-for-Data-Science. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Downloading the color images of this book

We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from https://www.packtpub.com/sites/default/files/downloads/MasteringSparkforDataScience_ColorImages.pdf.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books, whether in the text or the code, we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.

To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at [email protected] with a link to the suspected pirated material.

We appreciate your help in protecting our authors and our ability to bring you valuable content.

Questions

If you have a problem with any aspect of this book, you can contact us at [email protected], and we will do our best to address the problem.

Chapter 1. The Big Data Science Ecosystem

As a data scientist, you'll no doubt be very familiar with handling files and processing data, perhaps even in large amounts. However, as I'm sure you will agree, doing anything more than a simple analysis over a single type of data requires a method of organizing and cataloguing data so that it can be managed effectively. Indeed, this is the cornerstone of a great data scientist. As data volume and complexity increase, a consistent and robust approach can be the difference between generalized success and over-fitted failure!

This chapter is an introduction to an approach and ecosystem for achieving success with data at scale. It focuses on the data science tools and technologies. It introduces the environment, and how to configure it appropriately, but also explains some of the nonfunctional considerations relevant to the overall data architecture. While there is little actual data science at this stage, it provides the essential platform to pave the way for success in the rest of the book.

In this chapter, we will cover the following topics:

Data management responsibilities
Data architecture
Companion tools

Introducing the Big Data ecosystem

Data management is of particular importance, especially when the data is in flux; either constantly changing or being routinely produced and updated. What is needed in these cases is a way of storing, structuring, and auditing data that allows for the continuous processing and refinement of models and results.

Here, we describe how to best hold and organize your data to integrate with Apache Spark and related tools within the context of a data architecture that is broad enough to fit the everyday requirement.

Data management

Even if, in the medium term, you only intend to play around with a bit of data at home, without proper data management efforts will, more often than not, escalate to the point where it is easy to lose track of where you are, and mistakes will happen. Taking the time to think about the organization of your data, and in particular its ingestion, is crucial. There's nothing worse than waiting for a long-running analytic to complete, collating the results and producing a report, only to discover that you used the wrong version of the data, or that the data is incomplete, has missing fields, or, even worse, that you deleted your results!

The bad news is that, despite its importance, data management is an area that is consistently overlooked in both commercial and non-commercial ventures, with precious few off-the-shelf solutions available. The good news is that it is much easier to do great data science using the fundamental building blocks that this chapter describes.

Data management responsibilities

When we think about data, it is easy to overlook the true extent of the scope of the areas we need to consider. Indeed, most data "newbies" think about the scope in this way:

Obtain data
Place the data somewhere (anywhere)
Use the data
Throw the data away

In reality, there are a large number of other considerations; it is our combined responsibility to determine which ones apply to a given piece of work. The following data management building blocks assist in answering or tracking some important questions about the data:

File integrity
- Is the data file complete?
- How do you know?
- Was it part of a set?
- Is the data file correct?
- Was it tampered with in transit? (a minimal checksum sketch follows this list)

Data integrity
- Is the data as expected?
- Are all of the fields present?
- Is there sufficient metadata?
- Is the data quality sufficient?
- Has there been any data drift?

Scheduling
- Is the data routinely transmitted?
- How often does the data arrive?
- Was the data received on time?
- Can you prove when the data was received?
- Does it require acknowledgement?

Schema management
- Is the data structured or unstructured?
- How should the data be interpreted?
- Can the schema be inferred?
- Has the data changed over time?
- Can the schema be evolved from the previous version?

Version management
- What is the version of the data?
- Is the version correct?
- How do you handle different versions of the data?
- How do you know which version you're using?

Security
- Is the data sensitive?
- Does it contain personally identifiable information (PII)?
- Does it contain personal health information (PHI)?
- Does it contain payment card information (PCI)?
- How should I protect the data?
- Who is entitled to read/write the data?
- Does it require anonymization/sanitization/obfuscation/encryption?

Disposal
- How do we dispose of the data?
- When do we dispose of the data?
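As a minimal sketch of the file-integrity questions above, the following Scala snippet compares a file's SHA-256 digest against a value published by the data provider. The file name and expected digest here are placeholders, not values from any real feed, and for very large files you would stream the bytes rather than read them all into memory.

import java.nio.file.{Files, Paths}
import java.security.MessageDigest

// Minimal file-integrity check: compute a SHA-256 digest and compare it with
// the value the data provider published alongside the file.
object FileIntegrityCheck {

  def sha256Hex(path: String): String = {
    val bytes  = Files.readAllBytes(Paths.get(path))   // fine for a sketch; stream for very large files
    val digest = MessageDigest.getInstance("SHA-256").digest(bytes)
    digest.map("%02x".format(_)).mkString
  }

  def main(args: Array[String]): Unit = {
    // Placeholder values for illustration only.
    val expected = "0e5751c026e543b2e8ab2eb06099daa1d1e5df47778f7787faab45cdf12fe3a8"
    val actual   = sha256Hex("incoming/feed-2017-03-01.tsv")
    println(if (actual == expected) "File integrity check passed"
            else "Checksum mismatch - reject the file")
  }
}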

If, after all that, you are still not convinced, before you go ahead and write that bash script using the gawk and crontab commands, keep reading and you will soon see that there is a far quicker, more flexible, and safer method that allows you to start small and incrementally create commercial-grade ingestion pipelines!

The right tool for the job

Apache Spark is the emerging de facto standard for scalable data processing. At the time of writing this book, it is the most active Apache Software Foundation (ASF) project and has a rich variety of companion tools available. There are new projects appearing every day, many of which overlap in functionality. So it takes time to learn what they do and decide whether they are appropriate to use. Unfortunately, there's no quick way around this. Usually, specific trade-offs must be made on a case-by-case basis; there is rarely a one-size-fits-all solution. Therefore, the reader is encouraged to explore the available tools and choose wisely!

Various technologies are introduced throughout this book, and the hope is that they will give the reader a taste of some of the more useful and practical ones, to a level where they may start utilizing them in their own projects. Further, we hope to show that if the code is written carefully, technologies may be interchanged through clever use of Application Programming Interfaces (APIs) (or higher-order functions in Spark Scala), even when a decision proves to be incorrect.
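As a rough illustration of that idea (not code from the book's bundle), the following Scala sketch uses a higher-order function to decouple an analytic from the technology that persists its results, so the backend can be swapped without touching the analytic itself. The object name, input path, and writer functions are all assumptions made for the example.

import org.apache.spark.sql.{DataFrame, SparkSession}

object InterchangeableSinks {

  // The analytic only knows it must hand its output to *some* writer function.
  def runAnalytic(spark: SparkSession, input: String)(write: DataFrame => Unit): Unit = {
    val counts = spark.read.textFile(input).toDF("line").groupBy("line").count()
    write(counts) // the persistence technology is injected, not hard-wired
  }

  // Two interchangeable writer implementations.
  val toParquet: String => DataFrame => Unit =
    path => df => df.write.mode("overwrite").parquet(path)

  val toConsole: DataFrame => Unit =
    df => df.show(20, truncate = false)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("sink-demo").getOrCreate()
    // Swap toConsole for toParquet("/tmp/counts") without changing runAnalytic.
    runAnalytic(spark, "data/sample.txt")(toConsole)
    spark.stop()
  }
}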

Overall architecture

Let's start with a high-level introduction to data architectures: what they do, why they're useful, when they should be used, and how Apache Spark fits in.

At their most general, modern data architectures have four basic characteristics:

Data Ingestion
Data Lake
Data Science
Data Access

Let's introduce each of these now, so that we can go into more detail in the later chapters.

Data Ingestion

Traditionally, data is ingested under strict rules and formatted according to a predetermined schema. This process is known as Extract, Transform, Load (ETL), and is still a very common practice supported by a large array of commercial tools as well as some open source products.

The ETL approach favors performing up-front checks, which ensure data quality and schema conformance, in order to simplify follow-on online analytical processing. It is particularly suited to handling data with a specific set of characteristics, namely, those that relate to a classical entity-relationship model. However, it is not suitable for all scenarios.

During the big data revolution, there was a metaphorical explosion of demand for structured, semi-structured, and unstructured data, leading to the creation of systems that were required to handle data with a different set of characteristics. These came to be defined by the phrase the 4 Vs: Volume, Variety, Velocity, and Veracity (http://www.ibmbigdatahub.com/infographic/four-vs-big-data). While traditional ETL methods floundered under this new burden, because they either required too much time to process the vast quantities of data or were too rigid in the face of change, a different approach emerged. Enter the schema-on-read paradigm. Here, data is ingested in its original form (or at least very close to it), and the details of normalization, validation, and so on are handled at the time of analytical processing.

This is typically referred to as Extract, Load, Transform (ELT), a reordering of the steps of the traditional approach:

This approach values the delivery of data in a timely fashion, delaying the detailed processing until it is absolutely required. In this way, a data scientist can gain access to the data immediately, searching for insight using a range of techniques not available with a traditional approach.

Although we only provide a high-level overview here, this approach is so important that we will explore it further throughout the book by implementing various schema-on-read algorithms. We will assume the ELT method for data ingestion, that is to say we encourage the loading of data at the user's convenience. This may be every n minutes, overnight, or during times of low usage. The data can then be checked for integrity, quality, and so forth by running batch processing jobs offline, again at the user's discretion.
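To make the ELT idea concrete, here is a minimal schema-on-read sketch in Scala: the feed is landed untouched, and a schema is imposed only when an analysis actually needs it. The file paths, field names, and schema are illustrative assumptions, not the book's GDELT examples.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructField, StructType, TimestampType}

object SchemaOnRead {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("elt-demo").getOrCreate()

    // Extract + Load: persist the feed exactly as it arrived.
    spark.read.text("incoming/feed-2017-03-01.tsv")
      .write.mode("append").text("lake/raw/feed")

    // Transform (later, at read time): impose structure only when required.
    val schema = StructType(Seq(
      StructField("eventId", StringType),
      StructField("eventTime", TimestampType),
      StructField("payload", StringType)
    ))

    val structured = spark.read
      .schema(schema)
      .option("sep", "\t")
      .csv("lake/raw/feed")

    structured.createOrReplaceTempView("feed")
    spark.sql("SELECT eventId, eventTime FROM feed WHERE payload IS NOT NULL").show()
    spark.stop()
  }
}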

Data Lake

A data lake is a convenient, ubiquitous store of data. It is useful because it provides a number of key benefits, primarily:

Reliable storageScalable data processing capability

Let's take a brief look at each of these.

Reliable storage

There is a good choice of underlying storage implementations for a data lake; these include the Hadoop Distributed File System (HDFS), MapR-FS, and Amazon AWS S3.

Throughout the book, HDFS will be the assumed storage implementation. Also, in this book the authors use a distributed Spark setup, deployed on Yet Another Resource Negotiator (YARN) running inside a Hortonworks HDP environment. Therefore, HDFS is the technology used, unless otherwise stated. If you are not familiar with any of these technologies, they are discussed further on in this chapter.

In any case, it's worth knowing that Spark references HDFS locations natively, accesses local file locations via the prefix file:// and references S3 locations via the prefix s3a://.
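A short sketch of those path prefixes follows; the bucket and file names are placeholders, and s3a:// access additionally assumes the hadoop-aws libraries and AWS credentials are configured on the cluster.

import org.apache.spark.sql.SparkSession

object StoragePrefixes {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("storage-demo").getOrCreate()

    val fromHdfs  = spark.read.text("hdfs:///data/news/gkg.csv")    // HDFS (or simply /data/news/gkg.csv)
    val fromLocal = spark.read.text("file:///tmp/sample/gkg.csv")   // local filesystem
    val fromS3    = spark.read.text("s3a://my-bucket/news/gkg.csv") // Amazon S3 via the s3a connector

    println(fromHdfs.count() + fromLocal.count() + fromS3.count())
    spark.stop()
  }
}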

Scalable data processing capability

Clearly, Apache Spark will be our data processing platform of choice. In addition, as you may recall, Spark allows the user to execute code in their preferred environment, be that local, standalone, YARN, or Mesos, by configuring the appropriate cluster manager in the master URL. Incidentally, this can be done in any one of three locations, as illustrated in the sketch after this list:

Using the --master option when issuing the spark-submit command
Adding the spark.master property in the conf/spark-defaults.conf file
Invoking the setMaster method on the SparkConf object
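The sketch below shows the third option programmatically, with the first two included as comments for comparison. The app name, jar name, and the local[*] master value are example assumptions only.

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Option 1 (command line):  spark-submit --master yarn myapp.jar
// Option 2 (configuration): conf/spark-defaults.conf ->  spark.master  yarn
object MasterConfig {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("master-demo")
      .setMaster("local[*]") // option 3: SparkConf.setMaster

    val spark = SparkSession.builder().config(conf).getOrCreate()
    println(s"Running against master: ${spark.sparkContext.master}")
    spark.stop()
  }
}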

If you're not familiar with HDFS, or if you do not have access to a cluster, then you can run a local Spark instance using the local filesystem, which is useful for testing. However, beware that there are often bad behaviors that only appear when executing on a cluster. So, if you're serious about Spark, it's worth investing in a distributed cluster manager; why not try Spark standalone cluster mode, or Amazon AWS EMR? For example, Amazon offers a number of affordable paths to cloud computing; you can explore the idea of spot instances at https://aws.amazon.com/ec2/spot/.

Data science platform

A data science platform provides services and APIs that enable effective data science to take place, including explorative data analysis, machine learning model creation and refinement, image and audio processing, natural language processing, and text sentiment analysis.

This is the area where Spark really excels and forms the primary focus of the remainder of this book, exploiting a robust set of native machine learning libraries, unsurpassed parallel graph processing capabilities and a strong community. Spark provides truly scalable opportunities for data science.

The remaining chapters will provide insight into each of these areas, including Chapter 6, Scraping Link-Based External Data, Chapter 7, Building Communities, and Chapter 8, Building a Recommendation System.

Data Access

Data in a data lake is most frequently accessed by data engineers and scientists using the Hadoop ecosystem tools, such as Apache Spark, Pig, Hive, Impala, or Drill. However, there are times when other users, or even other systems, need access to the data and the normal tools are either too technical or do not meet the demanding expectations of the user in terms of real-world latency.

In these circumstances, the data often needs to be copied into data marts or index stores so that it may be exposed to more traditional methods, such as a report or dashboard. This process, which typically involves creating indexes and restructuring data for low-latency access, is known as data egress.

Fortunately, Apache Spark has a wide variety of adapters and connectors into traditional databases, BI tools, and visualization and reporting software. Many of these will be introduced throughout the book.

Data technologies

When Hadoop first started, the word Hadoop referred to the combination of HDFS and the MapReduce processing paradigm, as that was the outline of the original paper (http://research.google.com/archive/mapreduce.html). Since that time, a plethora of technologies has emerged to complement Hadoop, and with the development of Apache YARN we now see other processing paradigms emerge, such as Spark.

Hadoop is now often used as a colloquialism for the entire big data software stack and so it would be prudent at this point to define the scope of that stack for this book. The typical data architecture with a selection of technologies we will visit throughout the book is detailed as follows:

The relationship between these technologies is a dense topic as there are complex interdependencies; for example, Spark depends on GeoMesa, which depends on Accumulo, which depends on Zookeeper and HDFS! Therefore, in order to manage these relationships, there are platforms available, such as Cloudera or Hortonworks HDP (http://hortonworks.com/products/sandbox/). These provide consolidated user interfaces and centralized configuration. The choice of platform is that of the reader; however, it is not recommended to install a few of the technologies initially and then move to a managed platform, as the version problems encountered will be very complex. Therefore, it is usually easier to start with a clean machine and make a decision upfront as to which direction to take.

All of the software we use in this book is platform-agnostic and therefore fits into the general architecture described earlier. It can be installed independently, and it is relatively straightforward to use in single- or multiple-server environments without a managed product.

The role of Apache Spark

In many ways, Apache Spark is the glue that holds these components together. It increasingly represents the hub of the software stack. It integrates with a wide variety of components but none of them are hard-wired. Indeed, even the underlying storage mechanism can be swapped out. Combining this feature with the ability to leverage different processing frameworks means the original Hadoop technologies effectively become components, rather than an imposing framework. The logical diagram of our architecture appears as follows:

As Spark has gained momentum and wide-scale industry acceptance, many of the original Hadoop implementations for various components have been refactored for Spark. Thus, to add further complexity to the picture, there are often several possible ways to programmatically leverage any particular component; not least the imperative and declarative versions depending upon whether an API has been ported from the original Hadoop Java implementation. We have attempted to remain as true as possible to the Spark ethos throughout the remaining chapters.