Mastering Spark for Data Science

Andrew Morgan

Description

Data science seeks to transform the world using data, and this is typically achieved through disrupting and changing real processes in real industries. In order to operate at this level you need to build data science solutions of substance: solutions that solve real problems. Spark has emerged as the big data platform of choice for data scientists due to its speed, scalability, and easy-to-use APIs.

This book deep dives into using Spark to deliver production-grade data science solutions. This process is demonstrated by exploring the construction of a sophisticated global news analysis service that uses Spark to generate continuous geopolitical and current affairs insights. You will learn all about the core Spark APIs and take a comprehensive tour of advanced libraries, including Spark SQL, Spark Streaming, MLlib, and more.

You will be introduced to advanced techniques and methods that will help you to construct commercial-grade data products. Focusing on a sequence of tutorials that deliver a working news intelligence service, you will learn about advanced Spark architectures, how to work with geographic data in Spark, and how to tune Spark algorithms so they scale linearly.




Table of Contents

Mastering Spark for Data Science
Credits
Foreword
About the Authors
About the Reviewer
www.PacktPub.com
Why subscribe?
Customer Feedback
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Downloading the color images of this book
Errata
Piracy
Questions
1. The Big Data Science Ecosystem
Introducing the Big Data ecosystem
Data management
Data management responsibilities
The right tool for the job
Overall architecture
Data Ingestion
Data Lake
Reliable storage
Scalable data processing capability
Data science platform
Data Access
Data technologies
The role of Apache Spark
Companion tools
Apache HDFS
Advantages
Disadvantages
Installation
Amazon S3
Advantages
Disadvantages
Installation
Apache Kafka
Advantages
Disadvantages
Installation
Apache Parquet
Advantages
Disadvantages
Installation
Apache Avro
Advantages
Disadvantages
Installation
Apache NiFi
Advantages
Disadvantages
Installation
Apache YARN
Advantages
Disadvantages
Installation
Apache Lucene
Advantages
Disadvantages
Installation
Kibana
Advantages
Disadvantages
Installation
Elasticsearch
Advantages
Disadvantages
Installation
Accumulo
Advantages
Disadvantages
Installation
Summary
2. Data Acquisition
Data pipelines
Universal ingestion framework
Introducing the GDELT news stream
Discovering GDELT in real-time
Our first GDELT feed
Improving with publish and subscribe
Content registry
Choices and more choices
Going with the flow
Metadata model
Kibana dashboard
Quality assurance
Example 1 - Basic quality checking, no contending users
Example 2 - Advanced quality checking, no contending users
Example 3 - Basic quality checking, 50% utility due to contending users
Summary
3. Input Formats and Schema
A structured life is a good life
GDELT dimensional modeling
GDELT model
First look at the data
Core global knowledge graph model
Hidden complexity
Denormalized models
Challenges with flattened data
Issue 1 - Loss of contextual information
Issue 2 - Re-establishing dimensions
Issue 3 - Including reference data
Loading your data
Schema agility
Reality check
GKG ELT
Position matters
Avro
Spark-Avro method
Pedagogical method
When to perform Avro transformation
Parquet
Summary
4. Exploratory Data Analysis
The problem, principles and planning
Understanding the EDA problem
Design principles
General plan of exploration
Preparation
Introducing mask based data profiling
Introducing character class masks
Building a mask based profiler
Setting up Apache Zeppelin
Constructing a reusable notebook
Exploring GDELT
GDELT GKG datasets
The files
Special collections
Reference data
Exploring the GKG v2.1
The Translingual files
A configurable GCAM time series EDA
Plot.ly charting on Apache Zeppelin
Exploring translation sourced GCAM sentiment with plot.ly
Concluding remarks
A configurable GCAM Spatio-Temporal EDA
Introducing GeoGCAM
Does our spatial pivot work?
Summary
5. Spark for Geographic Analysis
GDELT and oil
GDELT events
GDELT GKG
Formulating a plan of action
GeoMesa
Installing
GDELT Ingest
GeoMesa Ingest
MapReduce to Spark
Geohash
GeoServer
Map layers
CQL
Gauging oil prices
Using the GeoMesa query API
Data preparation
Machine learning
Naive Bayes
Results
Analysis
Summary
6. Scraping Link-Based External Data
Building a web scale news scanner
Accessing the web content
The Goose library
Integration with Spark
Scala compatibility
Serialization issues
Creating a scalable, production-ready library
Build once, read many
Exception handling
Performance tuning
Named entity recognition
Scala libraries
NLP walkthrough
Extracting entities
Abstracting methods
Building a scalable code
Build once, read many
Scalability is also a state of mind
Performance tuning
GIS lookup
GeoNames dataset
Building an efficient join
Offline strategy - Bloom filtering
Online strategy - Hash partitioning
Content deduplication
Context learning
Location scoring
Names de-duplication
Functional programming with Scalaz
Our de-duplication strategy
Using the mappend operator
Simple clean
DoubleMetaphone
News index dashboard
Summary
7. Building Communities
Building a graph of persons
Contact chaining
Extracting data from Elasticsearch
Using the Accumulo database
Setup Accumulo
Cell security
Iterators
Elasticsearch to Accumulo
A graph data model in Accumulo
Hadoop input and output formats
Reading from Accumulo
AccumuloGraphxInputFormat and EdgeWritable
Building a graph
Community detection algorithm
Louvain algorithm
Weighted Community Clustering (WCC)
Description
Preprocessing stage
Initial communities
Message passing
Community back propagation
WCC iteration
Gathering community statistics
WCC Computation
WCC iteration
GDELT dataset
The Bowie effect
Smaller communities
Using Accumulo cell level security
Summary
8. Building a Recommendation System
Different approaches
Collaborative filtering
Content-based filtering
Custom approach
Uninformed data
Processing bytes
Creating a scalable code
From time to frequency domain
Fast Fourier transform
Sampling by time window
Extracting audio signatures
Building a song analyzer
Selling data science is all about selling cupcakes
Using Cassandra
Using the Play framework
Building a recommender
The PageRank algorithm
Building a Graph of Frequency Co-occurrence
Running PageRank
Building personalized playlists
Expanding our cupcake factory
Building a playlist service
Leveraging the Spark job server
User interface
Summary
9. News Dictionary and Real-Time Tagging System
The mechanical Turk
Human intelligence tasks
Bootstrapping a classification model
Learning from Stack Exchange
Building text features
Training a Naive Bayes model
Laziness, impatience, and hubris
Designing a Spark Streaming application
A tale of two architectures
The CAP theorem
The Greeks are here to help
Importance of the Lambda architecture
Importance of the Kappa architecture
Consuming data streams
Creating a GDELT data stream
Creating a Kafka topic
Publishing content to a Kafka topic
Consuming Kafka from Spark Streaming
Creating a Twitter data stream
Processing Twitter data
Extracting URLs and hashtags
Keeping popular hashtags
Expanding shortened URLs
Fetching HTML content
Using Elasticsearch as a caching layer
Classifying data
Training a Naive Bayes model
Thread safety
Predict the GDELT data
Our Twitter mechanical Turk
Summary
10. Story De-duplication and Mutation
Detecting near duplicates
First steps with hashing
Standing on the shoulders of the Internet giants
Simhashing
The Hamming weight
Detecting near duplicates in GDELT
Indexing the GDELT database
Persisting our RDDs
Building a REST API
Area of improvement
Building stories
Building term frequency vectors
The curse of dimensionality, the data science plague
Optimizing KMeans
Story mutation
The Equilibrium state
Tracking stories over time
Building a streaming application
Streaming KMeans
Visualization
Building story connections
Summary
11. Anomaly Detection on Sentiment Analysis
Following the US elections on Twitter
Acquiring data in stream
Acquiring data in batch
The search API
Rate limit
Analysing sentiment
Massaging Twitter data
Using the Stanford NLP
Building the Pipeline
Using Timely as a time series database
Storing data
Using Grafana to visualize sentiment
Number of processed tweets
Give me my Twitter account back
Identifying the swing states
Twitter and the Godwin point
Learning context
Visualizing our model
Word2Graph and Godwin point
Building a Word2Graph
Random walks
A Small Step into sarcasm detection
Building features
#LoveTrumpsHates
Scoring Emojis
Training a KMeans model
Detecting anomalies
Summary
12. TrendCalculus
Studying trends
The TrendCalculus algorithm
Trend windows
Simple trend
User Defined Aggregate Functions
Simple trend calculation
Reversal rule
Introducing the FHLS bar structure
Visualize the data
FHLS with reversals
Edge cases
Zero values
Completing the gaps
Stackable processing
Practical applications
Algorithm characteristics
Advantages
Disadvantages
Possible use cases
Chart annotation
Co-trending
Data reduction
Indexing
Fractal dimension
Streaming proxy for piecewise linear regression
Summary
13. Secure Data
Data security
The problem
The basics
Authentication and authorization
Access control lists (ACL)
Role-based access control (RBAC)
Access
Encryption
Data at rest
Java KeyStore
S3 encryption
Data in transit
Obfuscation/Anonymizing
Masking
Tokenization
Using a Hybrid approach
Data disposal
Kerberos authentication
Use case 1: Apache Spark accessing data in secure HDFS
Use case 2: extending to automated authentication
Use case 3: connecting to secure databases from Spark
Security ecosystem
Apache Sentry
RecordService
Apache Ranger
Apache Knox
Your Secure Responsibility
Summary
14. Scalable Algorithms
General principles
Spark architecture
History of Spark
Moving parts
Driver
SparkSession
Resilient distributed datasets (RDDs)
Executor
Shuffle operation
Cluster Manager
Task
DAG
DAG scheduler
Transformations
Stages
Actions
Task scheduler
Challenges
Algorithmic complexity
Numerical anomalies
Shuffle
Data schemes
Plotting your course
Be iterative
Data preparation
Scale up slowly
Estimate performance
Step through carefully
Tune your analytic
Design patterns and techniques
Spark APIs
Problem
Solution
Example
Summary pattern
Problem
Solution
Example
Expand and Conquer Pattern
Problem
Solution
Lightweight Shuffle
Problem
Solution
Wide Table pattern
Problem
Solution
Example
Broadcast variables pattern
Problem
Solution
Creating a broadcast variable
Accessing a broadcast variable
Removing a broadcast variable
Example
Combiner pattern
Problem
Solution
Example
Optimized cluster
Problem
Solution
Redistribution pattern
Problem
Solution
Example
Salting key pattern
Problem
Solution
Secondary sort pattern
Problem
Solution
Example
Filter overkill pattern
Problem
Solution
Probabilistic algorithms
Problem
Solution
Example
Selective caching
Problem
Solution
Garbage collection
Problem
Solution
Graph traversal
Problem
Solution
Example
Summary

Mastering Spark for Data Science

Mastering Spark for Data Science

Copyright © 2017 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: March 2017

Production reference: 1240317

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham 

B3 2PB, UK.

ISBN 978-1-78588-214-2

www.packtpub.com

Credits

Authors

Andrew Morgan

Antoine Amend

David George

Matthew Hallett

Copy Editor

Safis Editing

Reviewer

Sumit Pal

Project Coordinator

Shweta H Birwatkar 

Commissioning Editor

Akram Hussain

Proofreader

Safis Editing

Acquisition Editor

Vinay Argekar

Indexer

Pratik Shirodkar

Content Development Editor

Amrita Noronha

Graphics

Tania Dutta 

Technical Editor

Sneha Hanchate

Production Coordinator

Arvindkumar Gupta

Foreword

The impact of Spark on the world of data science has been startling. It is less than 3 years since Spark 1.0 was released and yet Spark is already accepted as the omni-competent kernel of any big data architecture. We adopted Spark as our core technology at Barclays around this time and this was considered a bold (read ‘rash’) move. Now it is taken as a given that Spark is your starting point for any big data science project.

As data science has developed both as an activity and as an accepted term, there has been much talk about the unicorn data scientist. This is the unlikely character who can do both the maths and the coding. They are apparently hard to find, and harder to keep. My team likes to think more in terms of three data science competencies: pattern recognition, distributed computation, and automation. If data science is about exploiting insights from data in production, then you need to be able to develop applications with these three competencies in mind from the start. There is no point using a machine learning methodology that won’t scale with your data, or building an analytical kernel that needs to be re-coded to be production quality. And so you need either a unicorn or a unicorn-team (my preference) to do the work.

Spark is your unicorn technology. No other technology expresses analytical concepts so elegantly, moves so effortlessly from small scale to big data, and facilitates production-ready code so naturally as Spark (with the Scala API). We chose Spark because we could compose a model in a few lines, run the same code on the cluster that we had tried out on the laptop, and build robust unit-tested JVM applications that we could be confident would run in business-critical use cases. The combination of functional programming in Scala with the Spark abstractions is uniquely powerful, and choosing it has been a significant cause of the success of the team over the last 3 years.

So here's the conundrum. Why are there no books which present Spark in this way, recognizing that one of the best reasons to work in Spark is its application to production data science? If you scan the bookshelves (or look at tutorials online) all you will find is toy models and a review of the Spark APIs and libs. You will find little or nothing about how Spark fits into the wider architecture, or about how to manage data ETL in a sustainable way.

I think you will find that the practical approach taken by the authors in this book is different. Each chapter takes on a new challenge, and each reads as a voyage of discovery where the outcome was not necessarily known in advance of the exploration. And the value of doing data science properly is set out clearly from the start. This is one of the first books on Spark for grown-ups who want to do real data science that will make an impact on their organisation. I hope you enjoy it.

Harry Powell

Head of Advanced Analytics, Barclays

About the Authors

Andrew Morgan is a specialist in data strategy and its execution, and has deep experience in the supporting technologies, system architecture, and data science that bring it to life. With over 20 years of experience in the data industry, he has worked designing systems for some of its most prestigious players and their global clients – often on large, complex and international projects. In 2013, he founded ByteSumo Ltd, a data science and big data engineering consultancy, and he now works with clients in Europe and the USA. Andrew is an active data scientist, and the inventor of the TrendCalculus algorithm. It was developed as part of his ongoing research project investigating long-range predictions based on machine learning the patterns found in drifting cultural, geopolitical and economic trends. He also sits on the Hadoop Summit EU data science selection committee, and has spoken at many conferences on a variety of data topics. He also enjoys participating in the Data Science and Big Data communities where he lives in London.

This book is dedicated to my wife Steffy, to my children Alice and Adele, and to all my friends and colleagues who have been endlessly supportive. It is also dedicated to the memory of one of my earliest mentors, Professor Ferenc Csillag, under whom I studied at the University of Toronto. Back in 1994, Ferko inspired me with visions of a future where we could use planet-wide data collection and sophisticated algorithms to monitor and optimize the world around us. It was an idea that changed my life, and his dream of a world saved by Big Data Science is one I'm still chasing.

Antoine Amend is a data scientist passionate about big data engineering and scalable computing. The book's theme of torturing astronomical amounts of unstructured data to gain new insights mainly comes from his background in theoretical physics. Graduating in 2008 with an MSc in Astrophysics, he worked for a large consultancy business in Switzerland before discovering the concept of big data in the early days of Hadoop. He has embraced big data technologies ever since, and is now working as the Head of Data Science for cyber security at Barclays Bank. By combining a scientific approach with core IT skills, Antoine qualified two years running for the Big Data World Championships finals held in Austin, TX. He placed in the top 12 in both the 2014 and 2015 editions (out of more than 2,000 competitors), where he additionally won the Innovation Award using the methodologies and technologies explained in this book.

I would like to thank my wife for standing beside me; she has been my motivation for continuing to improve my knowledge and move my career forward. I thank my wonderful kids for always teaching me how to step back whenever it is necessary to clear my mind and get fresh new ideas.

I would like to extend my thanks to my co-workers, especially Dr. Samuel Assefa, Dr. Eirini Spyropoulou and Will Hardman, for their patience in listening to my crazy theories, and to everyone else I have had the pleasure of working with over the past few years. Finally, I want to address a special thanks to all my previous managers and mentors who helped me shape my career in data and analytics; thanks to Manu, Toby, Gary and Harry.

David George is a distinguished distributed computing expert with 15+ years of data systems experience, mainly with globally recognized IT consultancies and brands. Working with core Hadoop technologies since the early days, he has delivered implementations at the largest scale. David always takes a pragmatic approach to software design and values elegance in simplicity.

Today he continues to work as a lead engineer, designing scalable applications for financial sector customers with some of the toughest requirements. His latest projects focus on the adoption of advanced AI techniques for increasing levels of automation across knowledge-based industries.

For Ellie, Shannon, Pauline and Pumpkin – here’s to the sequel!

Matthew Hallett is a Software Engineer and Computer Scientist with over 15 years of industry experience. He is an expert object-oriented programmer and systems engineer with extensive knowledge of low-level programming paradigms and, for the last 8 years, has developed an expertise in Hadoop and distributed programming within mission-critical environments comprising multi-thousand-node data centres. With consultancy experience in distributed algorithms and the implementation of distributed computing architectures in a variety of languages, Matthew is currently a Consultant Data Engineer in the Data Science & Engineering team at a top four audit firm.

Lynnie, thanks for your understanding and sacrifices that afforded me the time during late nights, weekends and holidays to write this book. Nugget, you make it all worthwhile.

We would also like to thank Gary Richardson, Dr David Pryce, Dr Helen Ramsden, Dr Sima Reichenbach and Dr Fabio Petroni for their invaluable advice and guidance that has led to the completion of this huge project – without their help and contributions this book may never have been completed!

About the Reviewer

Sumit Pal is an author who has published SQL on Big Data - Technology, Architecture and Innovations with Apress. Sumit has more than 22 years of experience in the software industry in various roles spanning companies from startups to enterprises.

He is an independent consultant working with big data, data visualization, and data science and a software architect building end-to-end, data-driven analytic systems. 

Sumit has worked for Microsoft (SQL server development team), Oracle (OLAP development team), and Verizon (Big Data analytics team) in a career spanning 22 years. Currently, he works for multiple clients advising them on their data architectures and big data solutions and does hands-on coding with Spark, Scala, Java, and Python. 

Sumit has spoken at big data conferences in Boston, Chicago, Las Vegas, and Vancouver. His Apress book mentioned above was published in October 2016.

He has extensive experience in building scalable systems across the stack, from the middle tier and data tier to visualization for analytics applications, using big data and NoSQL databases. Sumit has deep expertise in database internals, data warehouses, dimensional modeling, and data science with Java, Python, and SQL.

Sumit started his career as part of the SQL Server development team at Microsoft in 1996-97, and then worked as a core server engineer on Oracle Corporation's OLAP development team in Burlington, MA.

Sumit has also worked at Verizon as an associate director for big data architecture, where he strategized, managed, architected, and developed platforms and solutions for analytics and machine learning applications.

Sumit has also served as a chief architect at ModelN/LeapfrogRX (2006-2013), where he architected the middle tier core analytics platform with open source olap engine (Mondrian) on J2EE and solved some complex dimensional ETL, modeling, and performance optimization problems.

Sumit holds an MS and a BS in computer science.

Sumit hiked to Mt. Everest Base Camp in October 2016.

www.PacktPub.com

For support files and downloads related to your book, please visit www.PacktPub.com.

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

https://www.packtpub.com/mapt

Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.

Why subscribe?

Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser

Customer Feedback

Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial process. To help us improve, please leave us an honest review on this book's Amazon page at https://www.amazon.in/Mastering-Spark-Science-Andrew-Morgan-ebook/dp/B01BWNXA82?_encoding=UTF8&keywords=mastering%20spark%20for%20data%20science&qid=1490239942&ref_=sr_1_1&sr=8-1.

If you'd like to join our team of regular reviewers, you can e-mail us at [email protected]. We award our regular reviewers with free eBooks and videos in exchange for their valuable feedback. Help us be relentless in improving our products!

Preface

The purpose of data science is to transform the world using data, and this goal is mainly achieved through disrupting and changing real processes in real industries. To operate at that level we need to be able to build data science solutions of substance; ones that solve real problems, and which can run reliably enough for people to trust and act upon.

This book explains how to use Spark to deliver production grade data science solutions that are innovative, disruptive, and reliable enough to be trusted. Whilst writing this book it was the authors’ intention to deliver a work that transcends the traditional cookbook style: providing not just examples of code, but developing the techniques and mind-set that are needed to explore content like a master; as they say, Content is King! Readers will notice that the book has a heavy emphasis on news analytics, and occasionally pulls in other datasets such as Tweets and financial data. This emphasis on news is not an accident; much effort has been spent on trying to focus on datasets that offer context at a global scale.

The implicit problem that this book is dedicated to is the lack of data offering proper context around how and why people make decisions. Often, directly accessible data sources are very focused on problem specifics and, as a consequence, can be very light on broader datasets offering the behavioral context needed to really understand what’s driving the decisions that people make.

Consider a simple example where website users' key information, such as age, gender, location, shopping behavior, purchases, and so on, is known; we might use this data to recommend products based on what others "like them" have been buying.

But to be exceptional, more context is required as to why people behave as they do. When news reports suggest that a massive Atlantic hurricane is approaching the Florida coastline and could reach the coast in, say, 36 hours, perhaps we should be recommending products people might need: items such as USB-enabled battery packs for keeping phones charged, candles, flashlights, water purifiers, and the like. By understanding the context in which decisions are being made, we can conduct better science.

Therefore, whilst this book certainly contains useful code and, in many cases, unique implementations, it further dives deep into the techniques and skills required to truly master data science; some of which are often overlooked or not considered at all. Drawing on many years of commercial experience, the authors have leveraged their extensive knowledge to bring the real, and exciting world of data science to life.

What this book covers

Chapter 1, The Big Data Science Ecosystem, this chapter is an introduction to an approach and accompanying ecosystem for achieving success with data at scale. It focuses on the data science tools and technologies that will be used in later chapters as well as introducing the environment and how to configure it appropriately. Additionally it explains some of the non-functional considerations relevant to the overall data architecture and long-term success.

Chapter 2, Data Acquisition, as a data scientist, one of the most important tasks is to accurately load data into a data science platform. Rather than having uncontrolled, ad hoc processes, this chapter explains how a general data ingestion pipeline in Spark can be constructed that serves as a reusable component across many feeds of input data.

Chapter 3, Input Formats and Schema, this chapter demonstrates how to load data from its raw format onto different schemas, therefore enabling a variety of different kinds of downstream analytics to be run over the same data. With this in mind, we will look at the traditionally well-understood area of data schemas. We will cover key areas of traditional database modeling and explain how some of these cornerstone principles are still applicable to Spark today. In addition, whilst honing our Spark skills we will analyze the GDELT data model and show how to store this large dataset in an efficient and scalable manner.

Chapter 4, Exploratory Data Analysis, a common misconception is that an EDA is only for discovering the statistical properties of a dataset and providing insights about how it can be exploited. In practice, this isn’t the full story. A full EDA will extend that idea, and include a detailed assessment of the “feasibility of using this Data Feed in production.” It requires us to also understand how we would specify a production grade data loading routine for this dataset, one that might potentially run in a “lights out mode” for many years. This chapter offers a rapid method for doing Data Quality assessment using a “data profiling” technique to accelerate the process.

Chapter 5, Spark for Geographic Analysis, geographic processing is a powerful new use case for Spark, and this chapter demonstrates how to get started. The aim of this chapter is to explain how Data Scientists can process geographic data, using Spark, to produce powerful map based views of very large datasets. We demonstrate how to process spatio-temporal datasets easily via Spark integrations with Geomesa, which helps turn Spark into a sophisticated geographic processing engine. The chapter later leverages this spatio-temporal data to apply machine learning with a view to predicting oil prices.

Chapter 6, Scraping Link-Based External Data, this chapter aims to explain a common pattern for enhancing local data with external content found at URLs or over APIs, such as GDELT and Twitter. We offer a tutorial using the GDELT news index service as a source of news URLS, demonstrating how to build a web scale News Scanner that scrapes global breaking news of interest from the internet. We further explain how to use the specialist web-scraping component in a way that overcomes the challenges of scale, followed by the summary of this chapter.

Chapter 7, Building Communities, this chapter aims to address a common use case in data science and big data. With more and more people interacting together, communicating, exchanging information, or simply sharing a common interest in different topics, the entire world can be represented as a Graph. A data scientist must be able to detect communities, find influencers / top contributors, and detect possible anomalies.

Chapter 8, Building a Recommendation System, if one were to choose an algorithm to showcase data science to the public, a recommendation system would certainly be in the frame. Today, recommendation systems are everywhere; the reason for their popularity is down to their versatility, usefulness and broad applicability. In this chapter, we will demonstrate how to recommend music content using raw audio signals.

Chapter 9, News Dictionary and Real-Time Tagging System, while a hierarchical data warehouse stores data in files and folders, a typical Hadoop based system relies on a flat architecture to store your data. Without proper data governance or a clear understanding of what your data is all about, there is an undeniable chance of turning data lakes into swamps, where an interesting dataset such as GDELT would be nothing more than a folder containing a vast amount of unstructured text files. In this chapter, we will describe an innovative way of labeling incoming GDELT data in an unsupervised way and in near real time.

Chapter 10, Story De-duplication and Mutation, in this chapter, we de-duplicate and index the GDELT database into stories, before tracking stories over time and understanding the links between them, how they may mutate and if they could lead to any subsequent events in the near future. Core to this chapter is the concept of Simhash to detect near duplicates and building vectors to reduce dimensionality using Random Indexing.

Chapter 11, Anomaly Detection and Sentiment Analysis, perhaps the most notable occurrence of the year 2016 was the tense US presidential election and its eventual outcome: the election of President Donald Trump, a campaign that will long be remembered; not least for its unprecedented use of social media and the stirring up of passion among its users, most of whom made their feelings known through the use of hashtags. In this chapter, instead of trying to predict the outcome itself, we will aim to detect abnormal tweets during the US election using a real-time Twitter feed.

Chapter 12, TrendCalculus, long before the concept of “what’s trending” became a popular topic of study by data scientists, there was an older one that is still not well served by data science; it is that of Trends. Presently, the analysis of trends, if it can be called that, is primarily carried out by people “eyeballing” time series charts and offering interpretations. But what is it that people’s eyes are doing? This chapter describes an implementation in Apache Spark of a new algorithm for studying trends numerically: TrendCalculus.

Chapter 13, Secure Data, throughout this book we visit many areas of data science, often straying into those that are not traditionally associated with a data scientist’s core working knowledge. In this chapter we will visit another of those often overlooked fields, Secure Data; more specifically, how to protect your data and analytic results at all stages of the data life cycle. Core to this chapter is the construction of a commercial grade encryption codec for Spark.

Chapter 14, Scalable Algorithms, in this chapter we learn about why sometimes even basic algorithms, despite working at small scale, will often fail in “big data”. We’ll see how to avoid issues when writing Spark jobs that run over massive Datasets and will learn about the structure of algorithms and how to write custom data science analytics that scale over petabytes of data. The chapter features areas such as: parallelization strategies, caching, shuffle strategies, garbage collection optimization and probabilistic models; explaining how these can help you to get the most out of the Spark paradigm.

What you need for this book

Spark 2.0 is used throughout the book, along with Scala 2.11, Maven, and Hadoop. This is the basic environment required; many other technologies are also used, and these are introduced in the relevant chapters.

Who this book is for

We presume that the data scientists reading this book are knowledgeable about data science, common machine learning methods, and popular data science tools, and have, in the course of their work, run proof-of-concept studies and built prototypes. We offer this audience a book that introduces advanced techniques and methods for building data science solutions, showing them how to construct commercial-grade data products.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book: what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of. To send us general feedback, simply e-mail [email protected], and mention the book's title in the subject of your message. If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit  http://www.packtpub.com/support and register to have the files e-mailed directly to you.

You can download the code files by following these steps:

1. Log in or register to our website using your e-mail address and password.
2. Hover the mouse pointer on the SUPPORT tab at the top.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box.
5. Select the book for which you're looking to download the code files.
6. Choose from the drop-down menu where you purchased this book from.
7. Click on Code Download.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

WinRAR / 7-Zip for Windows
Zipeg / iZip / UnRarX for Mac
7-Zip / PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Mastering-Spark-for-Data-Science. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Downloading the color images of this book

We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from https://www.packtpub.com/sites/default/files/downloads/MasteringSparkforDataScience_ColorImages.pdf.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books, whether in the text or the code, we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.

To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at [email protected] with a link to the suspected pirated material.

We appreciate your help in protecting our authors and our ability to bring you valuable content.

Questions

If you have a problem with any aspect of this book, you can contact us at [email protected], and we will do our best to address the problem.

Chapter 1. The Big Data Science Ecosystem

As a data scientist, you'll no doubt be very familiar with handling files and processing data, perhaps even in large amounts. However, as I'm sure you will agree, doing anything more than a simple analysis over a single type of data requires a method of organizing and cataloguing data so that it can be managed effectively. Indeed, this is the cornerstone of a great data scientist. As data volume and complexity increase, a consistent and robust approach can be the difference between generalized success and over-fitted failure!

This chapter is an introduction to an approach and ecosystem for achieving success with data at scale. It focuses on the data science tools and technologies. It introduces the environment, and how to configure it appropriately, but also explains some of the nonfunctional considerations relevant to the overall data architecture. While there is little actual data science at this stage, it provides the essential platform to pave the way for success in the rest of the book.

In this chapter, we will cover the following topics:

Data management responsibilities
Data architecture
Companion tools

Introducing the Big Data ecosystem

Data management is of particular importance, especially when the data is in flux; either constantly changing or being routinely produced and updated. What is needed in these cases is a way of storing, structuring, and auditing data that allows for the continuous processing and refinement of models and results.

Here, we describe how to best hold and organize your data to integrate with Apache Spark and related tools within the context of a data architecture that is broad enough to fit the everyday requirement.

Data management

Even if, in the medium term, you only intend to play around with a bit of data at home, without proper data management efforts will, more often than not, escalate to the point where it is easy to lose track of where you are, and mistakes will happen. Taking the time to think about the organization of your data, and in particular its ingestion, is crucial. There's nothing worse than waiting for a long-running analytic to complete, collating the results and producing a report, only to discover that you used the wrong version of the data, or that the data is incomplete, has missing fields, or, even worse, that you deleted your results!

The bad news is that, despite its importance, data management is an area that is consistently overlooked in both commercial and non-commercial ventures, with precious few off-the-shelf solutions available. The good news is that it is much easier to do great data science using the fundamental building blocks that this chapter describes.

Data management responsibilities

When we think about data, it is easy to overlook the true extent of the scope of the areas we need to consider. Indeed, most data "newbies" think about the scope in this way:

Obtain data
Place the data somewhere (anywhere)
Use the data
Throw the data away

In reality, there are a large number of other considerations; it is our combined responsibility to determine which ones apply to a given piece of work. The following data management building blocks assist in answering or tracking some important questions about the data:

File integrity
- Is the data file complete?
- How do you know?
- Was it part of a set?
- Is the data file correct?
- Was it tampered with in transit? (a minimal checksum sketch follows this list)

Data integrity
- Is the data as expected?
- Are all of the fields present?
- Is there sufficient metadata?
- Is the data quality sufficient?
- Has there been any data drift?

Scheduling
- Is the data routinely transmitted?
- How often does the data arrive?
- Was the data received on time?
- Can you prove when the data was received?
- Does it require acknowledgement?

Schema management
- Is the data structured or unstructured?
- How should the data be interpreted?
- Can the schema be inferred?
- Has the data changed over time?
- Can the schema be evolved from the previous version?

Version management
- What is the version of the data?
- Is the version correct?
- How do you handle different versions of the data?
- How do you know which version you're using?

Security
- Is the data sensitive?
- Does it contain personally identifiable information (PII)?
- Does it contain personal health information (PHI)?
- Does it contain payment card information (PCI)?
- How should I protect the data?
- Who is entitled to read/write the data?
- Does it require anonymization/sanitization/obfuscation/encryption?

Disposal
- How do we dispose of the data?
- When do we dispose of the data?
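As a minimal sketch of the file-integrity questions above, the following Scala snippet compares a file's SHA-256 digest against a value published by the data provider. The file name and expected digest here are placeholders, not values from any real feed, and for very large files you would stream the bytes rather than read them all into memory.

import java.nio.file.{Files, Paths}
import java.security.MessageDigest

// Minimal file-integrity check: compute a SHA-256 digest and compare it with
// the value the data provider published alongside the file.
object FileIntegrityCheck {

  def sha256Hex(path: String): String = {
    val bytes  = Files.readAllBytes(Paths.get(path))   // fine for a sketch; stream for very large files
    val digest = MessageDigest.getInstance("SHA-256").digest(bytes)
    digest.map("%02x".format(_)).mkString
  }

  def main(args: Array[String]): Unit = {
    // Placeholder values for illustration only.
    val expected = "0e5751c026e543b2e8ab2eb06099daa1d1e5df47778f7787faab45cdf12fe3a8"
    val actual   = sha256Hex("incoming/feed-2017-03-01.tsv")
    println(if (actual == expected) "File integrity check passed"
            else "Checksum mismatch - reject the file")
  }
}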

If, after all that, you are still not convinced, before you go ahead and write that bash script using the gawk and crontab commands, keep reading and you will soon see that there is a far quicker, more flexible, and safer method that allows you to start small and incrementally create commercial-grade ingestion pipelines!

The right tool for the job

Apache Spark is the emerging de facto standard for scalable data processing. At the time of writing this book, it is the most active Apache Software Foundation (ASF) project and has a rich variety of companion tools available. There are new projects appearing every day, many of which overlap in functionality. So it takes time to learn what they do and decide whether they are appropriate to use. Unfortunately, there's no quick way around this. Usually, specific trade-offs must be made on a case-by-case basis; there is rarely a one-size-fits-all solution. Therefore, the reader is encouraged to explore the available tools and choose wisely!

Various technologies are introduced throughout this book, and the hope is that they will give the reader a taste of some of the more useful and practical ones, to a level where they may start utilizing them in their own projects. Further, we hope to show that if the code is written carefully, technologies may be interchanged through clever use of Application Programming Interfaces (APIs) (or higher-order functions in Spark Scala), even when a decision proves to be incorrect.
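As a rough illustration of that idea (not code from the book's bundle), the following Scala sketch uses a higher-order function to decouple an analytic from the technology that persists its results, so the backend can be swapped without touching the analytic itself. The object name, input path, and writer functions are all assumptions made for the example.

import org.apache.spark.sql.{DataFrame, SparkSession}

object InterchangeableSinks {

  // The analytic only knows it must hand its output to *some* writer function.
  def runAnalytic(spark: SparkSession, input: String)(write: DataFrame => Unit): Unit = {
    val counts = spark.read.textFile(input).toDF("line").groupBy("line").count()
    write(counts) // the persistence technology is injected, not hard-wired
  }

  // Two interchangeable writer implementations.
  val toParquet: String => DataFrame => Unit =
    path => df => df.write.mode("overwrite").parquet(path)

  val toConsole: DataFrame => Unit =
    df => df.show(20, truncate = false)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("sink-demo").getOrCreate()
    // Swap toConsole for toParquet("/tmp/counts") without changing runAnalytic.
    runAnalytic(spark, "data/sample.txt")(toConsole)
    spark.stop()
  }
}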

Overall architecture

Let's start with a high-level introduction to data architectures: what they do, why they're useful, when they should be used, and how Apache Spark fits in.

At their most general, modern data architectures have four basic characteristics:

Data Ingestion
Data Lake
Data Science
Data Access

Let's introduce each of these now, so that we can go into more detail in the later chapters.

Data Ingestion

Traditionally, data is ingested under strict rules and formatted according to a predetermined schema. This process is known as Extract, Transform, Load (ETL), and is still a very common practice supported by a large array of commercial tools as well as some open source products.

The ETL approach favors performing up-front checks, which ensure data quality and schema conformance, in order to simplify follow-on online analytical processing. It is particularly suited to handling data with a specific set of characteristics, namely, those that relate to a classical entity-relationship model. However, it is not suitable for all scenarios.

During the big data revolution, there was a metaphorical explosion of demand for structured, semi-structured, and unstructured data, leading to the creation of systems that were required to handle data with a different set of characteristics. These came to be defined by the phrase the 4 Vs: Volume, Variety, Velocity, and Veracity (http://www.ibmbigdatahub.com/infographic/four-vs-big-data). While traditional ETL methods floundered under this new burden, because they either required too much time to process the vast quantities of data or were too rigid in the face of change, a different approach emerged. Enter the schema-on-read paradigm. Here, data is ingested in its original form (or at least very close to it), and the details of normalization, validation, and so on are handled at the time of analytical processing.

This is typically referred to as Extract, Load, Transform (ELT), a reordering of the steps of the traditional approach:

This approach values the delivery of data in a timely fashion, delaying the detailed processing until it is absolutely required. In this way, a data scientist can gain access to the data immediately, searching for insight using a range of techniques not available with a traditional approach.

Although we only provide a high-level overview here, this approach is so important that we will explore it further throughout the book by implementing various schema-on-read algorithms. We will assume the ELT method for data ingestion, that is to say we encourage the loading of data at the user's convenience. This may be every n minutes, overnight, or during times of low usage. The data can then be checked for integrity, quality, and so forth by running batch processing jobs offline, again at the user's discretion.
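To make the ELT idea concrete, here is a minimal schema-on-read sketch in Scala: the feed is landed untouched, and a schema is imposed only when an analysis actually needs it. The file paths, field names, and schema are illustrative assumptions, not the book's GDELT examples.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructField, StructType, TimestampType}

object SchemaOnRead {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("elt-demo").getOrCreate()

    // Extract + Load: persist the feed exactly as it arrived.
    spark.read.text("incoming/feed-2017-03-01.tsv")
      .write.mode("append").text("lake/raw/feed")

    // Transform (later, at read time): impose structure only when required.
    val schema = StructType(Seq(
      StructField("eventId", StringType),
      StructField("eventTime", TimestampType),
      StructField("payload", StringType)
    ))

    val structured = spark.read
      .schema(schema)
      .option("sep", "\t")
      .csv("lake/raw/feed")

    structured.createOrReplaceTempView("feed")
    spark.sql("SELECT eventId, eventTime FROM feed WHERE payload IS NOT NULL").show()
    spark.stop()
  }
}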

Data Lake

A data lake is a convenient, ubiquitous store of data. It is useful because it provides a number of key benefits, primarily:

Reliable storageScalable data processing capability

Let's take a brief look at each of these.

Reliable storage

There is a good choice of underlying storage implementations for a data lake; these include the Hadoop Distributed File System (HDFS), MapR-FS, and Amazon AWS S3.

Throughout the book, HDFS will be the assumed storage implementation. Also, in this book the authors use a distributed Spark setup, deployed on Yet Another Resource Negotiator (YARN) running inside a Hortonworks HDP environment. Therefore, HDFS is the technology used, unless otherwise stated. If you are not familiar with any of these technologies, they are discussed further on in this chapter.

In any case, it's worth knowing that Spark references HDFS locations natively, accesses local file locations via the prefix file:// and references S3 locations via the prefix s3a://.
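A short sketch of those path prefixes follows; the bucket and file names are placeholders, and s3a:// access additionally assumes the hadoop-aws libraries and AWS credentials are configured on the cluster.

import org.apache.spark.sql.SparkSession

object StoragePrefixes {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("storage-demo").getOrCreate()

    val fromHdfs  = spark.read.text("hdfs:///data/news/gkg.csv")    // HDFS (or simply /data/news/gkg.csv)
    val fromLocal = spark.read.text("file:///tmp/sample/gkg.csv")   // local filesystem
    val fromS3    = spark.read.text("s3a://my-bucket/news/gkg.csv") // Amazon S3 via the s3a connector

    println(fromHdfs.count() + fromLocal.count() + fromS3.count())
    spark.stop()
  }
}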

Scalable data processing capability

Clearly, Apache Spark will be our data processing platform of choice. In addition, as you may recall, Spark allows the user to execute code in their preferred environment, be that local, standalone, YARN, or Mesos, by configuring the appropriate cluster manager in the master URL. Incidentally, this can be done in any one of three locations, as illustrated in the sketch after this list:

Using the --master option when issuing the spark-submit command
Adding the spark.master property in the conf/spark-defaults.conf file
Invoking the setMaster method on the SparkConf object
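The sketch below shows the third option programmatically, with the first two included as comments for comparison. The app name, jar name, and the local[*] master value are example assumptions only.

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Option 1 (command line):  spark-submit --master yarn myapp.jar
// Option 2 (configuration): conf/spark-defaults.conf ->  spark.master  yarn
object MasterConfig {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("master-demo")
      .setMaster("local[*]") // option 3: SparkConf.setMaster

    val spark = SparkSession.builder().config(conf).getOrCreate()
    println(s"Running against master: ${spark.sparkContext.master}")
    spark.stop()
  }
}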

If you're not familiar with HDFS, or if you do not have access to a cluster, then you can run a local Spark instance using the local filesystem, which is useful for testing. However, beware that there are often bad behaviors that only appear when executing on a cluster. So, if you're serious about Spark, it's worth investing in a distributed cluster manager; why not try Spark standalone cluster mode, or Amazon AWS EMR? For example, Amazon offers a number of affordable paths to cloud computing; you can explore the idea of spot instances at https://aws.amazon.com/ec2/spot/.

Data science platform

A data science platform provides services and APIs that enable effective data science to take place, including explorative data analysis, machine learning model creation and refinement, image and audio processing, natural language processing, and text sentiment analysis.

This is the area where Spark really excels and forms the primary focus of the remainder of this book, exploiting a robust set of native machine learning libraries, unsurpassed parallel graph processing capabilities and a strong community. Spark provides truly scalable opportunities for data science.

The remaining chapters will provide insight into each of these areas, including Chapter 6, Scraping Link-Based External Data, Chapter 7, Building Communities, and Chapter 8, Building a Recommendation System.

Data Access

Data in a data lake is most frequently accessed by data engineers and scientists using the Hadoop ecosystem tools, such as Apache Spark, Pig, Hive, Impala, or Drill. However, there are times when other users, or even other systems, need access to the data and the normal tools are either too technical or do not meet the demanding expectations of the user in terms of real-world latency.

In these circumstances, the data often needs to be copied into data marts or index stores so that it may be exposed to more traditional methods, such as a report or dashboard. This process, which typically involves creating indexes and restructuring data for low-latency access, is known as data egress.

Fortunately, Apache Spark has a wide variety of adapters and connectors into traditional databases, BI tools, and visualization and reporting software. Many of these will be introduced throughout the book.

Data technologies

When Hadoop first started, the word Hadoop referred to the combination of HDFS and the MapReduce processing paradigm, as that was the outline of the original paper (http://research.google.com/archive/mapreduce.html). Since that time, a plethora of technologies has emerged to complement Hadoop, and with the development of Apache YARN we now see other processing paradigms emerge, such as Spark.

Hadoop is now often used as a colloquialism for the entire big data software stack and so it would be prudent at this point to define the scope of that stack for this book. The typical data architecture with a selection of technologies we will visit throughout the book is detailed as follows:

The relationship between these technologies is a dense topic as there are complex interdependencies; for example, Spark depends on GeoMesa, which depends on Accumulo, which depends on Zookeeper and HDFS! Therefore, in order to manage these relationships, there are platforms available, such as Cloudera or Hortonworks HDP (http://hortonworks.com/products/sandbox/). These provide consolidated user interfaces and centralized configuration. The choice of platform is that of the reader; however, it is not recommended to install a few of the technologies initially and then move to a managed platform, as the version problems encountered will be very complex. Therefore, it is usually easier to start with a clean machine and make a decision upfront as to which direction to take.

All of the software we use in this book is platform-agnostic and therefore fits into the general architecture described earlier. It can be installed independently, and it is relatively straightforward to use in single- or multiple-server environments without a managed product.

The role of Apache Spark

In many ways, Apache Spark is the glue that holds these components together. It increasingly represents the hub of the software stack. It integrates with a wide variety of components but none of them are hard-wired. Indeed, even the underlying storage mechanism can be swapped out. Combining this feature with the ability to leverage different processing frameworks means the original Hadoop technologies effectively become components, rather than an imposing framework. The logical diagram of our architecture appears as follows:

As Spark has gained momentum and wide-scale industry acceptance, many of the original Hadoop implementations for various components have been refactored for Spark. Thus, to add further complexity to the picture, there are often several possible ways to programmatically leverage any particular component; not least the imperative and declarative versions depending upon whether an API has been ported from the original Hadoop Java implementation. We have attempted to remain as true as possible to the Spark ethos throughout the remaining chapters.