Machine Learning with Apache Spark Quick Start Guide
Uncover patterns, derive actionable insights, and learn from big data using MLlib
Jillur Quddus
BIRMINGHAM - MUMBAI

Machine Learning with Apache Spark Quick Start Guide

Copyright © 2018 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Commissioning Editor: Amey Varangaonkar
Acquisition Editor: Siddharth Mandal
Content Development Editor: Mohammed Yusuf Imaratwale
Technical Editor: Diksha Wakode
Copy Editor: Safis Editing
Project Coordinator: Kinjal Bari
Proofreader: Safis Editing
Indexer: Rekha Nair
Graphics: Alishon Mendonsa
Production Coordinator: Aparna Bhagat

First published: December 2018

Production reference: 1211218

Published by Packt Publishing Ltd., Livery Place, 35 Livery Street, Birmingham B3 2PB, UK.

ISBN 978-1-78934-656-5

www.packtpub.com

To my wife and best friend, Jane, for making life worth living. And to the memory of my parents, for their sacrifices and giving me the freedom to explore my imagination.
– Jillur Quddus
mapt.io

Mapt is an online digital library that gives you full access to over 5,000 books and videos, as well as industry-leading tools to help you plan your personal development and advance your career. For more information, please visit our website.

Why subscribe?

Spend less time learning and more time coding with practical ebooks and videos from over 4,000 industry professionals

Improve your learning with Skill Plans built especially for you

Get a free ebook or video every month

Mapt is fully searchable

Copy and paste, print, and bookmark content

Packt.com

Did you know that Packt offers ebook versions of every book published, with PDF and ePub files available? You can upgrade to the ebook version at www.packt.com and as a print book customer, you are entitled to a discount on the ebook copy. Get in touch with us at [email protected] for more details.

At www.packt.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and ebooks.

Contributors

About the author

Jillur Quddus is a lead technical architect, polyglot software engineer and data scientist with over 10 years of hands-on experience in architecting and engineering distributed, scalable, high-performance, and secure solutions used to combat serious organized crime, cybercrime, and fraud. Jillur has extensive experience of working within central government, intelligence, law enforcement, and banking, and has worked across the world including in Japan, Singapore, Malaysia, Hong Kong, and New Zealand. Jillur is both the founder of Keisan, a UK-based company specializing in open source distributed technologies and machine learning, and the lead technical architect at Methods, the leading digital transformation partner for the UK public sector.

First and foremost, I would like to thank my wonderful and gorgeous wife, Jane, for all her love, support, and general awesomeness. This book, and indeed all the moments of happiness in my life, would not have been possible without her. Thank you also to my amazing brother, Gipil Quddus, and life-long friends Rie Tokuoka, Tatsuya Mukai, Nori Tokuoka and (the incredibly lovely) Shan Gao for fuelling my imagination with weird and wonderful ideas.

About the reviewer

Emmanuel Asimadi is a data scientist currently focusing on natural language processing as applied to the domain of customer experience. He has an MSc in cloud computing from the University of Leicester, UK, and over a decade of experience in a variety of analytic roles in both academic research and industry. His varied portfolio includes projects in Apache Spark, natural language processing, the semantic web, and telecommunications operations management, involving the creation and maintenance of ETL services that support telecom infrastructure operations and maintenance using data from thousands of nodes in the field.

Emmanuel also co-authored a video called Advanced Machine Learning with Spark and has made a significant contribution to the development of the video Big Data Analytics Projects with Apache Spark, which was published recently by Packt Publishing.

I love everything to do with data and the value it delivers, and I feel very fortunate to be part of its creation and the first-hand experience thereof. I am a big fan of Apache Spark because of its unified approach and simple APIs, which align very well with my general philosophy of teaching – that even the most complex concepts can be explained simply – and it is great to see this applied in this book.

Packt is searching for authors like you

If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.

Table of Contents

Title Page

Copyright and Credits

Machine Learning with Apache Spark Quick Start Guide

Dedication

About Packt

Why subscribe?

Packt.com

Contributors

About the author

About the reviewer

Packt is searching for authors like you

Preface

Who this book is for

What this book covers

To get the most out of this book

Download the example code files

Conventions used

Get in touch

Reviews

The Big Data Ecosystem

A brief history of data

Vertical scaling

Master/slave architecture

Sharding

Data processing and analysis

Data becomes big

Big data ecosystem

Horizontal scaling

Distributed systems

Distributed data stores

Distributed filesystems

Distributed databases

NoSQL databases

Document databases

Columnar databases

Key-value databases

Graph databases

CAP theorem

Distributed search engines

Distributed processing

MapReduce

Apache Spark

RDDs, DataFrames, and datasets

RDDs

DataFrames

Datasets

Jobs, stages, and tasks

Job

Stage

Tasks

Distributed messaging

Distributed streaming

Distributed ledgers

Artificial intelligence and machine learning

Cloud computing platforms

Data insights platform

Reference logical architecture

Data sources layer

Ingestion layer

Persistent data storage layer

Data processing layer

Serving data storage layer

Data intelligence layer

Unified access layer

Data insights and reporting layer

Platform governance, management, and administration

Open source implementation

Summary

Setting Up a Local Development Environment

CentOS Linux 7 virtual machine

Java SE Development Kit 8

Scala 2.11

Anaconda 5 with Python 3

Basic conda commands

Additional Python packages

Jupyter Notebook

Starting Jupyter Notebook

Troubleshooting Jupyter Notebook

Apache Spark 2.3

Spark binaries

Local working directories

Spark configuration

Spark properties

Environmental variables

Standalone master server

Spark worker node

PySpark and Jupyter Notebook

Apache Kafka 2.0

Kafka binaries

Local working directories

Kafka configuration

Start the Kafka server

Testing Kafka

Summary

Artificial Intelligence and Machine Learning

Artificial intelligence

Machine learning

Supervised learning

Unsupervised learning

Reinforced learning

Deep learning

Natural neuron

Artificial neuron

Weights

Activation function

Heaviside step function

Sigmoid function

Hyperbolic tangent function

Artificial neural network

Single-layer perceptron

Multi-layer perceptron

NLP

Cognitive computing

Machine learning pipelines in Apache Spark

Summary

Supervised Learning Using Apache Spark

Linear regression

Case study – predicting bike sharing demand

Univariate linear regression

Residuals

Root mean square error

R-squared

Univariate linear regression in Apache Spark

Multivariate linear regression

Correlation

Multivariate linear regression in Apache Spark

Logistic regression

Threshold value

Confusion matrix

Receiver operator characteristic curve

Area under the ROC curve

Case study – predicting breast cancer

Classification and Regression Trees

Case study – predicting political affiliation

Random forests

K-Fold cross validation

Summary

Unsupervised Learning Using Apache Spark

Clustering

Euclidean distance

Hierarchical clustering

K-means clustering

Case study – detecting brain tumors

Feature vectors from images

Image segmentation

K-means cost function

K-means clustering in Apache Spark

Principal component analysis

Case study – movie recommendation system

Covariance matrix

Identity matrix

Eigenvectors and eigenvalues

PCA in Apache Spark

Summary

Natural Language Processing Using Apache Spark

Feature transformers

Document

Corpus

Preprocessing pipeline

Tokenization

Stop words

Stemming

Lemmatization

Normalization

Feature extractors

Bag of words

Term frequency–inverse document frequency

Case study – sentiment analysis

NLP pipeline

NLP in Apache Spark

Summary

Deep Learning Using Apache Spark

Artificial neural networks

Multilayer perceptrons

MLP classifier

Input layer

Hidden layers

Output layer

Case study 1 – OCR

Input data

Training architecture

Detecting patterns in the hidden layer

Classifying in the output layer

MLPs in Apache Spark

Convolutional neural networks

End-to-end neural architecture

Input layer

Convolution layers

Rectified linear units

Pooling layers

Fully connected layer

Output layer

Case study 2 – image recognition

InceptionV3 via TensorFlow

Deep learning pipelines for Apache Spark

Image library

PySpark image recognition application

Spark submit

Image-recognition results

Case study 3 – image prediction

PySpark image-prediction application

Image-prediction results

Summary

Real-Time Machine Learning Using Apache Spark

Distributed streaming platform

Distributed stream processing engines

Streaming using Apache Spark

Spark Streaming (DStreams)

Structured Streaming

Stream processing pipeline

Case study – real-time sentiment analysis

Start Zookeeper and Kafka Servers

Kafka topic

Twitter developer account

Twitter apps and the Twitter API

Application configuration

Kafka Twitter producer application

Preprocessing and feature vectorization pipelines

Kafka Twitter consumer application

Summary

Other Books You May Enjoy

Leave a review - let other readers know what you think

Preface

Every person and every organization in the world manages data, whether they realize it or not. Data is used to describe the world around us and can be used for almost any purpose, from analyzing consumer habits in order to recommend the latest products and services to fighting disease, climate change, and serious organized crime. Ultimately, we manage data in order to derive value from it, whether personal or business value, and many organizations around the world have traditionally invested in tools and technologies to help them process their data faster and more efficiently in order to deliver actionable insights.

But we now live in a highly interconnected world driven by mass data creation and consumption, where data is no longer rows and columns restricted to a spreadsheet but an organic and evolving asset in its own right. With this realization comes major challenges for organizations as we enter the intelligence-driven fourth industrial revolution—how do we manage the sheer amount of data being created every second in all of its various formats (think not only spreadsheets and databases, but also social media posts, images, videos, music, online forums and articles, computer log files, and more)? And once we know how to manage all of this data, how do we know what questions to ask of it in order to derive real personal or business value?

The focus of this book is to help us answer those questions in a hands-on manner starting from first principles. We introduce the latest cutting-edge technologies (the big data ecosystem, including Apache Spark) that can be used to manage and process big data. We then explore advanced classes of algorithms (machine learning, deep learning, natural language processing, and cognitive computing) that can be applied to the big data ecosystem to help us uncover previously hidden relationships in order to understand what the data is telling us so that we may ultimately solve real-world challenges.

Who this book is for

This book is aimed at business analysts, data analysts, data scientists, data engineers, and software engineers for whom a typical day may currently involve analyzing data using spreadsheets or relational databases, perhaps using VBA, Structured Query Language (SQL), or even Python to compute statistical aggregations (such as averages) and to generate graphs, charts, pivot tables, and other reporting mediums.

With the explosion of data in all of its various formats and frequencies, perhaps you are now challenged with not only managing all of that data, but understanding what it is telling you. You have most likely heard the terms big data, artificial intelligence, and machine learning, but now wish to understand where to start in order to take advantage of these new technologies and frameworks, not just in theory but in practice as well, to solve your business challenges. If this sounds familiar, then this book is for you!

What this book covers

Chapter 1, The Big Data Ecosystem, provides an introduction to the current big data ecosystem. With the multitude of on-premises and cloud-based technologies, tools, services, libraries, and frameworks available in the big data, artificial intelligence, and machine learning space (and growing every day!), it is vitally important to understand the logical function of each layer within the big data ecosystem so that we may understand how they integrate with each other in order to ultimately architect and engineer end-to-end data intelligence and machine learning pipelines. This chapter also provides a logical introduction to Apache Spark within the context of the wider big data ecosystem.

Chapter 2, Setting Up a Local Development Environment, provides a detailed and hands-on guide to installing, configuring, and deploying a local Linux-based development environment on your personal desktop, laptop, or cloud-based infrastructure. You will learn how to install and configure all the software services required for this book in one self-contained location, including installing and configuring prerequisite programming languages (Java JDK 8 and Python 3), a distributed data processing and analytics engine (Apache Spark 2.3), a distributed real-time streaming platform (Apache Kafka 2.0), and a web-based notebook for interactive data insights and analytics (Jupyter Notebook).

Chapter 3, Artificial Intelligence and Machine Learning, provides a concise theoretical summary of the various applied subjects that fall under the artificial intelligence field of study, including machine learning, deep learning, and cognitive computing. This chapter also provides a logical introduction to how end-to-end data intelligence and machine learning pipelines may be architected and engineered using Apache Spark and its machine learning library, MLlib.

Chapter 4, Supervised Learning Using Apache Spark, provides a hands-on guide to engineering, training, validating, and interpreting the results of supervised machine learning algorithms using Apache Spark through real-world use-cases. The chapter describes and implements commonly used classification and regression techniques including linear regression, logistic regression, classification and regression trees (CART), and random forests.

Chapter 5, Unsupervised Learning Using Apache Spark, provides a hands-on guide to engineering, training, validating, and interpreting the results of unsupervised machine learning algorithms using Apache Spark through real-world use-cases. The chapter describes and implements commonly-used unsupervised techniques including hierarchical clustering, K-means clustering, and dimensionality reduction via Principal Component Analysis (PCA).

Chapter 6, Natural Language Processing Using Apache Spark, provides a hands-on guide to engineering natural language processing (NLP) pipelines using Apache Spark through real-world use-cases. The chapter describes and implements commonly used NLP techniques including tokenization, stemming, lemmatization, normalization, and other feature transformers, as well as feature extractors such as the bag of words and Term Frequency-Inverse Document Frequency (TF-IDF) algorithms.

Chapter 7, Deep Learning Using Apache Spark, provides a hands-on exploration of the exciting and cutting-edge world of deep learning! The chapter uses third-party deep learning libraries in conjunction with Apache Spark to train and interpret the results of Artificial Neural Networks (ANNs) including Multi-Layer Perceptrons (MLPs) and Convolutional Neural Networks (CNNs) applied to real-world use-cases.

Chapter 8, Real-Time Machine Learning Using Apache Spark, extends the deployment of machine learning models beyond batch processing in order to learn from data, make predictions, and identify trends in real-time! The chapter provides a hands-on guide to engineering and deploying real-time stream processing and machine learning pipelines using Apache Spark and Apache Kafka to transport, transform, and analyze data streams as they are being created around the world.

To get the most out of this book

Though this book aims to explain everything from first principles, it would be advantageous (though not strictly required) to have a basic knowledge of mathematical notation and basic programming skills in a language that can be used for data transformation, such as SQL, Base SAS, R, or Python. A good website for beginners to learn about SQL and Python is https://www.w3schools.com.

It is assumed that you have access to a physical or virtual machine provisioned with the CentOS Linux 7 (or Red Hat Linux) operating system. If you do not, Chapter 2, Setting Up a Local Development Environment, describes the various options available to provision a CentOS 7 virtual machine (VM), including via cloud-computing platforms such as Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP); virtual private server hosting companies; or free virtualization software, such as Oracle VirtualBox and VMware Workstation Player, that can be installed on your local physical device, such as a desktop or laptop.

A basic knowledge of Linux shell commands is required in order to install, configure, and provision a self-contained local development environment hosting the prerequisite software services detailed in Chapter 2, Setting Up a Local Development Environment. A good website for beginners to learn about the Linux command line is http://linuxcommand.org.

Download the example code files

You can download the example code files for this book from your account at www.packt.com. If you purchased this book elsewhere, you can visit www.packt.com/support and register to have the files emailed directly to you.

You can download the code files by following these steps:

Log in or register at www.packt.com.

Select the SUPPORT tab.

Click on Code Downloads & Errata.

Enter the name of the book in the Search box and follow the onscreen instructions.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

WinRAR/7-Zip for Windows

Zipeg/iZip/UnRarX for Mac

7-Zip/PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Machine-Learning-with-Apache-Spark-Quick-Start-Guide. In case there's an update to the code, it will be updated on the existing GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Conventions used

There are a number of text conventions used throughout this book.

CodeInText: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "Mount the downloaded WebStorm-10*.dmg disk image file as another disk in your system."

A block of code is set as follows:

import findspark
findspark.init()
from pyspark import SparkContext, SparkConf
import random

Any command-line input or output is written as follows:

> source /etc/profile.d/java.sh

> echo $PATH

> echo $JAVA_HOME

Bold: Indicates a new term, an important word, or words that you see onscreen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: "Select System info from the Administration panel."

Warnings or important notes appear like this.
Tips and tricks appear like this.

Get in touch

Feedback from our readers is always welcome.

General feedback: If you have questions about any aspect of this book, mention the book title in the subject of your message and email us at [email protected].

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packt.com/submit-errata, select your book, click on the Errata Submission Form link, and enter the details.

Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Reviews

Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!

For more information about Packt, please visit packt.com.

The Big Data Ecosystem

Modern technology has transformed the very essence of what we mean by data. Whereas previously, data was traditionally thought of as text and numbers confined to spreadsheets or relational databases, today, it is an organic and evolving asset in its own right, being created and consumed on a mass scale by anyone that owns a smartphone, TV, or bank account. In this chapter, we will explore the new ecosystem of cutting-edge tools, technologies, and frameworks that allow us to store, process, and analyze massive volumes of data in order to deliver actionable insights and solve real-world problems. By the end of this chapter, you will have gained a high-level understanding of the following cutting-edge technology classes:

Distributed systems

NoSQL databases

Artificial intelligence and machine learning frameworks

Cloud computing platforms

Big data platforms and reference architecture

A brief history of data

If you worked in the mainstream IT industry between the 1970s and early 2000s, it is likely that your organization's data was held either in text-based delimited files, spreadsheets, or nicely structured relational databases. In the case of the latter, data is modeled and persisted in pre-defined, and possibly related, tables representing the various entities found within your organization's data model, for example, employees or departments. These tables contain rows of data across multiple columns representing the various attributes making up that entity; for example, in the case of an employee, typical attributes include first name, last name, and date of birth.
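To make this concrete, the following short sketch (an illustration only, not code from this book's bundle) models a hypothetical employee table using Python's built-in sqlite3 module; the table and column names are assumed purely for the example:

# Hypothetical example of an "employee" entity as a relational table.
import sqlite3

conn = sqlite3.connect(":memory:")

# Each column represents one attribute of the employee entity.
conn.execute(
    "CREATE TABLE employee ("
    " employee_id INTEGER PRIMARY KEY,"
    " first_name TEXT,"
    " last_name TEXT,"
    " date_of_birth TEXT"
    ")"
)

# Each row represents one instance of the entity.
conn.execute("INSERT INTO employee VALUES (1, 'Jane', 'Smith', '1980-04-01')")

for row in conn.execute("SELECT first_name, last_name, date_of_birth FROM employee"):
    print(row)  # ('Jane', 'Smith', '1980-04-01')

conn.close()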

Vertical scaling

As both your organization's data estate and the number of users requiring access to that data grew, high-performance remote servers would have been utilized, with access provisioned over the corporate network. These remote servers would typically either act as remote filesystems for file sharing or host relational database management systems (RDBMSes) in order to store and manage relational databases. As data requirements grew, these remote servers would have needed to scale vertically, meaning that additional CPU, memory, and/or hard disk space would have been installed. Typically, these relational databases would have stored anything between hundreds and potentially tens of millions of records.

Master/slave architecture

As a means of providing resilience and load balancing read requests, potentially, a master/slave architecture would have been employed whereby data is automatically copied from the master database server to physically distinct slave database server(s) utilizing near real-time replication. This technique requires that the master server be responsible for all write requests, while read requests could be offloaded and load balanced across the slaves, where each slave would hold a full copy of the master data. That way, if the master server ever failed for some reason, business-critical read requests could still be processed by the slaves while the master was being brought back online. This technique does have a couple of major disadvantages, however:

Scalability: The master server, by being solely responsible for processing write requests, limits the ability of the system to scale, as it could quickly become a bottleneck.

Consistency and data loss: Since replication is near real-time, it is not guaranteed that the slaves will have the latest data at the point in time that the master server goes offline, and transactions may be lost. Depending on the business application, either not having the latest data or losing data may be unacceptable.
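To illustrate the read/write splitting described above, here is a minimal, hypothetical Python sketch (not an implementation from this book): all writes go to the master and are then copied to the slaves, while reads are load balanced across the slaves in round-robin fashion. The Database class and synchronous replication are deliberate simplifications.

import itertools

class Database:
    """Hypothetical stand-in for a real database server."""
    def __init__(self, name):
        self.name = name
        self.rows = {}

    def write(self, key, value):
        self.rows[key] = value

    def read(self, key):
        return self.rows.get(key)

class MasterSlaveRouter:
    def __init__(self, master, slaves):
        self.master = master
        self.slaves = slaves
        self._next_slave = itertools.cycle(slaves)  # round-robin load balancing of reads

    def write(self, key, value):
        # All writes must go through the master...
        self.master.write(key, value)
        # ...and are then replicated to the slaves (synchronously here for simplicity;
        # real replication is typically asynchronous and only near real-time).
        for slave in self.slaves:
            slave.write(key, value)

    def read(self, key):
        # Reads are offloaded to, and load balanced across, the slaves.
        return next(self._next_slave).read(key)

router = MasterSlaveRouter(Database("master"), [Database("slave-1"), Database("slave-2")])
router.write("employee:1", {"first_name": "Jane", "last_name": "Smith"})
print(router.read("employee:1"))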

Sharding

To increase throughput and overall performance, and as single machines reached their capacity to scale vertically in a cost-effective manner, it is possible that sharding would have been employed. This is one method of horizontal scaling whereby additional servers are provisioned and data is physically split over separate database instances residing on each of the machines in the cluster, as illustrated in Figure 1.1.

This approach would have allowed organizations to scale linearly to cater for increased data sizes while reusing existing database technologies and commodity hardware, thereby optimizing costs and performance for small- to medium-sized databases.

Crucially, however, these separate databases are standalone instances and have no knowledge of one another. Therefore, some sort of broker would be required that, based on a partitioning strategy, would keep track of where data was being written for each write request and, thereafter, retrieve data from that same location for read requests. Sharding subsequently introduced further challenges, such as processing data queries, transformations, and joins that span multiple standalone database instances across multiple servers (without denormalizing the data) while maintaining referential integrity, as well as repartitioning data:

Figure 1.1: A simple sharding partitioning strategy
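As a rough illustration of such a broker, the following hypothetical Python sketch (not taken from this book) applies a simple hash-based partitioning strategy so that, for a given key, reads are routed back to the same database instance that originally received the write:

class Shard:
    """Hypothetical stand-in for a standalone database instance."""
    def __init__(self, name):
        self.name = name
        self.rows = {}

class ShardingBroker:
    def __init__(self, shards):
        self.shards = shards

    def _shard_for(self, key):
        # Hash-based partitioning: the same key always maps to the same shard.
        # Python's hash() is only stable within a single process; a real broker
        # would use a stable hash function or an explicit key-range mapping.
        return self.shards[hash(key) % len(self.shards)]

    def write(self, key, value):
        self._shard_for(key).rows[key] = value

    def read(self, key):
        return self._shard_for(key).rows.get(key)

broker = ShardingBroker([Shard("shard-1"), Shard("shard-2"), Shard("shard-3")])
broker.write("employee:42", {"first_name": "Jane", "last_name": "Smith"})
print(broker.read("employee:42"))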

Data processing and analysis

Finally, in order to transform, process, and analyze the data sitting in these delimited text-based files, spreadsheets or relational databases, typically an analyst, data engineer or software engineer would have written some code.

This code, for example, could take the form of formulas or Visual Basic for Applications (VBA) for spreadsheets, or Structured Query Language (SQL) for relational databases, and would be used for the following purposes:

Loading data, including batch loading and data migration

Transforming data, including data cleansing, joins, merges, enrichment, and validation

Standard statistical aggregations, including computing averages, counts, totals, and pivot tables

Reporting, including graphs, charts, tables, and dashboards

To perform more complex statistical calculations, such as generating predictive models, advanced analysts could utilize more advanced programming languages, including Python, R, SAS, or even Java.
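As a simple, hypothetical illustration of the kinds of aggregations listed above (not code from this book), the following Python snippet computes a count, a total, an average, and a pivot-table-style grouping over a handful of in-memory records:

from statistics import mean
from collections import defaultdict

# Hypothetical sales records; in practice these might come from a spreadsheet
# or a relational database.
sales = [
    {"region": "North", "amount": 120.0},
    {"region": "South", "amount": 80.0},
    {"region": "North", "amount": 200.0},
]

amounts = [row["amount"] for row in sales]
print("count:", len(amounts))     # 3
print("total:", sum(amounts))     # 400.0
print("average:", mean(amounts))  # 133.33...

# A minimal pivot-table-style aggregation: total amount per region.
totals_by_region = defaultdict(float)
for row in sales:
    totals_by_region[row["region"]] += row["amount"]
print(dict(totals_by_region))     # {'North': 320.0, 'South': 80.0}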

Crucially, however, this data transformation, processing, and analysis would have either been executed directly on the server in which the data was persisted (for example, SQL statements executed directly on the relational database server in competition with other business-as-usual read and write requests), or data would be moved over the network via a programmatic query (for example, an ODBC or JDBC connection), or via flat files (for example, CSV or XML files) to another remote analytical processing server. The code could then be executed on that data, assuming, of course, that the remote processing server had sufficient CPUs, memory and/or disk space in its single machine to execute the job in question. In other words, the data would have been moved to the code in some way or another.

Data becomes big

Fast forward to today—spreadsheets are still commonplace, and relational databases containing nicely structured data, whether partitioned across shards or not, are still very much relevant and extremely useful. In fact, depending on the use case, the data volumes, structure, and the computational complexity of the required processing, it could still be faster and more efficient to store and manage data via an RDBMS and process that data directly on the remote database server using SQL. And, of course, spreadsheets are still great for very small datasets and for simple statistical aggregations. What has changed, however, since the 1970s is the availability of more powerful and more cost-effective technology coupled with the introduction of the internet!

The internet has transformed the very essence of what we mean by data. Whereas before, data was thought of as text and numbers confined to spreadsheets or relational databases, it is now an organic and evolving asset in its own right being created and consumed on a mass scale by anyone that owns a smartphone, TV, or bank account. Data is being created every second around the world in virtually any format you can think of, from social media posts, images, videos, audio, and music to blog posts, online forums, articles, computer log files, and financial transactions. All of this structured, semi-structured, and unstructured data being created in both batch and real time can no longer be stored and managed by nicely organized, text-based delimited files, spreadsheets, or relational databases, nor can it all be physically moved to a remote processing server every time some analytical code is to be executed—a new breed of technology is required.

Big data ecosystem

If you work in almost any mainstream industry today, chances are that you may have heard of some of the following terms and phrases:

Big data

Distributed, scalable, and elastic

On-premise versus the cloud

SQL versus NoSQL

Artificial intelligence, machine learning, and deep learning

But what do all these terms and phrases actually mean, how do they all fit together, and where do you start? The aim of this section is to answer all of those questions in a clear and concise manner.

Horizontal scaling

First of all, let's return to some of the data-centric problems that we described earlier. Given the huge explosion in the mass creation and consumption of data today, clearly we cannot continue to keep adding CPUs, memory, and/or hard drives to a single machine (in other words, vertical scaling). If we did, there would very quickly come a point where migrating to more powerful hardware would lead to diminishing returns while incurring significant costs. Furthermore, the ability to scale would be physically bounded by the biggest machine available to us, thereby limiting the growth potential of an organization.

Horizontal scaling, of which sharding is an example, is the process by which we can increase or decrease the amount of computational resources available to us via the addition or removal of hardware and/or software. Typically, this would involve the addition (or removal) of servers or nodes to a cluster of nodes. Crucially, however, the cluster acts as a single logical unit at all times, meaning that it will still continue to function and process requests regardless of whether resources were being added to it or taken away. The difference between horizontal and vertical scaling is illustrated in Figure 1.2:

Figure 1.2: Vertical scaling versus horizontal scaling

Distributed systems

Horizontal scaling allows organizations to become much more cost efficient when data and processing requirements grow beyond a certain point. But simply adding more machines to a cluster would not be of much value by itself. What we now need are systems that are capable of taking advantage of horizontal scalability and that work across multiple machines seamlessly, irrespective of whether the cluster contains one machine or 10,000 machines.

Distributed systems do precisely that—they work seamlessly across a cluster of machines and automatically deal with the addition (or removal) of resources from that cluster. Distributed systems can be broken down into the following types:

Distributed filesystems

Distributed databases

Distributed processing

Distributed messaging

Distributed streaming

Distributed ledgers

Distributed data stores