Every person and every organization in the world manages data, whether they realize it or not. Data is used to describe the world around us and can be used for almost any purpose, from analyzing consumer habits to fighting disease and serious organized crime. Ultimately, we manage data in order to derive value from it, and many organizations around the world have traditionally invested in technology to help process their data faster and more efficiently.
But we now live in an interconnected world driven by mass data creation and consumption where data is no longer rows and columns restricted to a spreadsheet, but an organic and evolving asset in its own right. With this realization come major challenges for organizations: how do we manage the sheer volume of data being created every second (think not only spreadsheets and databases, but also social media posts, images, videos, music, blogs, and so on)? And once we can manage all of this data, how do we derive real value from it?
The focus of Machine Learning with Apache Spark is to help us answer these questions in a hands-on manner. We introduce the latest scalable technologies to help us manage and process big data. We then introduce advanced analytical algorithms applied to real-world use cases in order to uncover patterns, derive actionable insights, and learn from this big data.
Copyright © 2018 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Commissioning Editor: Amey Varangaonkar
Acquisition Editor: Siddharth Mandal
Content Development Editor: Mohammed Yusuf Imaratwale
Technical Editor: Diksha Wakode
Copy Editor: Safis Editing
Project Coordinator: Kinjal Bari
Proofreader: Safis Editing
Indexer: Rekha Nair
Graphics: Alishon Mendonsa
Production Coordinator: Aparna Bhagat
First published: December 2018
Production reference: 1211218
Published by Packt Publishing Ltd. Livery Place 35 Livery Street Birmingham B3 2PB, UK.
ISBN 978-1-78934-656-5
www.packtpub.com
Mapt is an online digital library that gives you full access to over 5,000 books and videos, as well as industry leading tools to help you plan your personal development and advance your career. For more information, please visit our website.
Spend less time learning and more time coding with practical ebooks and videos from over 4,000 industry professionals
Improve your learning with Skill Plans built especially for you
Get a free ebook or video every month
Mapt is fully searchable
Copy and paste, print, and bookmark content
Did you know that Packt offers ebook versions of every book published, with PDF and ePub files available? You can upgrade to the ebook version at www.packt.com and as a print book customer, you are entitled to a discount on the ebook copy. Get in touch with us at [email protected] for more details.
At www.packt.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and ebooks.
Jillur Quddus is a lead technical architect, polyglot software engineer and data scientist with over 10 years of hands-on experience in architecting and engineering distributed, scalable, high-performance, and secure solutions used to combat serious organized crime, cybercrime, and fraud. Jillur has extensive experience of working within central government, intelligence, law enforcement, and banking, and has worked across the world including in Japan, Singapore, Malaysia, Hong Kong, and New Zealand. Jillur is both the founder of Keisan, a UK-based company specializing in open source distributed technologies and machine learning, and the lead technical architect at Methods, the leading digital transformation partner for the UK public sector.
Emmanuel Asimadi is a data scientist currently focusing on natural language processing as applied to the domain of customer experience. He has an MSc in cloud computing from the University of Leicester, UK, and over a decade of experience in a variety of analytics roles in both academic research and industry. His varied portfolio includes projects in Apache Spark, natural language processing, the semantic web, and telecommunications operations management, involving the creation and maintenance of ETL services that support telecom infrastructure operations and maintenance using data from thousands of nodes in the field.
Emmanuel also co-authored a video called Advanced Machine Learning with Spark and has made a significant contribution to the development of the video Big Data Analytics Projects with Apache Spark, which was published recently by Packt Publishing.
If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.
Title Page
Copyright and Credits
Machine Learning with Apache Spark Quick Start Guide
Dedication
About Packt
Why subscribe?
Packt.com
Contributors
About the author
About the reviewer
Packt is searching for authors like you
Preface
Who this book is for
What this book covers
To get the most out of this book
Download the example code files
Conventions used
Get in touch
Reviews
The Big Data Ecosystem
A brief history of data
Vertical scaling
Master/slave architecture
Sharding
Data processing and analysis
Data becomes big
Big data ecosystem
Horizontal scaling
Distributed systems
Distributed data stores
Distributed filesystems
Distributed databases
NoSQL databases
Document databases
Columnar databases
Key-value databases
Graph databases
CAP theorem
Distributed search engines
Distributed processing
MapReduce
Apache Spark
RDDs, DataFrames, and datasets
RDDs
DataFrames
Datasets
Jobs, stages, and tasks
Job
Stage
Tasks
Distributed messaging
Distributed streaming
Distributed ledgers
Artificial intelligence and machine learning
Cloud computing platforms
Data insights platform
Reference logical architecture
Data sources layer
Ingestion layer
Persistent data storage layer
Data processing layer
Serving data storage layer
Data intelligence layer
Unified access layer
Data insights and reporting layer
Platform governance, management, and administration
Open source implementation
Summary
Setting Up a Local Development Environment
CentOS Linux 7 virtual machine
Java SE Development Kit 8
Scala 2.11
Anaconda 5 with Python 3
Basic conda commands
Additional Python packages
Jupyter Notebook
Starting Jupyter Notebook
Troubleshooting Jupyter Notebook
Apache Spark 2.3
Spark binaries
Local working directories
Spark configuration
Spark properties
Environmental variables
Standalone master server
Spark worker node
PySpark and Jupyter Notebook
Apache Kafka 2.0
Kafka binaries
Local working directories
Kafka configuration
Start the Kafka server
Testing Kafka
Summary
Artificial Intelligence and Machine Learning
Artificial intelligence
Machine learning
Supervised learning
Unsupervised learning
Reinforcement learning
Deep learning
Natural neuron
Artificial neuron
Weights
Activation function
Heaviside step function
Sigmoid function
Hyperbolic tangent function
Artificial neural network
Single-layer perceptron
Multi-layer perceptron
NLP
Cognitive computing
Machine learning pipelines in Apache Spark
Summary
Supervised Learning Using Apache Spark
Linear regression
Case study – predicting bike sharing demand
Univariate linear regression
Residuals
Root mean square error
R-squared
Univariate linear regression in Apache Spark
Multivariate linear regression
Correlation
Multivariate linear regression in Apache Spark
Logistic regression
Threshold value
Confusion matrix
Receiver operating characteristic curve
Area under the ROC curve
Case study – predicting breast cancer
Classification and Regression Trees
Case study – predicting political affiliation
Random forests
K-Fold cross validation
Summary
Unsupervised Learning Using Apache Spark
Clustering
Euclidean distance
Hierarchical clustering
K-means clustering
Case study – detecting brain tumors
Feature vectors from images
Image segmentation
K-means cost function
K-means clustering in Apache Spark
Principal component analysis
Case study – movie recommendation system
Covariance matrix
Identity matrix
Eigenvectors and eigenvalues
PCA in Apache Spark
Summary
Natural Language Processing Using Apache Spark
Feature transformers
Document
Corpus
Preprocessing pipeline
Tokenization
Stop words
Stemming
Lemmatization
Normalization
Feature extractors
Bag of words
Term frequency–inverse document frequency
Case study – sentiment analysis
NLP pipeline
NLP in Apache Spark
Summary
Deep Learning Using Apache Spark
Artificial neural networks
Multilayer perceptrons
MLP classifier
Input layer
Hidden layers
Output layer
Case study 1 – OCR
Input data
Training architecture
Detecting patterns in the hidden layer
Classifying in the output layer
MLPs in Apache Spark
Convolutional neural networks
End-to-end neural architecture
Input layer
Convolution layers
Rectified linear units
Pooling layers
Fully connected layer
Output layer
Case study 2 – image recognition
InceptionV3 via TensorFlow
Deep learning pipelines for Apache Spark
Image library
PySpark image recognition application
Spark submit
Image-recognition results
Case study 3 – image prediction
PySpark image-prediction application
Image-prediction results
Summary
Real-Time Machine Learning Using Apache Spark
Distributed streaming platform
Distributed stream processing engines
Streaming using Apache Spark
Spark Streaming (DStreams)
Structured Streaming
Stream processing pipeline
Case study – real-time sentiment analysis
Start Zookeeper and Kafka Servers
Kafka topic
Twitter developer account
Twitter apps and the Twitter API
Application configuration
Kafka Twitter producer application
Preprocessing and feature vectorization pipelines
Kafka Twitter consumer application
Summary
Other Books You May Enjoy
Leave a review - let other readers know what you think
Every person and every organization in the world manages data, whether they realize it or not. Data is used to describe the world around us and can be used for almost any purpose, from analyzing consumer habits in order to recommend the latest products and services to fighting disease, climate change, and serious organized crime. Ultimately, we manage data in order to derive value from it, whether personal or business value, and many organizations around the world have traditionally invested in tools and technologies to help them process their data faster and more efficiently in order to deliver actionable insights.
But we now live in a highly interconnected world driven by mass data creation and consumption, where data is no longer rows and columns restricted to a spreadsheet but an organic and evolving asset in its own right. With this realization come major challenges for organizations as we enter the intelligence-driven fourth industrial revolution—how do we manage the sheer amount of data being created every second in all of its various formats (think not only spreadsheets and databases, but also social media posts, images, videos, music, online forums and articles, computer log files, and more)? And once we know how to manage all of this data, how do we know what questions to ask of it in order to derive real personal or business value?
The focus of this book is to help us answer those questions in a hands-on manner starting from first principles. We introduce the latest cutting-edge technologies (the big data ecosystem, including Apache Spark) that can be used to manage and process big data. We then explore advanced classes of algorithms (machine learning, deep learning, natural language processing, and cognitive computing) that can be applied to the big data ecosystem to help us uncover previously hidden relationships in order to understand what the data is telling us so that we may ultimately solve real-world challenges.
This book is aimed at business analysts, data analysts, data scientists, data engineers, and software engineers for whom a typical day may currently involve analyzing data using spreadsheets or relational databases, perhaps using VBA, Structured Query Language (SQL), or even Python to compute statistical aggregations (such as averages) and to generate graphs, charts, pivot tables, and other reporting media.
With the explosion of data in all of its various formats and frequencies, perhaps you are now challenged with not only managing all of that data, but understanding what it is telling you. You have most likely heard the terms big data, artificial intelligence, and machine learning, but now wish to understand where to start in order to take advantage of these new technologies and frameworks, not just in theory but in practice as well, to solve your business challenges. If this sounds familiar, then this book is for you!
Chapter 1, The Big Data Ecosystem, provides an introduction to the current big data ecosystem. With the multitude of on-premises and cloud-based technologies, tools, services, libraries, and frameworks available in the big data, artificial intelligence, and machine learning space (and growing every day!), it is vitally important to understand the logical function of each layer within the big data ecosystem so that we may understand how they integrate with each other in order to ultimately architect and engineer end-to-end data intelligence and machine learning pipelines. This chapter also provides a logical introduction to Apache Spark within the context of the wider big data ecosystem.
Chapter 2, Setting Up a Local Development Environment, provides a detailed and hands-on guide to installing, configuring, and deploying a local Linux-based development environment on your personal desktop, laptop, or cloud-based infrastructure. You will learn how to install and configure all the software services required for this book in one self-contained location, including installing and configuring prerequisite programming languages (Java JDK 8 and Python 3), a distributed data processing and analytics engine (Apache Spark 2.3), a distributed real-time streaming platform (Apache Kafka 2.0), and a web-based notebook for interactive data insights and analytics (Jupyter Notebook).
Chapter 3, Artificial Intelligence and Machine Learning, provides a concise theoretical summary of the various applied subjects that fall under the artificial intelligence field of study, including machine learning, deep learning, and cognitive computing. This chapter also provides a logical introduction into how end-to-end data intelligence and machine learning pipelines may be architected and engineered using Apache Spark and its machine learning library, MLlib.
Chapter 4, Supervised Learning Using Apache Spark, provides a hands-on guide to engineering, training, validating, and interpreting the results of supervised machine learning algorithms using Apache Spark through real-world use cases. The chapter describes and implements commonly used classification and regression techniques, including linear regression, logistic regression, classification and regression trees (CART), and random forests.
Chapter 5, Unsupervised Learning Using Apache Spark, provides a hands-on guide to engineering, training, validating, and interpreting the results of unsupervised machine learning algorithms using Apache Spark through real-world use cases. The chapter describes and implements commonly used unsupervised techniques, including hierarchical clustering, K-means clustering, and dimensionality reduction via Principal Component Analysis (PCA).
Chapter 6, Natural Language Processing Using Apache Spark, provides a hands-on guide to engineering natural language processing (NLP) pipelines using Apache Spark through real-world use cases. The chapter describes and implements commonly used NLP techniques, including tokenization, stemming, lemmatization, normalization, and other feature transformers, as well as feature extractors such as the bag of words and Term Frequency-Inverse Document Frequency (TF-IDF) algorithms.
Chapter 7, Deep Learning Using Apache Spark, provides a hands-on exploration of the exciting and cutting-edge world of deep learning! The chapter uses third-party deep learning libraries in conjunction with Apache Spark to train and interpret the results of Artificial Neural Networks (ANNs), including Multi-Layer Perceptrons (MLPs) and Convolutional Neural Networks (CNNs), applied to real-world use cases.
Chapter 8, Real-Time Machine Learning Using Apache Spark, extends the deployment of machine learning models beyond batch processing in order to learn from data, make predictions, and identify trends in real time! The chapter provides a hands-on guide to engineering and deploying real-time stream processing and machine learning pipelines using Apache Spark and Apache Kafka to transport, transform, and analyze data streams as they are being created around the world.
Though this book aims to explain everything from first principles, it would be advantageous (though not strictly required) to have a basic knowledge of mathematical notation and basic programming skills in a language that can be used for data transformation, such as SQL, Base SAS, R, or Python. A good website for beginners to learn about SQL and Python is https://www.w3schools.com.
It is assumed that you have access to a physical or virtual machine provisioned with the CentOS Linux 7 (or Red Hat Linux) operating system. If you do not, Chapter 2, Setting Up a Local Development Environment, describes the various options available to provision a CentOS 7 virtual machine (VM), including via cloud computing platforms such as Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP), virtual private server hosting companies, or free virtualization software such as Oracle VirtualBox and VMware Workstation Player that can be installed on your local physical device, such as a desktop or laptop.
A basic knowledge of Linux shell commands is required in order to install, configure, and provision a self-contained local development environment hosting the prerequisite software services detailed in Chapter 2, Setting Up a Local Development Environment. A good website for beginners to learn about the Linux command line is http://linuxcommand.org.
You can download the example code files for this book from your account at www.packt.com. If you purchased this book elsewhere, you can visit www.packt.com/support and register to have the files emailed directly to you.
You can download the code files by following these steps:
1. Log in or register at www.packt.com.
2. Select the SUPPORT tab.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box and follow the onscreen instructions.
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
WinRAR/7-Zip for Windows
Zipeg/iZip/UnRarX for Mac
7-Zip/PeaZip for Linux
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Machine-Learning-with-Apache-Spark-Quick-Start-Guide. In case there's an update to the code, it will be updated on the existing GitHub repository.
We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
There are a number of text conventions used throughout this book.
CodeInText: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "Mount the downloaded WebStorm-10*.dmg disk image file as another disk in your system."
A block of code is set as follows:
import findspark
findspark.init()
from pyspark import SparkContext, SparkConf
import random
Any command-line input or output is written as follows:
> source /etc/profile.d/java.sh
> echo $PATH
> echo $JAVA_HOME
Bold: Indicates a new term, an important word, or words that you see onscreen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: "Select System info from the Administration panel."
Feedback from our readers is always welcome.
General feedback: If you have questions about any aspect of this book, mention the book title in the subject of your message and email us at [email protected].
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packt.com/submit-errata, select your book, click on the Errata Submission Form link, and enter the details.
Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.
Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!
For more information about Packt, please visit packt.com.
Modern technology has transformed the very essence of what we mean by data. Whereas previously, data was traditionally thought of as text and numbers confined to spreadsheets or relational databases, today, it is an organic and evolving asset in its own right, being created and consumed on a mass scale by anyone that owns a smartphone, TV, or bank account. In this chapter, we will explore the new ecosystem of cutting-edge tools, technologies, and frameworks that allow us to store, process, and analyze massive volumes of data in order to deliver actionable insights and solve real-world problems. By the end of this chapter, you will have gained a high-level understanding of the following cutting-edge technology classes:
Distributed systems
NoSQL databases
Artificial intelligence and machine learning frameworks
Cloud computing platforms
Big data platforms and reference architecture
If you worked in the mainstream IT industry between the 1970s and early 2000s, it is likely that your organization's data was held either in text-based delimited files, spreadsheets, or nicely structured relational databases. In the case of the latter, data is modeled and persisted in pre-defined, and possibly related, tables representing the various entities found within your organization's data model, such as employees or departments. These tables contain rows of data across multiple columns representing the various attributes that make up that entity; for example, in the case of an employee, typical attributes include first name, last name, and date of birth.
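To make this concrete, here is a minimal sketch of such an entity table, using Python's built-in sqlite3 module purely as an illustration (the employee table and its columns are hypothetical examples, not taken from any particular system):

import sqlite3

# A throwaway in-memory database, purely for illustration
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# An entity table with a pre-defined schema: each row represents an employee,
# and each column an attribute of that entity
cur.execute("""
    CREATE TABLE employee (
        employee_id   INTEGER PRIMARY KEY,
        first_name    TEXT,
        last_name     TEXT,
        date_of_birth TEXT
    )
""")

cur.executemany(
    "INSERT INTO employee VALUES (?, ?, ?, ?)",
    [(1, "Ada", "Lovelace", "1815-12-10"),
     (2, "Alan", "Turing", "1912-06-23")]
)
conn.commit()

# A typical structured query against the entity
for row in cur.execute("SELECT first_name, last_name FROM employee ORDER BY last_name"):
    print(row)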
As both your organization's data estate and the number of users requiring access to that data grew, high-performance remote servers would have been utilized, with access provisioned over the corporate network. These remote servers would typically either act as remote filesystems for file sharing or host relational database management systems (RDBMSes) in order to store and manage relational databases. As data requirements grew, these remote servers would have needed to scale vertically, meaning that additional CPU, memory, and/or hard disk space would have been installed. Typically, these relational databases would have stored anything between hundreds and potentially tens of millions of records.
As a means of providing resilience and load balancing read requests, potentially, a master/slave architecture would have been employed whereby data is automatically copied from the master database server to physically distinct slave database server(s) utilizing near real-time replication. This technique requires that the master server be responsible for all write requests, while read requests could be offloaded and load balanced across the slaves, where each slave would hold a full copy of the master data. That way, if the master server ever failed for some reason, business-critical read requests could still be processed by the slaves while the master was being brought back online. This technique does have a couple of major disadvantages, however:
Scalability: The master server, by being solely responsible for processing write requests, limits the ability of the system to scale, as it could quickly become a bottleneck.
Consistency and data loss: Since replication is near real-time, it is not guaranteed that the slaves will have the latest data at the point in time that the master server goes offline, and transactions may be lost. Depending on the business application, either not having the latest data or losing data may be unacceptable.
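As a rough sketch of how such read/write routing might be wired up in application code (the StubConnection class and the statements below are hypothetical stand-ins, not any particular database driver), consider the following:

import itertools

class StubConnection:
    """A hypothetical stand-in for a real database connection, purely for illustration."""
    def __init__(self, name):
        self.name = name
    def execute(self, sql):
        return f"[{self.name}] executed: {sql}"

class MasterSlaveRouter:
    """Route all writes to the single master; load balance reads across the slaves."""
    def __init__(self, master, slaves):
        self.master = master                   # sole writer, replicated to the slaves
        self.slaves = itertools.cycle(slaves)  # round-robin rotation over read replicas

    def write(self, statement):
        # Every write goes through the master so that there is a single source of truth
        return self.master.execute(statement)

    def read(self, query):
        # Reads are offloaded to the next slave in the rotation
        return next(self.slaves).execute(query)

router = MasterSlaveRouter(StubConnection("master"),
                           [StubConnection("slave-1"), StubConnection("slave-2")])
print(router.write("INSERT INTO employee VALUES (3, 'Grace', 'Hopper', '1906-12-09')"))
print(router.read("SELECT COUNT(*) FROM employee"))
print(router.read("SELECT * FROM employee"))

If the master fails, the slaves in this arrangement can continue to serve read requests while it is brought back online, which is exactly the resilience benefit described above.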
To increase throughput and overall performance, and as single machines reached their capacity to scale vertically in a cost-effective manner, it is possible that sharding would have been employed. This is one method of horizontal scaling whereby additional servers are provisioned and data is physically split over separate database instances residing on each of the machines in the cluster, as illustrated in Figure 1.1.
This approach would have allowed organizations to scale linearly to cater for increased data sizes while reusing existing database technologies and commodity hardware, thereby optimizing costs and performance for small- to medium-sized databases.
Crucially, however, these separate databases are standalone instances and have no knowledge of one another. Therefore, some sort of broker would be required that, based on a partitioning strategy, would keep track of where data was being written to for each write request and, thereafter, retrieve data from that same location for read requests. Sharding subsequently introduced further challenges of its own, such as processing queries, transformations, and joins that span multiple standalone database instances across multiple servers (without denormalizing the data), maintaining referential integrity, and repartitioning data.
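A heavily simplified sketch of such a broker, assuming a hash-based partitioning strategy over a record key (the in-memory dictionaries below merely stand in for standalone database instances on separate servers), might look like this:

class ShardBroker:
    """Route each record to a shard chosen by a hash-based partitioning strategy."""
    def __init__(self, num_shards):
        # Each "shard" is just an in-memory dictionary standing in for a
        # standalone database instance running on a separate server
        self.shards = [dict() for _ in range(num_shards)]

    def _shard_for(self, key):
        # The partitioning strategy: hash the key and take the remainder modulo
        # the number of shards to decide which instance owns the record
        return self.shards[hash(key) % len(self.shards)]

    def write(self, key, record):
        self._shard_for(key)[key] = record

    def read(self, key):
        # The broker recomputes the same location to retrieve the record
        return self._shard_for(key).get(key)

broker = ShardBroker(num_shards=4)
broker.write("employee:42", {"first_name": "Ada", "last_name": "Lovelace"})
print(broker.read("employee:42"))

Note that a query spanning shards, such as a join between records that live on different instances, can no longer be answered by any single database; the broker would have to fan the query out to every shard and combine the results itself, which is precisely the kind of complexity described above.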
Finally, in order to transform, process, and analyze the data sitting in these delimited text-based files, spreadsheets or relational databases, typically an analyst, data engineer or software engineer would have written some code.
This code, for example, could take the form of formulas or Visual Basic for Applications (VBA) for spreadsheets, or Structured Query Language (SQL) for relational databases, and would be used for the following purposes:
Loading data, including batch loading and data migration
Transforming data, including data cleansing, joins, merges, enrichment, and validation
Standard statistical aggregations, including computing averages, counts, totals, and pivot tables
Reporting, including graphs, charts, tables, and dashboards
To perform more complex statistical calculations, such as generating predictive models, advanced analysts could utilize more advanced programming languages, including Python, R, SAS, or even Java.
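To give a feel for the kind of analytical code described above, here is a minimal Python sketch using pandas (the library choice and the sales data are purely illustrative assumptions, not examples from this book) that computes standard aggregations and a pivot table:

import pandas as pd

# A small, entirely hypothetical extract of sales data
sales = pd.DataFrame({
    "region":  ["North", "North", "South", "South", "South"],
    "product": ["A", "B", "A", "A", "B"],
    "amount":  [120.0, 75.5, 200.0, 90.0, 310.25],
})

# Standard statistical aggregations: averages, counts, and totals by region
summary = sales.groupby("region")["amount"].agg(["mean", "count", "sum"])
print(summary)

# A simple pivot table of total sales by region and product
pivot = sales.pivot_table(values="amount", index="region",
                          columns="product", aggfunc="sum", fill_value=0)
print(pivot)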
Crucially, however, this data transformation, processing, and analysis would have either been executed directly on the server in which the data was persisted (for example, SQL statements executed directly on the relational database server in competition with other business-as-usual read and write requests), or data would be moved over the network via a programmatic query (for example, an ODBC or JDBC connection), or via flat files (for example, CSV or XML files) to another remote analytical processing server. The code could then be executed on that data, assuming, of course, that the remote processing server had sufficient CPUs, memory and/or disk space in its single machine to execute the job in question. In other words, the data would have been moved to the code in some way or another.
Fast forward to today—spreadsheets are still commonplace, and relational databases containing nicely structured data, whether partitioned across shards or not, are still very much relevant and extremely useful. In fact, depending on the use case, the data volumes, structure, and the computational complexity of the required processing, it could still be faster and more efficient to store and manage data via an RDBMS and process that data directly on the remote database server using SQL. And, of course, spreadsheets are still great for very small datasets and for simple statistical aggregations. What has changed, however, since the 1970s is the availability of more powerful and more cost-effective technology coupled with the introduction of the internet!
The internet has transformed the very essence of what we mean by data. Whereas before, data was thought of as text and numbers confined to spreadsheets or relational databases, it is now an organic and evolving asset in its own right being created and consumed on a mass scale by anyone that owns a smartphone, TV, or bank account. Data is being created every second around the world in virtually any format you can think of, from social media posts, images, videos, audio, and music to blog posts, online forums, articles, computer log files, and financial transactions. All of this structured, semi-structured, and unstructured data being created in both batch and real time can no longer be stored and managed by nicely organized, text-based delimited files, spreadsheets, or relational databases, nor can it all be physically moved to a remote processing server every time some analytical code is to be executed—a new breed of technology is required.
If you work in almost any mainstream industry today, chances are that you may have heard of some of the following terms and phrases:
Big data
Distributed, scalable, and elastic
On-premise versus the cloud
SQL versus NoSQL
Artificial intelligence, machine learning, and deep learning
But what do all these terms and phrases actually mean, how do they all fit together, and where do you start? The aim of this section is to answer all of those questions in a clear and concise manner.
First of all, let's return to some of the data-centric problems that we described earlier. Given the huge explosion in the mass creation and consumption of data today, clearly we cannot continue to keep adding CPUs, memory, and/or hard drives to a single machine (in other words, vertical scaling). If we did, there would very quickly come a point where migrating to more powerful hardware would lead to diminishing returns while incurring significant costs. Furthermore, the ability to scale would be physically bounded by the biggest machine available to us, thereby limiting the growth potential of an organization.
Horizontal scaling, of which sharding is an example, is the process by which we can increase or decrease the amount of computational resources available to us via the addition or removal of hardware and/or software. Typically, this would involve the addition (or removal) of servers or nodes to a cluster of nodes. Crucially, however, the cluster acts as a single logical unit at all times, meaning that it will still continue to function and process requests regardless of whether resources were being added to it or taken away. The difference between horizontal and vertical scaling is illustrated in Figure 1.2:
Horizontal scaling allows organizations to become much more cost efficient when data and processing requirements grow beyond a certain point. But simply adding more machines to a cluster would not be of much value by itself. What we now need are systems that are capable of taking advantage of horizontal scalability and that work across multiple machines seamlessly, irrespective of whether the cluster contains one machine or 10,000 machines.
Distributed systems do precisely that—they work seamlessly across a cluster of machines and automatically deal with the addition (or removal) of resources from that cluster. Distributed systems can be broken down into the following types:
Distributed filesystems
Distributed databases
Distributed processing
Distributed messaging
Distributed streaming
Distributed ledgers
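Apache Spark, the focus of this book, is an example of a distributed processing engine, and it illustrates the point about the cluster acting as a single logical unit: the same application code runs unchanged whether it executes on a single laptop or on thousands of nodes. The following minimal PySpark sketch assumes the local environment set up in Chapter 2; the spark:// master URL mentioned in the comment is a hypothetical placeholder for a real cluster.

from pyspark.sql import SparkSession

# The same application code runs whether the "cluster" is this single machine
# (local[*]) or thousands of nodes behind a standalone master, for example
# .master("spark://master-host:7077"). Only the master URL changes.
spark = (SparkSession.builder
         .appName("horizontal-scaling-demo")
         .master("local[*]")
         .getOrCreate())

# Spark automatically partitions the data and distributes the work across
# whatever resources the cluster currently has available
numbers = spark.sparkContext.parallelize(range(1000000), numSlices=8)
print(numbers.map(lambda x: x * x).sum())

spark.stop()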
