Fast Data Processing Systems with SMACK Stack

Raúl Estrada
Description

SMACK is an open source full stack for big data architecture. It is a combination of Spark, Mesos, Akka, Cassandra, and Kafka. This stack is the newest technique developers have begun to use to tackle critical real-time analytics for big data. This highly practical guide will teach you how to integrate these technologies to create a highly efficient data analysis system for fast data processing.
We’ll start off with an introduction to SMACK and show you when to use it. First you’ll get to grips with functional thinking and problem solving using Scala. Next you’ll come to understand the Akka architecture. Then you’ll get to know how to improve the data structure architecture and optimize resources using Apache Spark.
Moving forward, you'll learn how to achieve linear scalability in databases with Apache Cassandra. You'll grasp high-throughput distributed messaging systems using Apache Kafka. We'll show you how to build a cheap but effective cluster infrastructure with Apache Mesos. Finally, you will deep dive into the different aspects of SMACK using a few case studies.
By the end of the book, you will be able to integrate all the components of the SMACK stack and use them together to achieve highly effective and fast data processing.




Table of Contents

Fast Data Processing Systems with SMACK Stack
Credits
About the Author
About the Reviewers
www.PacktPub.com
Why subscribe?
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Downloading the color images of this book
Errata
Piracy
Questions
1. An Introduction to SMACK
Modern data-processing challenges
The data-processing pipeline architecture
The NoETL manifesto
Lambda architecture
Hadoop
SMACK technologies
Apache Spark
Akka
Apache Cassandra
Apache Kafka
Apache Mesos
Changing the data center operations
From scale-up to scale-out
The open-source predominance
Data store diversification
Data gravity and data locality
DevOps rules
Data expert profiles
Data architects
Data engineers
Data analysts
Data scientists
Is SMACK for me?
Summary
2. The Model - Scala and Akka
The language - Scala
Kata 1 - The collections hierarchy
Sequence
Map
Set
Kata 2 - Choosing the right collection
Sequence
Map
Set
Kata 3 - Iterating with foreach
Kata 4 - Iterating with for
Kata 5 - Iterators
Kata 6 - Transforming with map
Kata 7 - Flattening
Kata 8 - Filtering
Kata 9 - Subsequences
Kata 10 - Splitting
Kata 11 - Extracting unique elements
Kata 12 - Merging
Kata 13 - Lazy views
Kata 14 - Sorting
Kata 15 - Streams
Kata 16 - Arrays
Kata 17 - ArrayBuffer
Kata 18 - Queues
Kata 19 - Stacks
Kata 20 - Ranges
The model - Akka
The Actor Model in a nutshell
Kata 21 - Actors
The actor system
Actor reference
Kata 22 - Actor communication
Kata 23 - Actor life cycle
Kata 24 - Starting actors
Kata 25 - Stopping actors
Kata 26 - Killing actors
Kata 27 - Shutting down the actor system
Kata 28 - Actor monitoring
Kata 29 - Looking up actors
Summary
3. The Engine - Apache Spark
Spark in single mode
Downloading Apache Spark
Testing Apache Spark
Spark core concepts
Resilient distributed datasets
Running Spark applications
Initializing the Spark context
Spark applications
Running programs
RDD operation
Transformations
Actions
Persistence (caching)
Spark in cluster mode
Runtime architecture
Driver
Dividing a program into tasks
Scheduling tasks on executors
Executor
Cluster manager
Program execution
Application deployment
Standalone cluster manager
Launching the standalone manager
Submitting our application
Configuring resources
Working in the cluster
Spark Streaming
Spark Streaming architecture
Transformations
Stateless transformations
Stateful transformations
Windowed operations
Update state by key
Output operations
Fault-tolerant Spark Streaming
Checkpointing
Spark Streaming performance
Parallelism level
Window size and batch size
Garbage collector
Summary
4. The Storage - Apache Cassandra
A bit of history
NoSQL
NoSQL or SQL?
CAP Brewer's theorem
Apache Cassandra installation
Data model
Data storage
Installation
DataStax OpsCenter
Creating a key space
Authentication and authorization (roles)
Setting up a simple authentication and authorization
Backup
Compression
Recovery
Restart node
Printing schema
Logs
Configuring log4j
Log file rotation
User activity log
Transaction log
SQL dump
CQL
CQL commands
DBMS Cluster
Deleting the database
CLI delete commands
CQL shell delete commands
DB and DBMS optimization
Bloom filter
Data cache
Java heap tune up
Java garbage collection tune up
Views, triggers, and stored procedures
Client-server architecture
Drivers
Spark-Cassandra connector
Installing the connector
Establishing the connection
Using the connector
Summary
5. The Broker - Apache Kafka
Introducing Kafka
Features of Apache Kafka
Born to be fast data
Use cases
Installation
Installing Java
Installing Kafka
Importing Kafka
Cluster
Single node - single broker cluster
Starting Zookeeper
Starting the broker
Creating a topic
Starting a producer
Starting a consumer
Single node - Multiple broker cluster
Starting the brokers
Creating a topic
Starting a producer
Starting a consumer
Multiple node - multiple broker cluster
Broker properties
Architecture
Segment files
Offset
Leaders
Groups
Log compaction
Kafka design
Message compression
Replication
Asynchronous replication
Synchronous replication
Producers
Producer API
Scala producers
Step 1: Import classes
Step 2: Define properties
Step 3: Build and send the message
Step 4: Create the topic
Step 5: Compile the producer
Step 6: Run the producer
Step 7: Run a consumer
Producers with custom partitioning
Step 1: Import classes
Step 2: Define properties
Step 3: Implement the partitioner class
Step 4: Build and send the message
Step 5: Create the topic
Step 6: Compile the programs
Step 7: Run the producer
Step 8: Run a consumer
Producer properties
Consumers
Consumer API
Simple Scala consumers
Step 1: Import classes
Step 2: Define properties
Step 3: Code the SimpleConsumer
Step 4: Create the topic
Step 5: Compile the program
Step 6: Run the producer
Step 7: Run the consumer
Multithread Scala consumers
Step 1: Import classes
Step 2: Define properties
Step 3: Code the MultiThreadConsumer
Step 4: Create the topic
Step 5: Compile the program
Step 6: Run the producer
Step 7: Run the consumer
Consumer properties
Integration
Integration with Apache Spark
Administration
Cluster tools
Adding servers
Kafka topic tools
Cluster mirroring
Summary
6. The Manager - Apache Mesos
The Apache Mesos architecture
Frameworks
Existing Mesos frameworks
Frameworks for long running applications
Frameworks for scheduling
Frameworks for storage
Attributes and resources
Attributes
Resources
The Apache Mesos API
Messages
The Executor API
Executor Driver API
The Scheduler API
The Scheduler Driver API
Resource allocation
The DRF algorithm
Weighted DRF algorithm
Resource configuration
Resource reservation
Static reservation
Defining roles
Assigning frameworks to roles
Setting policies
Dynamic reservation
The reserve operation
The unreserve operation
HTTP reserve
HTTP unreserve
Running a Mesos cluster on AWS
AWS instance types
AWS instances launching
Installing Mesos on AWS
Downloading Mesos
Building Mesos
Launching several instances
Running a Mesos cluster on a private data center
Mesos installation
Setting up the environment
Start the master
Start the slaves
Process automation
Common Mesos issues
Missing library dependencies
Directory permissions
Missing library
Debugging
Directory structure
Slaves not connecting with masters
Multiple slaves on the same machine
Scheduling and management frameworks
Marathon
Marathon installation
Installing Apache Zookeeper
Running Marathon in local mode
Multi-node Marathon installation
Running a test application from the web UI
Application scaling
Terminating the application
Chronos
Chronos installation
Job scheduling
Chronos and Marathon
Chronos REST API
Listing running jobs
Starting a job manually
Adding a job
Deleting a job
Deleting all the job tasks
Marathon REST API
Listing the running applications
Adding an application
Changing the application configuration
Deleting the application
Apache Aurora
Installing Aurora
Singularity
Singularity installation
The Singularity configuration file
Apache Spark on Apache Mesos
Submitting jobs in client mode
Submitting jobs in cluster mode
Advanced configuration
Apache Cassandra on Apache Mesos
Advanced configuration
Apache Kafka on Apache Mesos
Kafka log management
Summary
7. Study Case 1 - Spark and Cassandra
Spark Cassandra connector
Requisites
Preparing Cassandra
SparkContext setup
Cassandra and Spark Streaming
Spark Streaming setup
Cassandra setup
Streaming context creation
Stream creation
Kafka Streams
Akka Streams
Enabling Cassandra
Write the Stream to Cassandra
Read the Stream from Cassandra
Saving datasets to Cassandra
Saving a collection of tuples to Cassandra
Saving collections to Cassandra
Modifying collections
Saving objects to Cassandra (user-defined types)
Scala options to Cassandra options conversion
Saving RDDs as new tables
Cluster deployment
Spark Cassandra use cases
Study case: The Calliope project
Installing Calliope
CQL3
Read from Cassandra with CQL3
Write to Cassandra with CQL3
Thrift
Read from Cassandra with Thrift
Write to Cassandra with Thrift
Calliope SQL context creation
Calliope SQL Configuration
Loading Cassandra tables programmatically
Summary
8. Study Case 2 - Connectors
Akka and Cassandra
Writing to Cassandra
Reading from Cassandra
Connecting to Cassandra
Scanning tweets
Testing the scanner
Akka and Spark
Kafka and Akka
Kafka and Cassandra
Summary
9. Study Case 3 - Mesos and Docker
Mesos frameworks API
Authentication, authorization, and access control
Framework authentication
Authentication configuration
Framework authorization
Access control lists
Spark Mesos run modes
Coarse-grained
Fine-grained
Apache Mesos API
Scheduler HTTP API
Requests
SUBSCRIBE
TEARDOWN
ACCEPT
DECLINE
REVIVE
KILL
SHUTDOWN
ACKNOWLEDGE
RECONCILE
MESSAGE
REQUEST
Responses
SUBSCRIBED
OFFERS
RESCIND
UPDATE
MESSAGE
FAILURE
ERROR
HEARTBEAT
Mesos containerizers
Containers
Docker containerizers
Containers and containerizers
Types of containerizers
Creating containerizers
Mesos containerizer
Launching Mesos containerizer
Architecture of Mesos containerizer
Shared filesystem
PID namespace
Posix disk
Docker containerizers
Docker containerizer setup
Launching the Docker containerizers
Composing containerizers
Summary

Fast Data Processing Systems with SMACK Stack

Copyright © 2016 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, nor its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Production reference: 1151216

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham 

B3 2PB, UK.

ISBN 978-1-78646-720-1

www.packtpub.com

Credits

Author

Raúl Estrada

Copy Editor

Safis Editing

Reviewers

Anton Kirillov

Sumit Pal

Project Coordinator

Shweta H Birwatkar 

Commissioning Editor

Veena Pagare

Proofreader

Safis Editing

Acquisition Editor

Divya Poojari

Indexer

Mariammal Chettiyar

Content Development Editor

Amrita Noronha

Graphics

Disha Haria

Technical Editor

Sneha Hanchate

Production Coordinator

Nilesh Mohite

About the Author

Raúl Estrada has been a programmer since 1996 and a Java developer since 2001. He loves functional languages such as Scala, Elixir, Clojure, and Haskell, as well as all topics related to computer science. With more than 12 years of experience in high availability and enterprise software, he has designed and implemented architectures since 2003.

He specializes in systems integration and has participated in projects mainly related to the financial sector. He has been an enterprise architect for BEA Systems and Oracle Inc., but he also enjoys mobile programming and game development. He considers himself a programmer before an architect, engineer, or developer.

He is also a CrossFitter in the San Francisco Bay Area, now focused on open source projects related to data pipelining, such as Apache Flink, Apache Kafka, and Apache Beam. Raúl is a supporter of free software and enjoys experimenting with new technologies, frameworks, languages, and methods.

I want to thank my family, especially my mom for her patience and dedication.

I would like to thank Master Gerardo Borbolla and his family for the support and feedback they provided during the writing of this book.

I want to say thanks to the acquisition editor, Divya Poojari, who believed in this project from the beginning.

I also thank my editors Deepti Thore and Amrita Noronha. Without their effort and patience, it would not have been possible to write this book.

And finally, I want to thank all the heroes who contribute (often anonymously and without profit) to the open source projects mentioned here, specifically Spark, Mesos, Akka, Cassandra, and Kafka; an honorable mention goes to those who build the connectors between these technologies.

About the Reviewers

Anton Kirillov started his career as a Java developer in 2007, while working on his PhD thesis in the semantic search domain at the same time. After finishing and defending his thesis, he switched to the Scala ecosystem and distributed systems development. He has worked for and consulted with startups focused on big data analytics in various domains (real-time bidding, telecom, B2B advertising, and social networks), where his main responsibilities were designing data platform architectures and validating their performance and stability. Besides helping startups, he has worked in the banking industry, building Hadoop/Spark data analytics solutions, and at a mobile games company, where he designed and implemented several reporting systems and a backend for a massively parallel online game.

The main technologies Anton has been using in recent years include Scala, Hadoop, Spark, Mesos, Akka, Cassandra, and Kafka, and there are a number of systems he has built from scratch and successfully released using these technologies. Currently, Anton is working as a Staff Engineer on the Ooyala data team, with a focus on fault-tolerant, fast analytical solutions for the ad serving/reporting domain.

Sumit Pal has more than 24 years of experience in the software industry, spanning companies from startups to enterprises. He is a big data architect and a visualization and data science consultant, and he builds end-to-end data-driven analytic systems. Sumit has worked for Microsoft (SQL Server), Oracle (OLAP), and Verizon (big data analytics). Currently, he works for multiple clients, building their data architectures and big data solutions with Spark, Scala, Java, and Python. He has extensive experience in building scalable systems, from the middle tier and the data tier through to visualization for analytics applications, using big data and NoSQL databases, and he has expertise in database internals, data warehouses, and dimensional modeling. As an Associate Director for Big Data at Verizon, Sumit strategized, managed, architected, and developed analytic platforms for machine learning applications. Sumit was the Chief Architect at ModelN/LeapfrogRX (2006-2013), where he architected the core analytics platform.

Sumit has recently authored a book with Apress called SQL on Big Data: Technology, Architecture and Roadmap, and he regularly speaks on the topic at big data conferences across the USA.

Sumit hiked to Mt. Everest Base Camp, at 18,200 feet, in October 2016. He is also an avid badminton player and won a bronze medal in the men's singles category at the 2015 Connecticut Open in the USA.

www.PacktPub.com

For support files and downloads related to your book, please visit www.PacktPub.com.

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

https://www.packtpub.com/mapt

Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.

Why subscribe?

Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser

What you need for this book

The reader should have some experience in programming (Java or Scala), some experience with Linux/Unix operating systems, and a basic knowledge of databases:

For Scala, the reader should know the basics of programming
For Spark, the reader should know the fundamentals of the Scala programming language
For Mesos, the reader should know the basics of operating systems administration
For Cassandra, the reader should know the fundamentals of databases
For Kafka, the reader should have basic knowledge of Scala

Who this book is for

This book is for software developers, data architects, and data engineers who want to learn how to integrate the most successful open source data stack architecture, how to choose the correct technology at every layer, and what the practical benefits are in each case.

There are a lot of books that talk about each technology separately. This book is for people looking for alternative technologies and practical examples on how to connect the entire stack.

Conventions

In this book, you will find a number of styles of text that distinguish between different kinds of information. Here are some examples of these styles, and an explanation of their meaning.

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "In the case of HDFS, we should change the mesos.hdfs.role in the file mesos-site.xml to the value of role1."

A block of code is set as follows:

[default]
exten => s,1,Dial(Zap/1|30)
exten => s,2,Voicemail(u100)
exten => s,102,Voicemail(b100)
exten => i,1,Voicemail(s0)

When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

[default]
exten => s,1,Dial(Zap/1|30)
exten => s,2,Voicemail(u100)
exten => s,102,Voicemail(b100)
exten => i,1,Voicemail(s0)

Any command-line input or output is written as follows:

# cp /usr/src/asterisk-addons/configs/cdr_mysql.conf.sample /etc/asterisk/cdr_mysql.conf

New terms and important words are shown in bold. Words that you see on the screen, in menus or dialog boxes for example, appear in the text like this: "clicking the Next button moves you to the next screen".

Note

Warnings or important notes appear in a box like this.

Tip

Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book-what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of. To send us general feedback, simply e-mail [email protected], and mention the book's title in the subject of your message. If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

You can download the code files by following these steps:

1. Log in or register to our website using your e-mail address and password.
2. Hover the mouse pointer on the SUPPORT tab at the top.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box.
5. Select the book for which you're looking to download the code files.
6. Choose from the drop-down menu where you purchased this book from.
7. Click on Code Download.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

WinRAR / 7-Zip for Windows
Zipeg / iZip / UnRarX for Mac
7-Zip / PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Fast-Data-Processing-Systems-with-SMACK-Stack. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Downloading the color images of this book

We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from https://www.packtpub.com/sites/default/files/downloads/FastDataProcessingSystemswithSMACKStack_ColorImages.pdf.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books-maybe a mistake in the text or the code-we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.

To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at [email protected] with a link to the suspected pirated material.

We appreciate your help in protecting our authors and our ability to bring you valuable content.

Questions

If you have a problem with any aspect of this book, you can contact us at [email protected], and we will do our best to address the problem.

Chapter 1.  An Introduction to SMACK

The goal of this chapter is to present the data problems and scenarios solved by this architecture. This chapter explains how each technology contributes to the SMACK stack, and how this modern pipeline architecture solves many of the problems found in modern data-processing environments. Here we will learn when to use SMACK and when it is not suitable. We will also touch on the new professional profiles created in the new data management era.

In this chapter we will cover the following topics:

Modern data-processing challenges
The data-processing pipeline architecture
SMACK technologies
Changing the data center operations
Data expert profiles
Is SMACK for me?

Modern data-processing challenges

We can enumerate four modern data-processing problems as follows:

Size matters: In modern times, data is getting bigger or, more accurately, the number of available data sources is increasing. In the previous decade, we could precisely identify our company's internal data sources: Customer Relationship Management (CRM), Point of Sale (POS), Enterprise Resource Planning (ERP), Supply Chain Management (SCM), and all our databases and legacy systems. It was easy: a system that is not internal is external. Today, it is exactly the same, except that not only do the data sources multiply over time, the amount of information flowing from external systems is also growing at an almost exponential rate. New data sources include social networks, banking systems, stock systems, tracking and geolocation systems, monitoring systems, sensors, and the Internet of Things; if a company's architecture is incapable of handling these use cases, then it can't respond to upcoming challenges.

Sample data: Obtaining a sample of production data is becoming more difficult. In the past, data analysts could have a fresh copy of production data on their desks almost daily. Today, this is increasingly difficult, either because of the amount of data to be moved or because of its expiration date; in many modern business models, data from an hour ago is practically obsolete.

Data validity: The validity of an analysis becomes obsolete faster. Assuming that the fresh-copy problem is solved, how often is new data needed? Looking for a trend in the last year is different from looking for one in the last few hours. If samples from a year ago are needed, what is the frequency of these samples? Many modern businesses don't even have this information, or worse, they have it but it is only stored.

Data Return on Investment (ROI): Data analysis becomes too slow to get any return on investment from the information. Now, suppose you have solved the problems of sample data and data validity. The challenge is to be able to analyze information in a timely manner so that the return on investment of all our efforts is profitable. Many companies invest in data, but never get the analysis to increase their income.

We can enumerate modern data needs which are as follows:

Scalable infrastructure: Companies constantly have to weigh the time and money they spend. Scalability in a data center means that the center grows in proportion to the business. Vertical scalability means adding more power to existing machines. Horizontal scalability means that once a layer has more demand and requires more infrastructure, hardware can be added so that processing needs are met. One modern requirement is to achieve horizontal scaling with low-cost hardware.

Geographically dispersed data centers: Geographically centralized data centers are being displaced, because companies need multiple data centers in multiple locations for several reasons: cost, ease of administration, or access to users. This implies a huge challenge for data center management; on the other hand, data center unification is a complex task.

Allow data volumes to be scaled as the business needs: The volume of data must scale dynamically according to business demand. Just as you can have a lot of demand at a certain time of day, you can have high demand in certain geographic regions. Scaling should be dynamically possible in time and space, especially horizontally.

Faster processing: Today, being able to work in real time is fundamental. We live in an age where data freshness often matters more than the amount or size of the data. If the data is not processed fast enough, it becomes stale quickly. Fresh information not only needs to be obtained quickly, it also has to be processed quickly.

Complex processing: In the past, data was smaller and simpler. Raw data doesn't help us much; the information must be processed efficiently, by several layers. The first layers are usually purely technical and the last layers mainly business-oriented. Processing complexity can kill even the best business ideas.

Constant data flow: For cost reasons, the number of data warehouses is decreasing. The era when data warehouses served just to store data is dying; today, no one can afford a data warehouse just to store information, as they become very expensive and meaningless. The better business trend is toward flows or streams of data. Data no longer stagnates; it moves like a large river. Performing analysis on these big information torrents is one of the objectives of modern businesses.

Visible, reproducible analysis: If we cannot reproduce phenomena, we cannot call ourselves scientists. Modern data science requires producing reports and graphs in real time in order to take timely decisions. The aim of data science is to make effective predictions based on observation. The process should be visible and reproducible.

The data-processing pipeline architecture

If you ask several people from the information technology world, they will agree on few things, except that we are always looking for a new acronym, and the year 2015 was no exception.

As this book's title says, SMACK stands for Spark, Mesos, Akka, Cassandra, and Kafka. All these technologies are open source and, with the exception of Akka, all are Apache Software Foundation projects. The acronym was coined by Mesosphere, a company that bundles these technologies together in a product called Infinity, designed in collaboration with Cisco to solve pipeline data challenges where the speed of response is fundamental, such as in fraud detection engines.

SMACK exists because one technology doesn't make an architecture. SMACK is a pipelined architecture model for data processing. A data pipeline is software that consolidates data from multiple sources and makes it available to be used strategically.

It is called a pipeline because each technology contributes its characteristics to a processing line similar to a traditional industrial assembly line. In this context, our canonical reference architecture has four parts: the storage, the message broker, the engine, and the hardware abstraction.

For example, Apache Cassandra alone solves some problems that any modern database can solve but, given its characteristics, it takes the lead in the storage role of our reference architecture.

Similarly, Apache Kafka was designed to be a message broker, and by itself solves many problems in specific businesses; however, its integration with other tools deserves a special place in our reference architecture over its competitors.
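
To make the division of labor concrete, here is a minimal Scala sketch of the pipeline roles wired together, assuming the spark-streaming-kafka and spark-cassandra-connector artifacts are on the classpath; the topic, keyspace, and table names are illustrative only, and each piece is covered in depth in the chapters ahead:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils
import com.datastax.spark.connector.streaming._

object SmackPipeline extends App {
  val conf = new SparkConf()
    .setAppName("smack-pipeline")
    .set("spark.cassandra.connection.host", "127.0.0.1") // the storage
  val ssc = new StreamingContext(conf, Seconds(5))       // the engine

  // The message broker: consume the "events" topic from Kafka via ZooKeeper
  val stream = KafkaUtils.createStream(
    ssc, "localhost:2181", "pipeline-group", Map("events" -> 1))

  // The engine: count occurrences of each event in every 5-second batch
  val counts = stream.map { case (_, value) => (value, 1L) }
                     .reduceByKey(_ + _)

  // The storage: persist each batch to smack.events, a table assumed
  // to have two columns matching the (text, count) pairs
  counts.saveToCassandra("smack", "events")

  ssc.start()
  ssc.awaitTermination()
}

Mesos sits underneath as the hardware abstraction, so a job like this runs unchanged whether it is submitted to a laptop or to a Mesos-managed cluster.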

The NoETL manifesto

The acronym ETL stands for Extract, Transform, Load. In the database data warehousing guide, Oracle says:

Designing and maintaining the ETL process is often considered one of the most difficult and resource intensive portions of a data warehouse project.

For more information, refer to http://docs.oracle.com/cd/B19306_01/server.102/b14223/ettover.htm.

Contrary to many companies' daily operations, ETL is not a goal; it is a step, and in fact a series of unnecessary steps:

Each ETL step can introduce errors and risk
It can duplicate data after failover
Tools can cost millions of dollars
It decreases throughput
It increases complexity
It writes intermediary files
It parses and re-parses plain text
It duplicates the pattern over all our data centers

NoETL pipelines fit on the SMACK stack: Spark, Mesos, Akka, Cassandra, and Kafka. And if you use SMACK, make sure it's highly available, resilient, and distributed.

A good sign that you're suffering from ETL-itis is writing intermediary files. Files are useful in day-to-day work, but as data types they are difficult to handle. Some programmers advocate replacing the file system with a better API.

Removing the E in ETL: Instead of text dumps that you need to parse over multiple systems, technologies such as Scala and Parquet can work with binary data that remains strongly typed; they represent a return to strong typing in the data ecosystem.

Removing the L in ETL: If data collection is backed by a distributed messaging system (Kafka, for example), you can do a real-time fan-out of the ingested data to all consumers. There is no need to batch-load.

The T in ETL: With this architecture, each consumer can do their own transformations.

So, the modern tendency is: no more Greek letter architectures, no more ETL.

Lambda architecture

The academic definition is a data-processing architecture designed to handle massive quantities of data by taking advantage of both batch and stream processing methods. The problem arises when we need to process data streams in real time.

Here, a special mention goes to two open source projects that allow batch processing and real-time stream processing in the same application: Apache Spark and Apache Flink. There is a battle between these two: Apache Spark is the solution led by Databricks, and Apache Flink is the solution led by data Artisans.

For example, Apache Spark and Apache Cassandra together meet two of the modern requirements described previously:

They handle massive data streams in real time
They handle multiple and different data models from multiple data sources

Most lambda solutions, as mentioned, cannot meet these two needs at the same time. As a demonstration of power, in an architecture based only on these two technologies, Apache Spark is responsible for the real-time analysis of both historical data and recent data obtained from the massive information torrent, and all this information and all the analysis results are persisted in Apache Cassandra. So, in the case of failure, we can recover the real-time data from any point in time; with a lambda architecture, this is not always possible.

Hadoop

Hadoop was designed to transfer processing closer to the data to minimize the amount of data shuffled across the network. It was designed with data warehouse and batch problems in mind; it fits into the slow data category, where size, scope, and completeness of data are more important than the speed of response.

The analogy is the sea versus the waterfall. In a sea of information, you have a huge amount of data, but it is a static, contained, motionless sea, perfect for batch processing without time pressures. In a waterfall, you have a huge amount of data that is dynamic, not contained, and in motion. In this context, your data often has an expiration date; after time passes, it is useless.

Some Hadoop adopters have been left questioning the true return on investment of their projects after running them for a while; this is not a technological fault in itself, but a question of whether it is the right tool for the application. SMACK has to be analyzed in the same way.

SMACK technologies

SMACK is a full stack for pipeline data architecture: Spark, Mesos, Akka, Cassandra, and Kafka. Further on in the book, we will also talk about the most important factor: the integration of these technologies.

Pipeline data architecture is required for online data stream processing, but there are a lot of books talking about each technology separately. This book talks about the entire full stack and how to perform integration.

This book is a compendium of how to integrate these technologies in a pipeline data architecture.

We talk about the five main concepts of pipeline data architecture and how to integrate, replace, and reinforce every layer:

The engine: Apache Spark
The actor model: Akka
The storage: Apache Cassandra
The message broker: Apache Kafka
The hardware scheduler: Apache Mesos

Figure 1.1 The SMACK pipeline architecture

Apache Spark

Spark is a fast and general engine for data processing on a large scale.

The Spark goals are:

Fast data processing
Ease of use
Supporting multiple languages
Supporting sophisticated analytics
Real-time stream processing
The ability to integrate with existing Hadoop data
An active and expanding community

Here is some chronology:

2009: Spark was initially started by Matei Zaharia at the UC Berkeley AMPLab
2010: Spark was open sourced under a BSD license
2013: Spark was donated to the Apache Software Foundation and its license switched to Apache 2.0
2014: Spark became a top-level Apache project
2014: The engineering team at Databricks used Spark to set a new world record in large-scale sorting

As you are reading this book, you probably know all the Spark advantages. But here, we mention the most important:

Spark is faster than Hadoop: Spark makes efficient use of memory and is able to execute equivalent jobs 10 to 100 times faster than Hadoop's MapReduce.

Spark is easier to use than Hadoop: You can develop in four languages: Scala, Java, Python, and, recently, R. Spark is implemented in Scala and Akka. When you work with collections in Spark, it feels as if you are working with local Java, Scala, or Python collections. For practical reasons, in this book we only provide examples in Scala.

Spark scales differently than Hadoop: In Hadoop, you require experts in specialized hardware to run monolithic software. With Spark, you can easily grow your cluster horizontally, adding new nodes of inexpensive, non-specialized hardware, and Spark has a lot of tools to help you manage your cluster.

Spark has it all in a single framework: Coarse-grained transformations, real-time data-processing functions, SQL-like handling of structured data, graph algorithms, and machine learning.

It is important to mention that Spark was made with Online Analytical Processing (OLAP) in mind, that is, batch jobs and data mining. Spark was not designed for Online Transaction Processing (OLTP), that is, fast and numerous atomic transactions; for this type of processing, we strongly advise the reader to consider the use of Erlang/Elixir.

Apache Spark has these main components:

Spark Core
Spark SQL
Spark Streaming
Spark MLlib
Spark GraphX

The reader will find that each Spark component normally has several books devoted to it. In this book, we just cover the essentials of Apache Spark needed for the SMACK stack.

In the SMACK stack, Apache Spark is the data-processing engine; it provides near real-time analysis of data (note the word near, because today processing petabytes of data cannot be done in real time).
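
As a taste of the engine, here is a minimal, self-contained word count, the canonical Spark example. It is a sketch that assumes only spark-core on the classpath, and the input path is illustrative:

import org.apache.spark.{SparkConf, SparkContext}

object WordCount extends App {
  // local[*] runs the engine on all local cores; on a cluster, this
  // would point to the cluster manager instead
  val sc = new SparkContext(
    new SparkConf().setAppName("word-count").setMaster("local[*]"))

  val counts = sc.textFile("data/input.txt") // RDD of lines
    .flatMap(_.split("\\s+"))                // transformation: split into words
    .map(word => (word, 1))                  // transformation: pair each word with 1
    .reduceByKey(_ + _)                      // transformation: sum the pairs

  counts.take(10).foreach(println)           // action: triggers the computation
  sc.stop()
}

Note that nothing is computed until the action in the last step; transformations only describe the job, which is part of how Spark makes efficient use of memory.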

Akka

Akka is an actor model implementation for the JVM; it is a toolkit and runtime for building highly concurrent, distributed, and resilient message-driven applications on the JVM.

The open source Akka toolkit was first released in 2009. It simplifies the construction of concurrent and distributed Java applications. Language bindings exist for both Java and Scala.

It is message-based and asynchronous; typically no mutable data is shared. It is primarily designed for actor-based concurrency:

Actors are arranged hierarchically
Each actor is created and supervised by its parent actor
Program failures, treated as events, are handled by an actor's supervisor
It is fault-tolerant
It has hierarchical supervision
Customizable failure strategies and detection
Asynchronous data passing
Parallelized
Adaptive and predictive
Load-balanced
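
As a minimal sketch of these ideas, the following assumes only the akka-actor artifact on the classpath; the actor, system, and message names are illustrative:

import akka.actor.{Actor, ActorSystem, Props}

class Greeter extends Actor {
  // receive defines how the actor reacts to each incoming message
  def receive = {
    case name: String => println(s"Hello, $name")
  }
}

object AkkaDemo extends App {
  val system = ActorSystem("demo")                        // the actor system
  val greeter = system.actorOf(Props[Greeter], "greeter") // supervised child
  greeter ! "SMACK" // ! sends asynchronously; no mutable data is shared
  system.terminate()
}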

Apache Cassandra

Apache Cassandra is a database with the scalability, availability, and performance necessary to compete with any database system in its class. We know that there are better database systems; however, Apache Cassandra is chosen because of its performance and its connectors built for Spark and Mesos.

In SMACK, Akka, Spark, and Kafka can store the data in Cassandra as a data layer. Also, Cassandra can handle operational data. Cassandra can also be used to serve data back to the application layer.

Cassandra is an open source distributed database that handles large amounts of data; originally developed at Facebook in 2008, it became a top-level Apache project in 2010.

Here are some Apache Cassandra features:

Extremely fast
Extremely scalable
Multiple data centers
No single point of failure
Can survive regional faults
Easy to operate
Automatic and configurable replication
Flexible data modeling
Perfect for real-time ingestion
Great community
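
As a minimal sketch of how a client talks to Cassandra, the following uses the DataStax Java driver from Scala, assuming cassandra-driver-core is on the classpath; the keyspace and table names are illustrative:

import com.datastax.driver.core.Cluster

object CassandraDemo extends App {
  // Connect to a local node; in production you would list several
  // contact points, since there is no single point of failure
  val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
  val session = cluster.connect()

  // Replication is automatic and configurable per keyspace
  session.execute("CREATE KEYSPACE IF NOT EXISTS smack WITH replication = " +
    "{'class': 'SimpleStrategy', 'replication_factor': 1}")
  session.execute("CREATE TABLE IF NOT EXISTS smack.events " +
    "(id int PRIMARY KEY, payload text)")
  session.execute("INSERT INTO smack.events (id, payload) VALUES (1, 'hello')")

  cluster.close()
}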

Apache Kafka

Apache Kafka is a distributed commit log, an alternative to publish-subscribe messaging.

Kafka stands in SMACK as the ingestion point for data, usually at the application layer. It takes data from one or more applications and streams it to the next points in the stack.

Kafka is a high-throughput distributed messaging system that handles massive data loads and avoids back pressure, so it can handle floods of data. It ingests incoming data volumes and distributes and partitions them across the nodes in the cluster.

Some Apache Kafka features:

High-performance distributed messaging
Decouples data pipelines
Massive data load handling
Supports a massive number of consumers
Distribution and partitioning between cluster nodes
Automatic broker failover
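
As a minimal sketch of the ingestion point, here is a producer against the standard Kafka client API, assuming kafka-clients is on the classpath; the broker address and topic name are illustrative:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object ProducerDemo extends App {
  val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092") // the broker to contact
  props.put("key.serializer",
    "org.apache.kafka.common.serialization.StringSerializer")
  props.put("value.serializer",
    "org.apache.kafka.common.serialization.StringSerializer")

  val producer = new KafkaProducer[String, String](props)
  // send() is asynchronous: records are batched in the background,
  // which is part of how Kafka sustains massive data loads
  producer.send(new ProducerRecord[String, String]("events", "key-1", "hello"))
  producer.close()
}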

Apache Mesos

Mesos is a distributed systems kernel. Mesos abstracts all the computer resources (CPU, memory, storage) away from machines (physical or virtual), enabling fault-tolerant and elastic distributed systems to be built easily and run effectively.

Mesos was built using Linux kernel principles and was first presented in 2009 (under the name Nexus). Later, in 2011, it was presented again by Matei Zaharia.

Mesos is the foundation of several frameworks; the main three are:

Apache Aurora
Chronos
Marathon

In SMACK, Mesos' task is to orchestrate the components and manage the resources they use.
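
As a small illustration of that orchestration, a Spark job can target a Mesos cluster instead of a standalone one with nothing more than a different master URL; the host is illustrative, 5050 is the Mesos master's default port, and the class and JAR refer to the hypothetical pipeline sketch shown earlier:

# ./bin/spark-submit --master mesos://mesos-master:5050 \
    --class SmackPipeline smack-pipeline.jar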

Changing the data center operations

And here is the point where data processing changes data center operation.

From scale-up to scale-out

Across businesses, we are moving from specialized, proprietary, and typically expensive supercomputers to clusters of commodity machines connected by low-cost networks.

The Total Cost of Ownership (TCO) determines the fate, quality, and size of a data center. If the business is small, the data center should be small; as the business demands, the data center will grow or shrink.

Currently, one common practice is to create a dedicated cluster for each technology: a Spark cluster, a Kafka cluster, a Storm cluster, a Cassandra cluster, and so on, and as a result the overall TCO tends to increase.

The open-source predominance

Modern organizations adopt open source to avoid two old and annoying dependencies: vendor lock-in and waiting on an external entity for bug fixes.