
Scalable Data Architecture with Java E-Book

Sinchan Banerjee


Publisher: Packt Publishing
Category: Lifestyle
Language: English

Description

Java architectural patterns and tools help architects to build reliable, scalable, and secure data engineering solutions that collect, manipulate, and publish data.
This book will help you make the most of the architecting data solutions available with clear and actionable advice from an expert.
You’ll start with an overview of data architecture, exploring responsibilities of a Java data architect, and learning about various data formats, data storage, databases, and data application platforms as well as how to choose them. Next, you’ll understand how to architect a batch and real-time data processing pipeline. You’ll also get to grips with the various Java data processing patterns, before progressing to data security and governance. The later chapters will show you how to publish Data as a Service and how you can architect it. Finally, you’ll focus on how to evaluate and recommend an architecture by developing performance benchmarks, estimations, and various decision metrics.
By the end of this book, you’ll be able to successfully orchestrate data architecture solutions using Java and related technologies as well as to evaluate and present the most suitable solution to your clients.

Details

You can read the e-book in Legimi apps or in any app that supports the following formats:

EPUB

MOBI

Page count: 443

Year of publication: 2022


Sample

Scalable Data Architecture with Java

Build efficient enterprise-grade data architecting solutions using Java

Sinchan Banerjee

BIRMINGHAM—MUMBAI

Scalable Data Architecture with Java

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Publishing Product Manager: Ali Abidi

Senior Editor: Nazia Shaikh

Content Development Editor: Manikandan Kurup

Technical Editor: Sweety Pagaria

Copy Editor: Safis Editing

Project Coordinator: Farheen Fathima

Proofreader: Safis Editing

Indexer: Rekha Nair

Production Designer: Shyam Sundar Korumilli

Marketing Coordinator: Abeer Dawe

First published: October 2022

Production reference: 1220922

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham

B3 2PB, UK.

ISBN 978-1-80107-308-0

www.packt.com

Dedicated to Maa and Baba, to whom I am eternally thankful for the way they raised me to become the person I am today

Contributors

About the author

Sinchan Banerjee is a principal data architect at UST Inc. He works for their client Anthem to architect, build, and deliver scalable, robust data engineering solutions to solve their business problems. Prior to his journey with UST, he worked for various Fortune 500 organizations, such as Amex, Optum, Impetus, and HP, designing, architecting, and building robust data engineering solutions for very high volumes of data. He is the lead author of a patent on storage capacity forecasting and is the co-author of multiple international publications. He is also a certified AWS Professional and a certified Java programmer. He has also been a recipient of multiple awards and accolades for exceptional technical contribution, leadership, and innovation.

I would like to thank Packt publication for giving me the opportunity to write this book and share my knowledge. I am also grateful to the editorial and technical reviewer team for their valuable inputs and reviews which made the book a better read. Finally, I would like to thank my wife, who partnered this journey of book writing along with me and constantly inspired, supported, and encouraged me to write to the best of my abilities.

About the reviewers

Sourin Sarkar started his journey as a programmer almost 30 years ago. Currently, he is an architect in a top memory technology company. He works on various security solutions in the embedded security space. He has worked in various technology areas during his career and has been architecting technology solutions for the past 15 years while working with technology giants in the security, memory, storage, and data center domains. He is enthusiastic about security technology, various macro to nano embedded technology, automotive solutions, autonomous solutions, memory technology, robotics, green technology sectors, and various other technologies. He is active in the innovation space and has many issued and filed patents to his name.

I would like to thank my parents, teachers, friends, and highly respected mentors in my career for what I am today. Without them, it would not have been possible. I would like to thank Packt Publishing for giving me this opportunity to review a wonderful book and wish all the best to the author, while looking forward to working with Packt Publishing in the future.

Khushboo K is a big data leader with over a decade of IT experience. She has led and delivered several data engineering solutions for various clients in the US, UK, India, and New Zealand. After a stint with multiple multinational corporations, she started her own venture where she provides data engineering consultancy and training services.

Preface

Section 1 – Foundation of Data Systems

1 Basics of Modern Data Architecture

Exploring the landscape of data engineering

What is data engineering?

Dimensions of data

Types of data engineering problems

Responsibilities and challenges of a Java data architect

Data architect versus data engineer

Challenges of a data architect

Techniques to mitigate those challenges

Summary

2 Data Storage and Databases

Understanding data types, formats, and encodings

Data types

Data formats

Understanding file, block, and object storage

File storage

Block storage

Object storage

The data lake, data warehouse, and data mart

Data lake

Data warehouse

Data marts

Databases and their types

Relational database

NoSQL database

Data model design considerations

Summary

3 Identifying the Right Data Platform

Technical requirements

Virtualization and containerization platforms

Benefits of virtualization

Containerization

Benefits of containerization

Kubernetes

Hadoop platforms

Hadoop architecture

Cloud platforms

Benefits of cloud computing

Choosing the correct platform

When to choose virtualization versus containerization

When to use big data

Choosing between on-premise versus cloud-based solutions

Choosing between various cloud vendors

Summary

Section 2 – Building Data Processing Pipelines

4 ETL Data Load – A Batch-Based Solution to Ingesting Data in a Data Warehouse

Technical requirements

Understanding the problem and source data

Problem statement

Understanding the source data

Building an effective data model

Relational data warehouse schemas

Evaluation of the schema design

Designing the solution

Implementing and unit testing the solution

Summary

5 Architecting a Batch Processing Pipeline

Technical requirements

Developing the architecture and choosing the right tools

Problem statement

Analyzing the problem

Architecting the solution

Factors that affect your choice of storage

Determining storage based on cost

The cost factor in the processing layer

Implementing the solution

Profiling the source data

Writing the Spark application

Deploying and running the Spark application

Developing and testing a Lambda trigger

Performance tuning a Spark job

Querying the ODL using AWS Athena

Summary

6 Architecting a Real-Time Processing Pipeline

Technical requirements

Understanding and analyzing the streaming problem

Problem statement

Analyzing the problem

Architecting the solution

Implementing and verifying the design

Setting up Apache Kafka on your local machine

Developing the Kafka streaming application

Unit testing a Kafka Streams application

Configuring and running the application

Creating a MongoDB Atlas cloud instance and database

Configuring Kafka Connect to store the results in MongoDB

Verifying the solution

Summary

7 Core Architectural Design Patterns

Core batch processing patterns

The staged Collect-Process-Store pattern

Common file format processing pattern

The Extract-Load-Transform pattern

The compaction pattern

The staged report generation pattern

Core stream processing patterns

The outbox pattern

The saga pattern

The choreography pattern

The Command Query Responsibility Segregation (CQRS) pattern

The strangler fig pattern

The log stream analytics pattern

Hybrid data processing patterns

The Lambda architecture

The Kappa architecture

Serverless patterns for data ingestion

Summary

8 Enabling Data Security and Governance

Technical requirements

Introducing data governance – what and why

When to consider data governance

The DGI data governance framework

Practical data governance using DataHub and NiFi

Creating the NiFi pipeline

Setting up DataHub

Governance activities

Understanding the need for data security

Solution and tools available for data security

Summary

Section 3 – Enabling Data as a Service

9 Exposing MongoDB Data as a Service

Technical requirements

Introducing DaaS – what and why

Benefits of using DaaS

Creating a DaaS to expose data using Spring Boot

Problem statement

Analyzing and designing a solution

Implementing the Spring Boot REST application

Deploying the application in an ECS cluster

API management

Enabling API management over the DaaS API using AWS API Gateway

Summary

10 Federated and Scalable DaaS with GraphQL

Technical requirements

Introducing GraphQL – what, when, and why

Operation types

Why use GraphQL?

When to use GraphQL

Core architectural patterns of GraphQL

A practical use case – exposing federated data models using GraphQL

Summary

Section 4 – Choosing Suitable Data Architecture

11 Measuring Performance and Benchmarking Your Applications

Performance engineering and planning

Performance engineering versus performance testing

Tools for performance engineering

Publishing performance benchmarks

Optimizing performance

Java Virtual Machine and garbage collection optimizations

Big data performance tuning

Optimizing streaming applications

Database tuning

Summary

12 Evaluating, Recommending, and Presenting Your Solutions

Creating cost and resource estimations

Storage and compute capacity planning

Effort and timeline estimation

Creating an architectural decision matrix

Data-driven architectural decisions to mitigate risk

Presenting the solution and recommendations

Summary

Index

Other Books You May Enjoy

Preface

When I started writing this book, I looked back at my experience in architecting and developing data engineering solutions, delivering and running those solutions effectively in production, and helping many companies to build and manage scalable and robust data pipelines and asked myself – What are the most useful things that I can share to help an aspiring or beginner data architect, a data engineer, or a Java developer to become an expert data architect? This book reflects the work I do on a daily basis, to design, develop, and maintain scalable, robust, and cost-effective solutions for different data-engineering problems.

Java architectural patterns and tools enable architects to develop reliable, scalable, and secure data engineering solutions to collect, manipulate, manage, and publish data. There are many books and online materials that discuss data architectures in general. There are other sets of books and online materials that focus on and dive deep into the technology stack. While such materials provide architects with essential knowledge, they often lack details on how an architect should approach a data engineering problem practically and create the best-suited architecture by using logical inference. In this book, I have tried to formalize a few techniques by which a data architect can approach a problem to create effective solutions.

In this book, I will take you on a journey in which you learn the basics of data engineering and how to use the basics to analyze and propose solutions for a data engineering problem. I also discuss how a beginner architect can choose the correct technology stack to implement a solution. I also touch upon data security and governance for those solutions.

One of the challenges that architects face is that there is always more than one way to do things. We also discuss how to measure different architectural alternatives and how you can correctly choose the best-suited alternative using data-driven techniques.

Who this book is for

Scalable Data Architecture with Java is written for Java developers, data engineers, and aspiring data architects who have at least some working knowledge of either backend systems or data engineering solutions. It also assumes that you know the basic concepts of Java. This book will help you grow into a successful Java-based data architect.

Data architects and associate architects will find this book helpful to hone their skills and excel at their work. Non-Java backend developers or data engineers can also use the concepts of this book. However, it might be difficult for them to follow the code and implementation of the solutions.

What this book covers

Chapter 1, Basics of Modern Data Architecture, is a short introduction to data engineering, basic concepts of data engineering, and the role a Java data architect plays in data engineering.

Chapter 2, Data Storage and Databases, is a brief discussion about various data types, storage formats, data formats, and databases. It also discusses when to use them.

Chapter 3, Identifying the Right Data Platform, provides an overview of various platforms to deploy data pipelines and how to choose the correct platform.

Chapter 4, ETL Data Load – A Batch-Based Solution to Ingesting Data in a Data Warehouse, discusses how to approach, analyze, and architect an effective solution for a batch-based data ingestion problem using Spring Batch and Java.

Chapter 5, Architecting a Batch Processing Pipeline, discusses how to architect and implement a data analysis pipeline in AWS using S3, Apache Spark (Java), AWS Elastic MapReduce (EMR), and AWS Athena for a big data use case.

Chapter 6, Architecting a Real-Time Processing Pipeline, provides a step-by-step guide to building a real-time streaming solution to predict the risk category of a loan application using Java, Kafka, and related technologies.

Chapter 7, Core Architectural Design Patterns, discusses various common architectural patterns used to solve data engineering problems and when to use them.

Chapter 8, Enabling Data Security and Governance, introduces data governance and discusses how to apply it using a practical use case. It also briefly touches upon the topic of data security.

Chapter 9, Exposing MongoDB Data as a Service, provides a step-by-step guide on how to build Data as a Service to expose MongoDB data using a REST API.

Chapter 10, Federated and Scalable DaaS with GraphQL, discusses what GraphQL is, various GraphQL patterns, and how to publish data using GraphQL.

Chapter 11, Measuring Performance and Benchmarking Your Applications, provides an overview of performance engineering, how to measure performance and create benchmarks, and how to optimize performance.

Chapter 12, Evaluating, Recommending, and Presenting Your Solutions, discusses how to evaluate and choose the best-suited alternative among various architectures and how to present the recommended architecture effectively.

To get the most out of this book

To get the most out of this book, you are expected to know Core Java and Maven. Basic knowledge of Apache Spark is desirable for Chapter 5, Architecting a Batch Processing Pipeline. Basic knowledge of Kafka is desirable for Chapter 6, Architecting a Real-Time Processing Pipeline. Basic knowledge of MongoDB also helps in understanding the implementations in Chapters 6, 9, and 10.

You can set up your local environment by ensuring the Java SDK, Maven, and IntelliJ IDEA Community Edition are installed. You can use the following links for installation:

JDK installation guide: https://docs.oracle.com/en/java/javase/11/install/overview-jdk-installation.html#GUID-8677A77F-231A-40F7-98B9-1FD0B48C346A

Maven installation guide: https://maven.apache.org/install.html

IntelliJ IDEA installation guide: https://www.jetbrains.com/help/idea/installation-guide.html

If you are using the digital version of this book, we advise you to type the code yourself or access the code from the book’s GitHub repository (a link is available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.

Download the example code files

You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Scalable-Data-Architecture-with-Java. If there’s an update to the code, it will be updated in the GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Download the color images

We also provide a PDF file that has color images of the screenshots and diagrams used in this book. You can download it here: https://packt.link/feLcH.

Conventions used

There are a number of text conventions used throughout this book.

Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: “So, the KStream bean is created as an instance of KStream<String,String>.”

A block of code is set as follows:

public interface Transformer<K, V, R> {
    void init(ProcessorContext var1);
    R transform(K var1, V var2);
    void close();
}

Any command-line input or output is written as follows:

bin/connect-standalone.sh config/connect-standalone.properties connect-riskcalc-mongodb-sink.properties

Bold: Indicates a new term, an important word, or words that you see onscreen. For instance, words in menus or dialog boxes appear in bold. Here is an example: “Here, click the Build a Database button to create a new database instance.”

Tips or important notes

Appear like this.

Get in touch

Feedback from our readers is always welcome.

General feedback: If you have questions about any aspect of this book, email us at [email protected] and mention the book title in the subject of your message.

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata and fill in the form.

Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Share Your Thoughts

Once you’ve read Scalable Data Architecture with Java, we’d love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.

Your review is important to us and the tech community and will help us make sure we’re delivering excellent quality content.

Section 1 – Foundation of Data Systems

In this section, you will be introduced to various kinds of data engineering problems and the role of a data architect in solving the problems. You will also learn the basics of data format, storage, databases, and data platforms needed to architect a solution.

This section comprises the following chapters:

Chapter 1, Basics of Modern Data Architecture

Chapter 2, Data Storage and Databases

Chapter 3, Identifying the Right Data Platform

1 Basics of Modern Data Architecture

With the advent of the 21st century, growing internet usage and the emergence of more powerful data insight tools and technologies have caused a data explosion, and data has become the new gold. This has led to an increased demand for useful and actionable data, as well as a need for quality data engineering solutions. However, architecting and building scalable, reliable, and secure data engineering solutions is often complicated and challenging.

A poorly architected solution often fails to meet the needs of the business. The data quality is poor, the solution fails to meet its SLAs, or it is not sustainable or scalable as the data grows in production. To help data engineers and architects build better solutions, dozens of open source and proprietary tools are released every year. Even a well-designed solution sometimes fails because of a poor choice or implementation of the tools.

This book discusses various architectural patterns, tools, and technologies with step-by-step hands-on explanations to help an architect choose the most suitable solution and technology stack to solve a data engineering problem. Specifically, it focuses on tips and tricks to make architectural decisions easier. It also covers other essential skills that a data architect requires such as data governance, data security, performance engineering, and effective architectural presentation to customers or upper management.

In this chapter, we will explore the landscape of data engineering and the basic features of data in modern business ecosystems. We will cover various categories of modern data engineering problems that a data architect tries to solve. Then, we will learn about the roles and responsibilities of a Java data architect. We will also discuss the challenges that a data architect faces while designing a data engineering solution. Finally, we will provide an overview of the techniques and tools that we’ll discuss in this book and how they will help an aspiring data architect do their job more efficiently and be more productive.

In this chapter, we’re going to cover the following main topics:

Exploring the landscape of data engineering

Responsibilities and challenges of a Java data architect

Techniques to mitigate those challenges

Exploring the landscape of data engineering

In this section, you will learn what data engineering is and why it is needed. You will also learn about the various categories of data engineering problems and some real-world scenarios where they are found. It is important to understand the varied nature of data engineering problems before you learn how to architect solutions for such real-world problems.

What is data engineering?

By definition, data engineering is the branch of software engineering that specializes in collecting, analyzing, transforming, and storing data in a usable and actionable form.

With the growth of social platforms, search engines, and online marketplaces, there has been an exponential increase in the rate of data generation. In 2020 alone, around 2,500 petabytes of data was generated by humans each day. It is estimated that this figure will go up to 468 exabytes per day by 2025. The high volume and availability of data have enabled rapid technological development in AI and data analytics. This has led businesses, corporations, and governments to gather insights like never before to give customers a better experience of their services.

However, raw data is seldom usable as is. As a result, there is an increased demand for usable data that is secure and reliable. Data engineering revolves around creating scalable solutions to collect the raw data and then analyze, validate, transform, and store it in a usable and actionable format. In some scenarios and organizations, businesses also expect this usable and actionable data to be published as a service.

Before we dive deeper, let’s explore a few practical use cases of data engineering:

Use case 1: American Express (Amex), a leading credit card provider, needs to group customers with similar spending behavior so that it can generate personalized offers and discounts for targeted customers. To do this, Amex needs to run a clustering algorithm on the data. However, the data is collected from various sources. Some of it flows from MobileApp, some from different Salesforce organizations such as sales and marketing, and some from logs and JSON events. This data is known as raw data, and it can contain junk characters, missing fields, special characters, and sometimes unstructured data such as log files. Here, the data engineering team ingests the data from the different sources, cleans it, transforms it, and stores it in a usable structured format. This ensures that the application that performs clustering can run on clean and sorted data.

Use case 2: A health insurance provider receives data from multiple sources. This data comes from various consumer-facing applications, third-party vendors, Google Analytics, other marketing platforms, and mainframe batch jobs. However, the company wants a single data repository that can serve different teams as the source of clean and sorted data. Such a requirement can be implemented with the help of data engineering.
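The cleansing step mentioned in use case 1 can be pictured with a minimal sketch. It assumes a comma-separated raw record with three fields (customer ID, category, and amount); this layout, the class name RawRecordCleaner, and the rule of rejecting incomplete records are all assumptions made for illustration, not details of any real Amex pipeline.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;

// Hypothetical cleansing step for comma-separated raw records. The three-field layout
// (customerId, category, amount) is an assumption made for this sketch.
final class RawRecordCleaner {
    private static final int EXPECTED_FIELDS = 3;

    private RawRecordCleaner() { }

    /** Returns the cleaned fields, or empty if the record has missing or blank fields. */
    static Optional<List<String>> clean(String rawLine) {
        List<String> fields = Arrays.stream(rawLine.split(",", -1)) // -1 keeps trailing empty fields
                .map(f -> f.replaceAll("\\p{Cntrl}", "").trim())    // drop control (junk) characters
                .collect(Collectors.toList());
        boolean complete = fields.size() == EXPECTED_FIELDS
                && fields.stream().noneMatch(String::isEmpty);
        return complete ? Optional.of(fields) : Optional.empty();
    }
}
```

A real pipeline would typically route rejected records to an error store for inspection instead of silently dropping them.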

Now that we understand data engineering, let’s look at a few of its basic concepts. We will start by looking at the dimensions of data.

Dimensions of data

Any discussion on data engineering is incomplete without talking about the dimensions of data. The dimensions of data are some basic characteristics by which the nature of data can be analyzed. The starting point of data engineering is analyzing and understanding the data.

To successfully analyze and build a data-oriented solution, the four Vs of modern data analysis are very important. These can be seen in the following diagram:

Figure 1.1 – Dimensions of data

Let’s take a look at each of these Vs in detail:

Volume: This refers to the size of the data, which can range from a few bytes to a few hundred petabytes. Volume analysis usually involves understanding the size of the whole dataset or the size of a single data record or event. Understanding the size is essential when choosing technologies and making infrastructure sizing decisions to process and store the data.

Velocity: This refers to the speed at which data is generated. High-velocity data requires distributed processing. Analyzing the speed of data generation is especially critical for scenarios where businesses require usable data to be made available in real time or near-real time.

Variety: This refers to the different formats in which a data source can generate data. Usually, the data is one of the following three types:

Structured: Structured data is where the number of columns, their data types, and their positions are fixed. All classical datasets that fit neatly in the relational data model are perfect examples of structured data.

Unstructured: These datasets don’t conform to a specific structure. Each record in such a dataset can have any number of columns in any arbitrary format. Examples include audio and video files.

Semi-structured: Semi-structured data has a structure, but the order of the columns is not fixed and the presence of any given column in a record is optional. A classic example of such a dataset is any hierarchical data source, such as a .json or a .xml file.

Veracity: This refers to the trustworthiness of the data. In simple terms, it is related to the quality of the data. Analyzing the noise in the data is as important as analyzing any other aspect of it, because this analysis helps create robust processing rules that ultimately determine how successful a data engineering solution is. Many well-engineered and well-designed data engineering solutions fail in production due to a lack of understanding of the quality and noise of the source data.
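As a rough illustration of how volume and velocity feed into infrastructure sizing, the following sketch estimates the daily raw data volume from an average record size and an event rate. The class name VolumeEstimate, its methods, and the sample figures (2 KB records at 5,000 events per second) are assumptions made for illustration, not numbers from a real system.

```java
// Hypothetical helper: a back-of-the-envelope volume estimate, not a sizing tool.
final class VolumeEstimate {
    private static final long SECONDS_PER_DAY = 24L * 60 * 60;

    private VolumeEstimate() { }

    /** Bytes per day = average record size x events per second x seconds per day. */
    static long dailyBytes(long avgRecordBytes, long eventsPerSecond) {
        // multiplyExact throws ArithmeticException instead of silently overflowing.
        long bytesPerSecond = Math.multiplyExact(avgRecordBytes, eventsPerSecond);
        return Math.multiplyExact(bytesPerSecond, SECONDS_PER_DAY);
    }

    /** Converts bytes to gigabytes using decimal units (1 GB = 10^9 bytes). */
    static double toGigabytes(long bytes) {
        return bytes / 1_000_000_000.0;
    }

    public static void main(String[] args) {
        // Assumed workload: 2 KB records arriving at 5,000 events per second.
        long bytes = dailyBytes(2_000, 5_000);
        System.out.println("Estimated raw volume in GB per day: " + toGigabytes(bytes));
    }
}
```

With these assumed figures, the estimate comes to 864 GB of raw data per day, before replication, compression, or retention are taken into account.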

Now that we have a fair idea of the characteristics by which the nature of data can be analyzed, let’s understand how they play a vital role in different types of data engineering problems.

Types of data engineering problems

Broadly speaking, the kinds of problems that data engineers solve can be classified into two basic types:

Processing problems

Publishing problems

Let’s take a look at these problems in more detail.

Processing problems

The problems that are related to collecting raw data or events, processing them, and storing them in a usable or actionable data format are broadly categorized as processing problems. Typical use cases can be a data ingestion problem such as Extract, Transform, Load (ETL) or a data analytics problem such as generating a year-on-year report.

Again, processing problems can be divided into three major categories, as follows:

Batch processing

Real-time processing

Near-real-time processing

This can be seen in the following diagram:

Figure 1.2 – Categories of processing problems

Let’s take a look at each one of these categories in detail.

Batch processing

If the SLA of processing is more than 1 hour (for example, if the processing needs to be done once in 2 hours, once daily, once weekly, or once biweekly), then such a problem is called a batch processing problem. This is because, when a system processes data at a longer time interval, it usually processes a batch of data records and not a single record/event. Hence, such processing is called batch processing:

Figure 1.3 – Batch processing problem

Usually, the design of a batch processing solution depends on the volume of data. If the data volume is more than tens of terabytes, it usually needs to be processed as big data. Also, since batch processes are schedule-driven, a workflow manager or scheduler is needed to run the jobs. We will discuss batch processing in more detail later in this book.

Real-time processing

A real-time processing problem is a use case where raw data/events are to be processed on the fly and the response or the processing outcome should be available within seconds, or at most within 2 to 5 minutes.
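The SLA thresholds that this section uses to tell the processing categories apart can be sketched as a small classifier. The enum name ProcessingCategory and the method forSla are hypothetical. The sketch also assumes that anything between the real-time limit (taken here as 5 minutes) and the batch limit of 1 hour counts as near-real-time processing, and that an SLA of exactly 1 hour falls on the near-real-time side, which the text leaves open.

```java
import java.time.Duration;

// Hypothetical classifier that encodes the SLA thresholds described in this section.
enum ProcessingCategory {
    REAL_TIME, NEAR_REAL_TIME, BATCH;

    // Real time: response within seconds, or at most about 5 minutes.
    private static final Duration REAL_TIME_LIMIT = Duration.ofMinutes(5);
    // Batch: SLA of more than 1 hour.
    private static final Duration BATCH_LIMIT = Duration.ofHours(1);

    static ProcessingCategory forSla(Duration sla) {
        if (sla.isNegative() || sla.isZero()) {
            throw new IllegalArgumentException("SLA must be a positive duration");
        }
        if (sla.compareTo(REAL_TIME_LIMIT) <= 0) {
            return REAL_TIME;
        }
        if (sla.compareTo(BATCH_LIMIT) > 0) {
            return BATCH;
        }
        // Assumption: everything in between, including exactly 1 hour, is near-real-time.
        return NEAR_REAL_TIME;
    }
}
```

For example, under these assumptions, an SLA of 30 seconds maps to REAL_TIME, 30 minutes to NEAR_REAL_TIME, and a daily job to BATCH.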

As shown in the following diagram, a real-time process receives data in the form of an event stream and immediately processes it. Then, it either sends the processed event to a sink or to another stream of events to be processed further. Since this kind of processing happens on a stream of events, this is known as real-time stream processing:

Figure 1.4 – Real-time stream processing

As shown in Figure 1.4, event E0 gets processed and sent out by the streaming application, while events E1, E2, and E3 are waiting in the queue to be processed. At t1, event E1 also gets processed, showing the continuous processing of events by the streaming application.

An event can be generated at any time (24/7), which creates a new kind of problem. If the producer application sends an event directly to a consumer, there is a chance of event loss unless the consumer application is running 24/7. The consumer application could not even be brought down for maintenance or upgrades, which means it would need zero downtime. However, zero downtime is not realistic for any application. Such a model of direct communication between applications is called point-to-point communication.

Another challenge in point-to-point communication for real-time problems is the speed of processing: the consumer must always process events at least as fast as the producer generates them. Otherwise, events will be lost or the consumer may run out of memory. So, instead of sending events directly to the consumer application, the producer sends them asynchronously to an Event Bus or a Message Bus. An Event Bus is a highly available container, such as a queue or a topic, that can hold events. This pattern of sending and receiving data asynchronously through a highly available Event Bus is called the Pub-Sub framework.
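The decoupling described above can be sketched with an in-memory stand-in for the Event Bus. This is only an illustration under stated assumptions: EventBusSketch and its methods are hypothetical names, and a LinkedBlockingQueue inside a single JVM is not highly available in the way a real message broker is. The point is that the producer can keep publishing while the consumer is down, and the consumer catches up later at its own pace.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical in-memory stand-in for an Event Bus (not a real broker).
final class EventBusSketch {
    private final BlockingQueue<String> buffer = new LinkedBlockingQueue<>();

    // Producer side: returns immediately, whether or not a consumer is running.
    void publish(String event) {
        buffer.add(event);
    }

    // Consumer side: drains whatever accumulated while the consumer was away.
    List<String> consumeAvailable() {
        List<String> batch = new ArrayList<>();
        buffer.drainTo(batch);
        return batch;
    }

    public static void main(String[] args) {
        EventBusSketch bus = new EventBusSketch();
        // The consumer is "down for maintenance" while these events arrive.
        bus.publish("E1");
        bus.publish("E2");
        bus.publish("E3");
        // The consumer comes back and processes the backlog; nothing was lost.
        System.out.println(bus.consumeAvailable()); // [E1, E2, E3]
    }
}
```

In a point-to-point design, the three events above would have had nowhere to go while the consumer was offline.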

The following are some important terms related to real-time processing problems:

Events: This can be defined as a data packet generated as a result of an action, a trigger, or an occurrence. They are also popularly known as messages in the Pub-Sub framework.

Producer: A system or application that produces and sends events to a Message Bus is called a publisher or a producer.

Consumer: A system or application that consumes events from a Message Bus to process is called a consumer or a subscriber.

Queue: This has a single producer and a single consumer. Once a message/event is consumed by a consumer, that event is removed from the queue. As an analogy, it’s like an SMS or an email sent to you by one of your friends.

Topic: Unlike a queue, a topic can have multiple consumers and producers. It’s a broadcasting channel. As an analogy, it’s like a TV channel such as HBO, where multiple producers are hosting their show, and if you have subscribed to that channel, you will be able to watch any of those shows.
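The difference between a queue and a topic can be illustrated with a small, single-threaded sketch. The class names (QueueVersusTopic, SimpleQueue, SimpleTopic) and the subscriber names are hypothetical and exist only for this illustration; real message brokers add persistence, acknowledgements, and concurrency that are left out here.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Hypothetical, single-threaded illustration of queue versus topic delivery.
final class QueueVersusTopic {

    // Queue: an event is removed once a consumer has taken it.
    static final class SimpleQueue {
        private final Queue<String> events = new ArrayDeque<>();

        void send(String event) {
            events.add(event);
        }

        // Returns null when the queue is empty.
        String receive() {
            return events.poll();
        }
    }

    // Topic: every subscriber gets its own copy of each published event.
    static final class SimpleTopic {
        private final Map<String, List<String>> inboxes = new LinkedHashMap<>();

        void subscribe(String subscriber) {
            inboxes.putIfAbsent(subscriber, new ArrayList<>());
        }

        void publish(String event) {
            for (List<String> inbox : inboxes.values()) {
                inbox.add(event);
            }
        }

        List<String> inboxOf(String subscriber) {
            return inboxes.getOrDefault(subscriber, List.of());
        }
    }

    public static void main(String[] args) {
        SimpleQueue queue = new SimpleQueue();
        queue.send("E1");
        System.out.println(queue.receive()); // E1
        System.out.println(queue.receive()); // null, because E1 was already consumed

        SimpleTopic topic = new SimpleTopic();
        topic.subscribe("fraud-service");
        topic.subscribe("audit-service");
        topic.publish("E1");
        System.out.println(topic.inboxOf("fraud-service")); // [E1]
        System.out.println(topic.inboxOf("audit-service")); // [E1]
    }
}
```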

A real-world example of a real-time problem is credit card fraud detection. You might have received an automated confirmation call from your bank to verify the authenticity of a transaction that seemed suspicious while it was being executed.

Near-real-time processing

Near-real-time processing, as its name suggests, is a problem whose response or processing time doesn’t need to be as fast as real time but should be less than 1 hour. One of the features of near-real-time processing is that it processes events in micro-batches. For example, a near-real-time process may process data at a batch interval of 5 minutes, at a batch size of 100 records, or a combination of both (whichever condition is satisfied first).

At time tx, all the events (E1, E2, and E3) that were generated between t0 and tx are processed together by the near-real-time processing job. Similarly, all the events (E4, E5, and E6) generated between tx and tn are processed together.

Figure 1.5 – Near-real-time processing
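The "whichever condition is satisfied first" rule for micro-batches can be sketched as follows. MicroBatcher is a hypothetical class written for this illustration, not part of any framework. To keep the sketch deterministic, the arrival time is passed in explicitly, and the interval is only checked when an event arrives; a production job would also flush on a timer so that a quiet stream does not hold a batch open indefinitely.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical micro-batcher: flushes on batch size or elapsed interval,
// whichever condition is satisfied first.
final class MicroBatcher {
    private final int maxBatchSize;
    private final long maxIntervalMillis;
    private final List<String> pending = new ArrayList<>();
    private long batchStartMillis;

    MicroBatcher(int maxBatchSize, long maxIntervalMillis) {
        this.maxBatchSize = maxBatchSize;
        this.maxIntervalMillis = maxIntervalMillis;
    }

    // Adds an event that arrived at nowMillis. Returns the flushed batch,
    // or an empty list if neither condition has been met yet.
    List<String> add(String event, long nowMillis) {
        if (pending.isEmpty()) {
            batchStartMillis = nowMillis; // the first event opens a new batch
        }
        pending.add(event);
        boolean sizeReached = pending.size() >= maxBatchSize;
        boolean intervalElapsed = nowMillis - batchStartMillis >= maxIntervalMillis;
        if (sizeReached || intervalElapsed) {
            List<String> batch = new ArrayList<>(pending);
            pending.clear();
            return batch;
        }
        return List.of();
    }

    public static void main(String[] args) {
        // Assumed settings for a short demo: 3 records or 5 minutes
        // (the text uses 100 records as its example batch size).
        MicroBatcher batcher = new MicroBatcher(3, 5 * 60 * 1000L);
        System.out.println(batcher.add("E1", 0L));       // []
        System.out.println(batcher.add("E2", 1_000L));   // []
        System.out.println(batcher.add("E3", 2_000L));   // [E1, E2, E3] (size reached)
        System.out.println(batcher.add("E4", 10_000L));  // []
        System.out.println(batcher.add("E5", 400_000L)); // [E4, E5] (interval elapsed)
    }
}
```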

Typical near-real-time use cases are recommendation problems such as product recommendations for services such as Amazon or video recommendations for services such as YouTube and Netflix.
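The three kinds of processing problems discussed so far differ mainly in their SLA, which can be summarized in a small sketch. ProcessingTypeClassifier is a hypothetical helper that uses the thresholds given in this chapter: up to about 5 minutes for real time, less than 1 hour for near real time, and anything longer for batch. Treating an SLA of exactly 1 hour as batch is an assumption made here, since the text does not cover that boundary.

```java
import java.time.Duration;

// Hypothetical helper that maps a processing SLA to a problem type,
// using the thresholds described in this chapter.
final class ProcessingTypeClassifier {

    enum ProcessingType { REAL_TIME, NEAR_REAL_TIME, BATCH }

    static ProcessingType classify(Duration sla) {
        if (sla.compareTo(Duration.ofMinutes(5)) <= 0) {
            return ProcessingType.REAL_TIME;      // seconds, or at most a few minutes
        }
        if (sla.compareTo(Duration.ofHours(1)) < 0) {
            return ProcessingType.NEAR_REAL_TIME; // slower than real time, under 1 hour
        }
        return ProcessingType.BATCH;              // assumption: exactly 1 hour counts as batch
    }

    public static void main(String[] args) {
        System.out.println(classify(Duration.ofSeconds(30))); // REAL_TIME
        System.out.println(classify(Duration.ofMinutes(15))); // NEAR_REAL_TIME
        System.out.println(classify(Duration.ofDays(1)));     // BATCH
    }
}
```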

Publishing problems

Publishing problems deal with publishing the processed data to different businesses and teams so that data is easily available with proper security and data governance. Since the main goal of the publishing problem is to expose the data to a downstream system or an external application, having extremely robust data security and governance is essential.

Usually, in modern data architectures, data is published in one of three ways:

Sorted data repositories
Web services
Visualizations

Let’s take a closer look at each.

Sorted data repositories

Sorted data repositories is a common term used for various kinds of repositories that are used to store processed data. This is usable and actionable data and can be directly queried by businesses, analytics teams, and other downstream applications for their use cases. They are broadly divided into three types:

Data warehouse
Data lake
Data hub

A data warehouse is a central repository of integrated and structured data that’s mainly used for reporting, data analysis, and Business Intelligence (BI). A data lake consists of structured and unstructured data, which is mainly used for data preparation, reporting, advanced analytics, data science, and Machine Learning (ML). A data hub is the central repository of trusted, governed, and shared data, which enables seamless data sharing between diverse endpoints and connects business applications to analytic structures such as data warehouses and data lakes.

Web services

Another publishing pattern is where data is published as a service, popularly known as Data as a Service (DaaS). This data publishing pattern has many advantages as it enables security, immutability, and governance by design. Nowadays, as cloud technologies and GraphQL are becoming popular, Data as a Service is getting a lot of traction in the industry.

The two popular mechanisms of publishing Data as a Service are as follows:

REST
GraphQL

We will discuss these techniques in detail later in this book.

Visualization

There’s a popular saying: A picture is worth a thousand words. Visualization is a technique by which reports, analytics, and statistics about the data are captured visually in graphs and charts.

Visualization is helpful for businesses and leadership to understand, analyze, and get an overview of the data flowing in their business. This helps a lot in decision-making and business planning.

A few of the most common and popular visualization tools are as follows:

Tableau is a proprietary data visualization tool. This tool comes with multiple source connectors to import data into it and lets you create visualizations quickly and easily using drag-and-drop components such as graphs and charts. You can find out more about this product at https://www.tableau.com/.

Microsoft Power BI is a proprietary tool from Microsoft that allows you to connect to various data sources, collect data from them, and create powerful dashboards and visualizations for BI. While both Tableau and Power BI offer data visualization and BI, Tableau is more suited for seasoned data analysts, while Power BI is useful for non-technical or inexperienced users. Also, Tableau works better with huge volumes of data compared to Power BI. You can find out more about this product at https://powerbi.microsoft.com/.

Elasticsearch-Kibana is an open source tool that has free versions for on-premise installations and paid subscriptions for cloud installations. This tool helps you ingest data from any data source into Elasticsearch and create visualizations and dashboards using Kibana. Elasticsearch is a powerful text-based search engine built on Lucene that not only stores the data but also enables various kinds of data aggregation and analysis (including ML analysis). Kibana is a dashboarding tool that works together with Elasticsearch to create very powerful and useful visualizations. You can find out more about these products at https://www.elastic.co/elastic-stack/.

Important note

A Lucene index is a full-text inverted index. This kind of index is extremely powerful and fast for text-based searches and is the core indexing technology behind most search engines. To build it, Lucene takes all the documents, splits them into words or tokens, and then creates an index entry for each word that points to the documents containing it.

Apache Superset is a completely open source data visualization tool (originally developed at Airbnb). It is a powerful dashboarding tool and is completely free, but its data source connector support is limited, mostly to SQL databases. A few interesting features are its built-in role-based data access, an API for customization, and extensibility to support new visualization plugins. You can find out more about this product at https://superset.apache.org/.
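To make the earlier note about the Lucene index more concrete, here is a toy inverted index. It is a minimal sketch of the idea only: TinyInvertedIndex is a hypothetical class, it does not use the Lucene API, and it ignores everything a real search engine adds, such as stemming, relevance scoring, and on-disk storage. Each token maps to the sorted set of IDs of the documents that contain it.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy inverted index for illustration; this is not the Lucene API.
final class TinyInvertedIndex {
    // token -> IDs of the documents that contain the token
    private final Map<String, Set<Integer>> postings = new TreeMap<>();

    // Splits the text into lowercase tokens and records the document ID for each.
    void add(int docId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Looks up a single word without scanning any document text.
    Set<Integer> search(String word) {
        return postings.getOrDefault(word.toLowerCase(Locale.ROOT), Set.of());
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.add(1, "Real-time stream processing");
        index.add(2, "Batch processing of big data");
        index.add(3, "Stream of events");
        System.out.println(index.search("processing")); // [1, 2]
        System.out.println(index.search("stream"));     // [1, 3]
        System.out.println(index.search("kafka"));      // []
    }
}
```

A search is a single map lookup, which is why this structure stays fast as the number of documents grows.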

While we have briefly discussed a few of the visualization tools available in the market, there are many other visualization tools and competitive alternatives available. Discussing data visualization in more depth is beyond the scope of this book.

So far, we have provided an overview of data engineering and the various types of data engineering problems. In the next section, we will explore what role a Java data architect plays in the data engineering landscape.

Responsibilities and challenges of a Java data architect

Data architects are senior technical leaders who map business requirements to technical requirements, envision technical solutions to solve business problems, and establish data standards and principles. Data architects play a unique role, where they understand both the business and technology. They are like the Janus of business and technology: on the one hand, they can look at, understand, and communicate with the business, and on the other, they do the same with technology. Data architects create processes that are used to plan, specify, enable, create, acquire, maintain, use, archive, retrieve, control, and purge data. According to DAMA’s Data Management Body of Knowledge, a data architect provides a standard common business vocabulary, expresses strategic requirements, outlines high-level integrated designs to meet those requirements, and aligns with the enterprise strategy and related business architecture.

The following diagram shows the cross-cutting concerns that a data architect handles:

Figure 1.6 – Cross-cutting concerns of a data architect

The typical responsibilities of a Java data architect are as follows:

Interpreting business requirements into technical specifications, which includes data storage and integration patterns, databases, platforms, streams, transformations, and the technology stack
Establishing the architectural framework, standards, and principles
Developing and designing reference architectures that are used as patterns that can be followed by others to create and improve data systems
Defining data flows and their governance principles
Recommending the most suitable solutions, along with their technology stacks, while considering scalability, performance, resource availability, and cost
Coordinating and collaborating with multiple departments, stakeholders, partners, and external vendors

In the real world, a data architect is supposed to play a combination of three disparate roles, as shown in the following diagram:

Figure 1.7 – Multifaceted role of a data architect

Let’s look at these three architectural roles in more detail:

Data architectural gatekeeper: An architectural gatekeeper is a person or a role that ensures that the data model follows the necessary standards and that the architecture follows the proper architectural principles. They look for any gaps in terms of the solution or business expectations. Here, a data architect takes the critical role of finding faults or gaps in the product or solution design and delivery (including missing or incomplete best practices in the data model, architecture, implementation techniques, testing procedures, continuous integration/continuous delivery (CI/CD) efforts, or business expectations).

Data advisor: A data advisor is a data architect who focuses more on finding solutions than on finding problems. A data advisor highlights issues, but more importantly, they show an opportunity or propose a solution for them. A data advisor should understand both the technical and the business aspects of a problem and its solution and should be able to advise on how to improve the solution.

Business executive: Apart from the technical roles that a data architect plays, the data architect needs to play an executive role as well. As stated earlier, the data architect is like the Janus of business and technology, so they are expected to be a great communicator and sales executive who can sell their (technical) idea or solution to nontechnical folks. Often, a data architect needs to present elevator pitches to higher leadership to show opportunities and convince them of a solution for business problems. To be successful in this role, a data architect must think like a business executive: What is the ROI? What is in it for me? How much can we save in terms of time and money with this solution or opportunity? Also, a data architect should be concise and articulate in presenting their idea so that it creates immediate interest among the listeners (mostly business executives, clients, or investors).

Let’s understand the difference between a data architect and a data engineer.

Data architect versus data engineer

The data architect and data engineer are related roles. A data architect visualizes, conceptualizes, and creates the blueprint of the data engineering solution and framework, while the data engineer takes the blueprint and implements the solution.

Data architects are responsible for bringing order to the data chaos generated by enormous piles of business data. Each data analytics or data science team requires a data architect who can visualize and design the data framework to create clean, analyzed, managed, formatted, and secure data. This framework can be utilized further by data engineers, data analysts, and data scientists for their work.

Challenges of a data architect

Data architects face a lot of challenges in their day-to-day work. We will focus on the main ones:

Choosing the right architectural pattern
Choosing the best-fit technology stack
Lack of actionable data governance
Recommending and communicating effectively to leadership

Let’s take a closer look.

Choosing the right architectural pattern

A single data engineering problem can be solved in many ways. However, with the ever-evolving expectations of customers and the evolution of new technologies, choosing the correct architectural pattern has become more challenging. What is more interesting is that, with the changing technological landscape, the need for agility and extensibility in architecture has increased manyfold, both to avoid unnecessary costs and to keep the architecture sustainable over time.

Choosing the best-fit technology stack

One of the complex problems that a data architect needs to figure out is the technology stack. Even when you have created a very well-architected solution, whether your solution will fly or flop will depend on the technology stack you are choosing and how you are planning to use it. As more and more tools, technologies, databases, and frameworks are developed, a big challenge remains for data architects to choose an optimum tech stack that can help create a scalable, reliable, and robust solution. Often, a data architect needs to take into account other non-technical factors as well, such as the future growth prediction of the tool, the market availability of skilled resources for those tools, vendor lock-in, cost, and community support options.

Lack of actionable data governance

Data governance is a buzzword in data businesses, but what does it mean? Governance is a broad area that includes both workflows and toolsets to govern data. If either the tools or the workflow process has limitations or is not present, then data governance is incomplete. When we talk about actionable governance, we mean the following elements:

Integrating data governance with all data engineering systems to maintain standard metadata, including traceability of events and logs for a standard timeline
Integrating data governance concerning all the security policies and standards
Role-based and user-based access management policies on all data elements and systems
Adherence to defined metrics that are tracked continually
Integrating data governance and the data architecture
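As one small illustration of the role-based access management element listed above, the following sketch checks whether a role may perform an action on a data element. Everything in it is hypothetical: the roles, the actions, and the policy table are invented for this example, and a real governance setup would load policies from a central system and audit every decision rather than hardcode them.

```java
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

// Hypothetical role-based access check; roles, actions, and policy are invented.
final class RoleBasedAccessSketch {

    enum Role { ANALYST, DATA_ENGINEER, ADMIN }

    enum Action { READ, WRITE, PURGE }

    // Assumed policy table: which actions each role may perform.
    private static final Map<Role, Set<Action>> POLICY = Map.of(
            Role.ANALYST, EnumSet.of(Action.READ),
            Role.DATA_ENGINEER, EnumSet.of(Action.READ, Action.WRITE),
            Role.ADMIN, EnumSet.allOf(Action.class));

    // Denies by default: a role with no policy entry is allowed nothing.
    static boolean isAllowed(Role role, Action action) {
        return POLICY.getOrDefault(role, Set.of()).contains(action);
    }

    public static void main(String[] args) {
        System.out.println(isAllowed(Role.ANALYST, Action.READ));        // true
        System.out.println(isAllowed(Role.ANALYST, Action.WRITE));       // false
        System.out.println(isAllowed(Role.DATA_ENGINEER, Action.PURGE)); // false
        System.out.println(isAllowed(Role.ADMIN, Action.PURGE));         // true
    }
}
```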

Data governance should always be aligned with strategic and organizational goals.

Recommending and communicating effectively to leadership

Creating an optimal architecture and choosing the correct set of tools is a challenging task, but it is never enough unless they are put into practice. One of the hats that a data architect often needs to wear is that of a sales executive who needs to sell their solution to the business executives or upper leadership. These are not usually technical people and they don’t have a lot of time. Data architects, most of whom have strong technical backgrounds, face the daunting task of communicating and selling their idea to these people. To convince them about the opportunity and the idea, a data architect needs to back up the idea with proper decision metrics and information that can align that opportunity to the broader business goals of the organization.

So far, we have seen the role of a data architect and the common problems that they face. In the next section, we will provide an overview of how a data architect mitigates those challenges on a day-to-day basis.

Techniques to mitigate those challenges

In this section, we will discuss how a data architect can mitigate the aforementioned challenges. To understand the mitigation plan, it is important to understand what the life cycle of a data architecture looks like and how a data architect contributes to it. The following diagram shows the life cycle of a data architecture:

Figure 1.8 – Life cycle of a data architecture

The life cycle of a data architecture starts with defining the problem that the business is facing. The problem is mainly identified or reported by business teams or customers. Then, the data architects work closely with the business to define the business requirements. However, in a data engineering landscape, that is not enough. In a lot of cases, there are hidden requirements or anomalies. To mitigate such problems, business analysts team up with data architects to analyze the data and the current state of the system, including any existing solution, the current cost or loss of revenue due to the problem, and the infrastructure where the data resides. This helps refine the business requirements. Once the business requirements are more or less frozen, the data architects map the business requirements to the technical requirements.

Then, the data architect defines the standards and principles of the architecture and determines the priorities of the architecture based on the business need and budget. After that, the data architect creates the most suitable architectures, along with their proposed technology stack. In this phase, the data architects work closely with the data engineers to implement proofs of concept (POCs) and evaluate the proposed solution in terms of feasibility, scalability, and performance.

Finally, the architects recommend solutions based on the evaluation results and architectural priorities defined earlier. The data architects present the proposed solutions to the business. Based on priorities such as cost, timeline, operational cost, and resource availability, feedback is received from the business and clients. It takes a few iterations to solidify and get an agreement on the architecture.
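One simple way to turn evaluation results and architectural priorities into a recommendation is a weighted scorecard. The sketch below is only an illustration of that idea: ArchitectureScorecard is a hypothetical class, and the criteria, weights (which sum to 100), and 1 to 5 scores are invented for the example rather than taken from any real evaluation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical weighted scorecard for comparing candidate architectures.
final class ArchitectureScorecard {

    // Weighted sum of criterion scores; the weights express the architectural priorities.
    static int weightedScore(Map<String, Integer> weights, Map<String, Integer> scores) {
        int total = 0;
        for (Map.Entry<String, Integer> w : weights.entrySet()) {
            total += w.getValue() * scores.getOrDefault(w.getKey(), 0);
        }
        return total;
    }

    // Returns the candidate with the highest weighted score (the first one wins a tie).
    static String recommend(Map<String, Integer> weights,
                            Map<String, Map<String, Integer>> candidates) {
        String best = null;
        int bestScore = Integer.MIN_VALUE;
        for (Map.Entry<String, Map<String, Integer>> c : candidates.entrySet()) {
            int score = weightedScore(weights, c.getValue());
            if (score > bestScore) {
                bestScore = score;
                best = c.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Invented priorities (weights sum to 100) and POC scores on a 1-5 scale.
        Map<String, Integer> weights = Map.of("cost", 40, "scalability", 30, "performance", 30);
        Map<String, Map<String, Integer>> candidates = new LinkedHashMap<>();
        candidates.put("Option A", Map.of("cost", 3, "scalability", 5, "performance", 4));
        candidates.put("Option B", Map.of("cost", 5, "scalability", 3, "performance", 3));
        candidates.forEach((name, scores) ->
                System.out.println(name + " -> " + weightedScore(weights, scores)));
        System.out.println("Recommended: " + recommend(weights, candidates));
    }
}
```

With these invented numbers, Option A scores 390 and Option B scores 380, so Option A is recommended. Changing the weights to reflect different business priorities can change the outcome, which is why the priorities are agreed on before the evaluation.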

Once an agreement has been reached, the solution is implemented. Based on the implementation challenges and particular use cases, the architecture may be revised or tweaked a little. Once an architecture is implemented and goes to production, it enters the maintenance and operations phase. During maintenance and operations, feedback is sometimes provided, which might result in a few architectural improvements and changes, but such changes are rare if the solution is well-architected in the first place.

In the preceding diagram, the blue boxes indicate major involvement from a customer, a green box indicates major involvement from a data architect, a yellow box means a data architect equally shares involvement with another stakeholder, and a gray box means the data architect has the least involvement in that scenario.

Now that we have understood the life cycle of the data architecture and a data architect’s role in various phases, we will focus on how to mitigate those challenges that are faced by a data architect. This book covers how to mitigate those challenges in the following way:

Understanding the business data, its characteristics, and storage options:
    Data and its characteristics were covered earlier in this chapter; they will also be covered partly in Chapter 2, Data Storage and Databases
    Storage options will also be discussed in Chapter 2, Data Storage and Databases

Analyzing and defining the business problem:
    Understanding the various kinds of data engineering problems (covered in this chapter)
    We have provided a step-by-step analysis of how an architect should analyze a business problem, classify, and define it in Chapter 4, ETL Data Load – A Batch-Based Solution to Ingest Data in a Data Warehouse, Chapter 5, Architecting a Batch Processing Pipeline, and Chapter 6, Architecting a Real-Time Processing Pipeline

The challenge of choosing the right architecture. To choose the right architectural pattern, we should be aware of the following:
    The types of data engineering problems and the dimensions of data (we discussed this in this chapter)
    The different types of data and various data storage available (Chapter 2, Data Storage and Databases)
    How to model and design different kinds of data while storing it in a database (Chapter 2, Data Storage and Databases)
    Understanding various architectural patterns for data processing problems (Chapter 7, Core Architectural Design Patterns)
    Understanding the architectural patterns of publishing the data (Section 3, Enabling Data as a Service)

The challenge of choosing the best-fit technology stack and data platform. To choose the correct set of tools, we need to know how to use a tool and when to use what tools we have:
    How to choose the correct database will be discussed in Chapter 2, Data Storage and Databases
    How to choose the correct platform will be discussed in Chapter 3, Identifying the Right Data Platform
    A step-by-step hands-on guide to using different tools in batch processing will be covered in Chapter 4, ETL Data Load – A Batch-Based Solution to Ingest Data in a Data Warehouse, and Chapter 5, Architecting a Batch Processing Pipeline
    A step-by-step guide to architecting real-time stream processing and choosing the correct tools will be covered in Chapter 6, Architecting a Real-Time Processing Pipeline
    The different tools and technologies used in data publishing will be discussed in Chapter 9, Exposing MongoDB Data as a Service, and Chapter 10, Federated and Scalable DaaS with GraphQL

The challenge of creating a design for scalability and performance will be covered in Chapter 11, Measuring Performance and Benchmarking Your Applications. Here, we will discuss the following:
    Performance engineering basics
    The publishing performance benchmark
    Performance optimization and tuning

The challenge of a lack of data governance. Various data governance and security principles and tools will be discussed in Chapter 8, Enabling Data Security and Governance.

The challenge of evaluating architectural solutions and recommending them to leadership. In the final chapter of this book (Chapter 12, Evaluating, Recommending, and Presenting Your Solution), we will use the various concepts that we have learned throughout this book to create actionable data metrics and determine the most optimized solution. Finally, we will discuss techniques that an architect can apply to effectively communicate with business stakeholders, executive leadership, and investors.

In this section, we discussed how this book can help an architect overcome the various challenges they will face and make them more effective in their role. Now, let’s summarize this chapter.

Summary

In this chapter, we learned what data engineering is and looked at a few practical examples of data engineering. Then, we covered the basics of data engineering, including the dimensions of data and the kinds of problems that are solved by data engineers. We also provided a high-level overview of various kinds of processing problems and publishing problems in a data engineering landscape. Then, we discussed the roles and responsibilities of a data architect and the kind of challenges they face. We also briefly covered the way this book will guide you to overcome challenges and dilemmas faced by a data architect and help you become a better Java data architect.

Now that you understand the basic landscape of data engineering and what this book will focus on, in the next chapter, we will walk through various data formats, data storage options, and databases and learn how to choose one for the problem at hand.

2 Data Storage and Databases

In the previous chapter, we understood the foundations of modern data engineering and what architects are supposed to do. We also covered how data is growing at an exponential rate. However, to make use of that data, we need to understand how to store it efficiently and effectively.

Contributors

About the author

About the reviewers

Table of Contents

Preface

Section 1 – Foundation of Data Systems

1 Basics of Modern Data Architecture

Exploring the landscape of data engineering

What is data engineering?

Dimensions of data

Types of data engineering problems

Responsibilities and challenges of a Java data architect

Data architect versus data engineer

Challenges of a data architect

Techniques to mitigate those challenges

Summary

2 Data Storage and Databases

Understanding data types, formats, and encodings

Data types

Data formats

Understanding file, block, and object storage

File storage

Block storage

Object storage

The data lake, data warehouse, and data mart

Data lake

Data warehouse

Data marts

Databases and their types

Relational database

NoSQL database

Data model design considerations

Summary

3 Identifying the Right Data Platform

Technical requirements

Virtualization and containerization platforms

Benefits of virtualization

Containerization

Benefits of containerization

Kubernetes

Hadoop platforms

Hadoop architecture

Cloud platforms

Benefits of cloud computing

Choosing the correct platform

When to choose virtualization versus containerization

When to use big data

Choosing between on-premise versus cloud-based solutions

Choosing between various cloud vendors

Summary

Section 2 – Building Data Processing Pipelines

4 ETL Data Load – A Batch-Based Solution to Ingesting Data in a Data Warehouse

Technical requirements

Understanding the problem and source data

Problem statement

Understanding the source data

Building an effective data model

Relational data warehouse schemas

Evaluation of the schema design

Designing the solution

Implementing and unit testing the solution

Summary

5 Architecting a Batch Processing Pipeline

Technical requirements

Developing the architecture and choosing the right tools

Problem statement

Analyzing the problem

Architecting the solution

Factors that affect your choice of storage

Determining storage based on cost

The cost factor in the processing layer