Big Data on Kubernetes - Neylson Crepalde - E-Book

Description

In today's data-driven world, organizations across different sectors need scalable and efficient solutions for processing large volumes of data. Kubernetes offers an open-source and cost-effective platform for deploying and managing big data tools and workloads, ensuring optimal resource utilization and minimizing operational overhead. If you want to master the art of building and deploying big data solutions using Kubernetes, then this book is for you.
Written by an experienced data specialist, Big Data on Kubernetes takes you through the entire process of developing scalable and resilient data pipelines, with a focus on practical implementation. Starting with the basics, you’ll progress toward learning how to install Docker and run your first containerized applications. You’ll then explore Kubernetes architecture and understand its core components. This knowledge will pave the way for exploring a variety of essential tools for big data processing such as Apache Spark and Apache Airflow. You’ll also learn how to install and configure these tools on Kubernetes clusters. Throughout the book, you’ll gain hands-on experience building a complete big data stack on Kubernetes.
By the end of this Kubernetes book, you’ll be equipped with the skills and knowledge you need to tackle real-world big data challenges with confidence.

You can read the e-book in Legimi apps or in any app that supports the following formats:

EPUB
MOBI

Pages: 344

Publication year: 2024




Big Data on Kubernetes

A practical guide to building efficient and scalable data solutions

Neylson Crepalde

Big Data on Kubernetes

Copyright © 2024 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Group Product Manager: Apeksha Shetty

Publishing Product Manager: Apeksha Shetty

Book Project Manager: Aparna Ravikumar Nair

Senior Editor: Sushma Reddy

Technical Editor: Kavyashree K S

Copy Editor: Safis Editing

Proofreader: Sushma Reddy

Indexer: Subalakshmi Govindhan

Production Designer: Gokul Raj S T

DevRel Marketing Executive: Nivedita Singh

First published: July 2024

Production reference: 1210624

Published by Packt Publishing Ltd.

Grosvenor House

11 St Paul’s Square

Birmingham

B3 1RB, UK

ISBN: 978-1-83546-214-0

www.packtpub.com

To my wife, Sarah, and my son, Joao Pedro, for their love and support. To Silvio Salej Higgins for being a great mentor.

– Neylson Crepalde

Contributors

About the author

Neylson Crepalde is a generative AI strategist at Amazon Web Services (AWS). Before this, Neylson was chief technology officer at A3Data, a consulting business focused on data, analytics, and artificial intelligence. In his time as CTO, he worked with the company’s tech team to build a Big Data architecture on top of Kubernetes that inspired the writing of this book. Neylson holds a PhD in economic sociology, and he was a visiting scholar at the Centre des Sociologies des Organisations at Sciences Po, Paris. Neylson is also a frequent guest speaker at conferences and has taught MBA programs for more than 10 years.

I want to thank all the people who have worked with me in the development of this great architecture, especially Mayla Teixeira and Marcus Oliveira for their outstanding contributions.

About the reviewer

Thariq Mahmood has 16 years of experience in data technology and possesses a strong skill set in Kubernetes, Big Data, data engineering, and DevOps across public cloud, private cloud, and on-premises environments. He has expertise in data warehousing, data modeling, and data security. He actively contributes to projects on Git and has experience setting up batch and streaming pipelines for various production environments, using Databricks, Hadoop, Spark, Flink, and other cloud-native tools from AWS, Azure, and GCP. He has also implemented MLOps and DevSecOps in numerous projects. He currently helps organizations optimize their Big Data infrastructure costs and implement data-lake and one-lake architectures within Kubernetes.

Table of Contents

Preface

Part 1: Docker and Kubernetes

1

Getting Started with Containers

Technical requirements

Container architecture

Installing Docker

Windows

macOS

Linux

Getting started with Docker images

hello-world

NGINX

Julia

Building your own image

Batch processing job

API service

Summary

2

Kubernetes Architecture

Technical requirements

Kubernetes architecture

Control plane

Node components

Pods

Deployments

StatefulSets

Jobs

Services

ClusterIP Service

NodePort Service

LoadBalancer Service

Ingress and Ingress Controller

Gateway

Persistent Volumes

StorageClasses

ConfigMaps and Secrets

ConfigMaps

Secrets

Summary

3

Getting Hands-On with Kubernetes

Technical requirements

Installing kubectl

Deploying a local cluster using Kind

Installing kind

Deploying the cluster

Deploying an AWS EKS cluster

Deploying a Google Cloud GKE cluster

Deploying an Azure AKS cluster

Running your API on Kubernetes

Creating the deployment

Creating a service

Using an ingress to access the API

Running a data processing job in Kubernetes

Summary

Part 2: Big Data Stack

4

The Modern Data Stack

Data architectures

The Lambda architecture

The Kappa architecture

Comparing Lambda and Kappa

Data lake design for big data

Data warehouses

The rise of big data and data lakes

The rise of the data lakehouse

Implementing the lakehouse architecture

Batch ingestion

Storage

Batch processing

Orchestration

Batch serving

Data visualization

Real-time ingestion

Real-time processing

Real-time serving

Real-time data visualization

Summary

5

Big Data Processing with Apache Spark

Technical requirements

Getting started with Spark

Installing Spark locally

Spark architecture

Spark executors

Components of execution

Starting a Spark program

The DataFrame API and the Spark SQL API

Transformations

Actions

Lazy evaluation

Data partitioning

Narrow versus wide transformations

Analyzing the titanic dataset

Working with real data

How Spark performs joins

Joining IMDb tables

Summary

6

Building Pipelines with Apache Airflow

Technical requirements

Getting started with Airflow

Installing Airflow with Astro

Airflow architecture

Airflow’s distributed architecture

Building a data pipeline

Airflow integration with other tools

Summary

7

Apache Kafka for Real-Time Events and Data Ingestion

Technical requirements

Getting started with Kafka

Exploring the Kafka architecture

The PubSub design

How Kafka delivers exactly-once semantics

First producer and consumer

Streaming from a database with Kafka Connect

Real-time data processing with Kafka and Spark

Summary

Part 3: Connecting It All Together

8

Deploying the Big Data Stack on Kubernetes

Technical requirements

Deploying Spark on Kubernetes

Deploying Airflow on Kubernetes

Deploying Kafka on Kubernetes

Summary

9

Data Consumption Layer

Technical requirements

Getting started with SQL query engines

The limitations of traditional data warehouses

The rise of SQL query engines

The architecture of SQL query engines

Deploying Trino in Kubernetes

Connecting DBeaver with Trino

Deploying Elasticsearch in Kubernetes

How Elasticsearch stores, indexes and manages data

Elasticsearch deployment

Summary

10

Building a Big Data Pipeline on Kubernetes

Technical requirements

Checking the deployed tools

Building a batch pipeline

Building the Airflow DAG

Creating SparkApplication jobs

Creating a Glue crawler

Building a real-time pipeline

Deploying Kafka Connect and Elasticsearch

Real-time processing with Spark

Deploying the Elasticsearch sink connector

Summary

11

Generative AI on Kubernetes

Technical requirements

What generative AI is and what it is not

The power of large neural networks

Challenges and limitations

Using Amazon Bedrock to work with foundational models

Building a generative AI application on Kubernetes

Deploying the Streamlit app

Building RAG with Knowledge Bases for Amazon Bedrock

Adjusting the code for RAG retrieval

Building action models with agents

Creating a DynamoDB table

Configuring the agent

Deploying the application on Kubernetes

Summary

12

Where to Go from Here

Important topics for big data in Kubernetes

Kubernetes monitoring and application monitoring

Building a service mesh

Security considerations

Automated scalability

GitOps and CI/CD for Kubernetes

Kubernetes cost control

What about team skills?

Key skills for monitoring

Building a service mesh

Security considerations

Automated scalability

Skills for GitOps and CI/CD

Cost control skills

Summary

Index

Other Books You May Enjoy

Part 1: Docker and Kubernetes

In this part, you will learn about the fundamentals of containerization and Kubernetes. You will start by understanding the basics of containers and how to build and run Docker images. This will provide you with a solid foundation for working with containerized applications. Next, you will dive into the Kubernetes architecture, exploring its components, features, and core concepts such as pods, deployments, and services. With this knowledge, you will be well equipped to navigate the Kubernetes ecosystem. Finally, you will get hands-on experience by deploying local and cloud-based Kubernetes clusters and then deploying applications you built earlier onto these clusters.

This part contains the following chapters:

Chapter 1, Getting Started with Containers

Chapter 2, Kubernetes Architecture

Chapter 3, Kubernetes – Hands On