Most data engineers know that performance bottlenecks in a distributed computing environment can easily undermine the overall efficiency and effectiveness of data engineering tasks. While Python remains a popular choice for data engineering due to its ease of use, Scala shines in scenarios where the performance of distributed data processing is paramount.
This book will teach you how to leverage the Scala programming language on the Spark framework and use the latest cloud technologies to build continuous and triggered data pipelines. You’ll do this by setting up a data engineering environment for local development and scalable distributed cloud deployments, applying data engineering best practices, test-driven development, and CI/CD. You’ll also get to grips with the DataFrame, Dataset, and Spark SQL APIs and how to use them. Data profiling and data quality in Scala are also covered, alongside techniques for orchestrating and performance-tuning your end-to-end pipelines to deliver data to your end users.
By the end of this book, you will be able to build streaming and batch data pipelines using Scala while following software engineering best practices.
You can read the e-book in Legimi apps or any app that supports the following format:
Page count: 310
Publication year: 2024
Data Engineering with Scala and Spark
Build streaming and batch pipelines that process massive amounts of data using Scala
Eric Tome
Rupam Bhattacharjee
David Radford
Copyright © 2024 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Associate Group Product Manager: Kaustubh Manglurkar
Associate Publishing Product Manager: Arindam Majumder
Book Project Manager: Kirti Pisat
Senior Editor: Tiksha Lad
Technical Editor: Kavyashree K S
Copy Editor: Safis Editing
Proofreader: Safis Editing
Indexer: Subalakshmi Govindhan
Production Designer: Alishon Mendonca
DevRel Marketing Coordinator: Nivedita Singh
First published: January 2024
Production reference: 1160124
Published by Packt Publishing Ltd.
Grosvenor House
11 St Paul’s Square
Birmingham
B3 1RB, UK.
ISBN 978-1-80461-258-3
www.packtpub.com
Eric Tome has over 25 years of experience working with data. He has contributed to and led teams that ingested, cleansed, standardized, and prepared data used by business intelligence, data science, and operations teams. He has a background in mathematics and currently works as a senior solutions architect at Databricks, helping customers solve their data and AI challenges.
Rupam Bhattacharjee works as a lead data engineer at IBM. He has architected and developed data pipelines, processing massive structured and unstructured data using Spark and Scala for on-premises Hadoop and K8s clusters on the public cloud. He has a degree in electrical engineering.
David Radford has worked in big data for over 10 years, with a focus on cloud technologies. He led consulting teams for several years, completing a migration from legacy systems to modern data stacks. He holds a master’s degree in computer science and works as a senior solutions architect at Databricks.
Bartosz Konieczny is a freelance data engineer and enthusiast who has been coding for 15+ years. He has held various senior hands-on positions that helped him work on many data engineering problems in batch and stream processing, such as sessionization, data ingestion, data cleansing, ordered data processing, and data migration. He enjoys solving data challenges with public cloud services and open source technologies, especially Apache Spark, Apache Kafka, Apache Airflow, and Delta Lake. In addition, he blogs at waitingforcode.com.
Palanivelrajan is a highly passionate data evangelist with 19.5 years of experience in the data and analytics space. He has rich experience in architecting, developing, and delivering modern data platforms, data lakes, data warehouses, business intelligence, data science, and ML solutions. For the last five years, he has worked in engineering management, and he has 12+ years of experience in data architecture (big data and the cloud). He has built data teams and data practices and has been active in presales, planning, roadmaps, and executions. He has hired, managed, and mentored data engineers, data analysts, data scientists, ML engineers, and data architects. He has worked as a data engineering manager and a data architect for Sigmoid analytics, Nike, the Data Team, and Tata Communications.
In this part, Chapter 1 introduces Scala’s significance in data engineering, emphasizing its type safety and native compatibility with Spark. It covers key concepts such as functional programming, objects, classes, and higher-order functions. Chapter 2 then contrasts two data engineering environments: a cloud-based setup that offers portability and easy access at the cost of ongoing maintenance, and a local-machine option that requires more setup but avoids cloud expenses.
This part has the following chapters:
Chapter 1, Scala Essentials for Data Engineers
Chapter 2, Environment Setup
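To give a flavor of one topic Chapter 1 covers, here is a minimal sketch of a higher-order function in Scala. The names (`applyTwice`, `increment`) are illustrative, not taken from the book:

```scala
object HigherOrderDemo extends App {
  // A higher-order function takes a function as an argument or returns one.
  // applyTwice applies f two times to x: f(f(x)).
  def applyTwice(f: Int => Int, x: Int): Int = f(f(x))

  val increment: Int => Int = _ + 1
  println(applyTwice(increment, 3)) // prints 5

  // Collection methods such as map are also higher-order functions.
  println(List(1, 2, 3).map(_ * 2)) // prints List(2, 4, 6)
}
```

The same pattern underpins the Spark APIs discussed later in the book, where transformations such as `map` and `filter` accept functions as arguments.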