Apache Spark is a unified data analytics engine designed to process huge volumes of data quickly and efficiently. PySpark is Apache Spark's Python language API, which offers Python developers an easy-to-use, scalable data analytics framework.
Essential PySpark for Scalable Data Analytics starts by exploring the distributed computing paradigm and provides a high-level overview of Apache Spark. You'll begin your analytics journey with the data engineering process, learning how to perform data ingestion, cleansing, and integration at scale. This book shows you how to build real-time analytics pipelines that deliver insights faster. You'll then discover methods for building cloud-based data lakes, and explore Delta Lake, which brings reliability to data lakes. The book also covers the Data Lakehouse, an emerging paradigm that combines the structure and performance of a data warehouse with the scalability of cloud-based data lakes. Later, you'll perform scalable data science and machine learning tasks using PySpark, including data preparation, feature engineering, and model training and productionization. Finally, you'll learn ways to scale out standard Python ML libraries, along with a new pandas API on top of PySpark called Koalas.
By the end of this PySpark book, you'll be able to harness the power of PySpark to solve business problems.
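As a quick taste of the API the book teaches, here is a minimal, illustrative PySpark sketch (not taken from the book itself; the sales.csv file and its region and amount columns are hypothetical placeholders). It starts a local SparkSession, ingests a CSV file, and runs a simple aggregation:

# Minimal PySpark example: start a local session, read a CSV, aggregate.
# "sales.csv" and the "region"/"amount" columns are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("essential-pyspark-demo")
         .master("local[*]")  # local mode; point at a cluster URL in production
         .getOrCreate())

df = spark.read.csv("sales.csv", header=True, inferSchema=True)

(df.groupBy("region")
   .agg(F.sum("amount").alias("total_amount"))
   .show())

spark.stop()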
A beginner's guide to harnessing the power and ease of PySpark 3
Sreeram Nudurupati
BIRMINGHAM—MUMBAI
Copyright © 2021 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Publishing Product Manager: Aditi Gour
Senior Editor: Mohammed Yusuf Imaratwale
Content Development Editor: Sean Lobo
Technical Editor: Manikandan Kurup
Copy Editor: Safis Editing
Project Coordinator: Aparna Ravikumar Nair
Proofreader: Safis Editing
Indexer: Sejal Dsilva
Production Designer: Joshua Misquitta
First published: October 2021
Production reference: 1230921
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.
ISBN 978-1-80056-887-7
www.packt.com
Sreeram Nudurupati is a data analytics professional with years of experience in designing and optimizing data analytics pipelines at scale. He has a history of helping enterprises, as well as digital natives, build optimized analytics pipelines by using the knowledge of the organization, infrastructure environment, and current technologies.
Karen J. Yang is a software engineer with computer science training in programming, data engineering, data science, and cloud computing. Her technical skills include Python, Java, Spark, Kafka, Hive, Docker, Kubernetes, CI/CD, Spring Boot, machine learning, data visualization, and cloud computing with AWS, GCP, and Databricks. As an author for Packt Publishing LLC, she has created three online instructional video courses, namely Apache Spark in 7 Days, Time Series Analysis with Python 3.x, and Fundamentals of Statistics and Visualization in Python. In her technical reviewer role, she has reviewed Mastering Big Data Analytics with PySpark, The Applied Data Science Workshop, and, most recently, Essential PySpark for Scalable Data Analytics.
Ayan Putatunda has 11 years of experience working with data-related technologies. He is currently an engineering manager of data engineering at Noodle.ai, based in San Francisco, California, where he has held multiple positions, such as tech lead, principal data engineer, and senior data engineer. He specializes in utilizing SQL and Python for large-scale distributed data processing. Before joining Noodle.ai, he worked at Cognizant for 9 years in countries such as India, Argentina, and the US, where he worked with a wide range of data-related tools and technologies. Ayan holds a bachelor's degree in computer science from India and a master's degree in data science from the University of Illinois at Urbana-Champaign, USA.
This section introduces the uninitiated to the Distributed Computing paradigm and shows how Spark became the de facto standard for big data processing.
Upon completion of this section, you will be able to ingest data from various data sources, cleanse it, integrate it, and write it out to persistent storage, such as a data lake, in a scalable and distributed manner. You will also be able to build real-time analytics pipelines and perform change data capture in a data lake. You will understand the key differences between the ETL and ELT ways of data processing, and how ELT evolved for the cloud-based data lake world. This section also introduces you to Delta Lake, which makes cloud-based data lakes more reliable and performant. You will understand the nuances of the Lambda architecture as a means of performing simultaneous batch and real-time analytics, and how Apache Spark, combined with Delta Lake, greatly simplifies it.
This section includes the following chapters:
Chapter 1, Distributed Computing Primer
Chapter 2, Data Ingestion
Chapter 3, Data Cleansing and Integration
Chapter 4, Real-Time Data Analytics
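To make the pattern these chapters develop concrete, here is a minimal sketch of an ingest-cleanse-persist flow into a Delta Lake table. It is illustrative only: it assumes the delta-spark package is available to the cluster, and all paths are hypothetical.

# Sketch of the ingest -> cleanse -> persist-to-data-lake pattern.
# Assumes the delta-spark package is available; all paths are hypothetical.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("delta-ingest-demo")
         .config("spark.sql.extensions",
                 "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

raw = spark.read.json("/data/raw/events")        # ingest from the raw zone
clean = raw.dropDuplicates().na.drop(how="all")  # basic cleansing
(clean.write
      .format("delta")
      .mode("append")                            # incremental loads
      .save("/data/lake/events"))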