Essential PySpark for Scalable Data Analytics

Sreeram Nudurupati
Description

Apache Spark is a unified data analytics engine designed to process huge volumes of data quickly and efficiently. PySpark is Apache Spark's Python API, which offers Python developers an easy-to-use, scalable data analytics framework.
Essential PySpark for Scalable Data Analytics starts by exploring the distributed computing paradigm and provides a high-level overview of Apache Spark. You'll begin your analytics journey with the data engineering process, learning how to perform data ingestion, cleansing, and integration at scale. The book then shows you how to build real-time analytics pipelines that deliver insights faster. You'll discover methods for building cloud-based data lakes and explore Delta Lake, which brings reliability to data lakes. The book also covers the data lakehouse, an emerging paradigm that combines the structure and performance of a data warehouse with the scalability of cloud-based data lakes. Later, you'll perform scalable data science and machine learning tasks using PySpark, such as data preparation, feature engineering, and model training and productionization. Finally, you'll learn ways to scale out standard Python ML libraries, along with Koalas, a new pandas API on top of PySpark.
By the end of this PySpark book, you'll be able to harness the power of PySpark to solve business problems.
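As a quick taste of that ease of use, here is a minimal, illustrative PySpark snippet (not taken from the book itself); the file path and the column names (retail_sales.csv, country, amount) are hypothetical:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start (or reuse) a Spark session.
spark = SparkSession.builder.appName("essential-pyspark-taster").getOrCreate()

# Read a CSV file into a distributed DataFrame; the schema is inferred for brevity.
sales_df = (spark.read
            .option("header", "true")
            .option("inferSchema", "true")
            .csv("/data/retail_sales.csv"))

# A simple scalable aggregation: total revenue per country, highest first.
revenue_per_country = (sales_df
                       .groupBy("country")
                       .agg(F.sum("amount").alias("total_revenue"))
                       .orderBy(F.desc("total_revenue")))

revenue_per_country.show()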




Essential PySpark for Scalable Data Analytics

A beginner's guide to harnessing the power and ease of PySpark 3

Sreeram Nudurupati

BIRMINGHAM—MUMBAI

Essential PySpark for Scalable Data Analytics

Copyright © 2021 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Publishing Product Manager: Aditi Gour

Senior Editor: Mohammed Yusuf Imaratwale

Content Development Editor: Sean Lobo

Technical Editor: Manikandan Kurup

Copy Editor: Safis Editing

Project Coordinator: Aparna Ravikumar Nair

Proofreader: Safis Editing

Indexer: Sejal Dsilva

Production Designer: Joshua Misquitta

First published: October 2021

Production reference: 1230921

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham

B3 2PB, UK.

ISBN 978-1-80056-887-7

www.packt.com

Contributors

About the author

Sreeram Nudurupati is a data analytics professional with years of experience in designing and optimizing data analytics pipelines at scale. He has a history of helping enterprises, as well as digital natives, build optimized analytics pipelines by applying knowledge of the organization, its infrastructure environment, and current technologies.

About the reviewers

Karen J. Yang is a software engineer with computer science training in programming, data engineering, data science, and cloud computing. Her technical skills include Python, Java, Spark, Kafka, Hive, Docker, Kubernetes, CI/CD, Spring Boot, machine learning, data visualization, and cloud computing with AWS, GCP, and Databricks. As an author for Packt Publishing LLC, she has created three online instructional video courses, namely Apache Spark in 7 Days, Time Series Analysis with Python 3.x, and Fundamentals of Statistics and Visualization in Python. In her technical reviewer role, she has reviewed Mastering Big Data Analytics with PySpark, The Applied Data Science Workshop, and, most recently, Essential PySpark for Scalable Data Analytics.

Ayan Putatunda has 11 years of experience working with data-related technologies. He is currently working as an engineering manager of data engineering at Noodle.ai, based in San Francisco, California. He has held multiple positions, such as tech lead, principal data engineer, and senior data engineer, at Noodle.ai. He specializes in utilizing SQL and Python for large-scale distributed data processing. Before joining Noodle.ai, he worked at Cognizant for 9 years in countries such as India, Argentina, and the US. At Cognizant, he worked with a wide range of data-related tools and technologies. Ayan holds a bachelor's degree in computer science from India and a master's degree in data science from the University of Illinois at Urbana-Champaign, USA.

Table of Contents

Preface

Section 1: Data Engineering

Chapter 1: Distributed Computing Primer

Technical requirements

Distributed Computing

Introduction to Distributed Computing

Data Parallel Processing

Data Parallel Processing using the MapReduce paradigm

Distributed Computing with Apache Spark

Introduction to Apache Spark

Data Parallel Processing with RDDs

Higher-order functions

Apache Spark cluster architecture

Getting started with Spark

Big data processing with Spark SQL and DataFrames

Transforming data with Spark DataFrames

Using SQL on Spark

What's new in Apache Spark 3.0?

Summary

Chapter 2: Data Ingestion

Technical requirements

Introduction to Enterprise Decision Support Systems

Ingesting data from data sources

Ingesting from relational data sources

Ingesting from file-based data sources

Ingesting from message queues

Ingesting data into data sinks

Ingesting into data warehouses

Ingesting into data lakes

Ingesting into NoSQL and in-memory data stores

Using file formats for data storage in data lakes

Unstructured data storage formats

Semi-structured data storage formats

Structured data storage formats

Building data ingestion pipelines in batch and real time

Data ingestion using batch processing

Data ingestion in real time using structured streaming

Unifying batch and real time using Lambda Architecture

Lambda Architecture

The Batch layer

The Speed layer

The Serving layer

Summary

Chapter 3: Data Cleansing and Integration

Technical requirements

Transforming raw data into enriched meaningful data

Extracting, transforming, and loading data

Extracting, loading, and transforming data

Advantages of choosing ELT over ETL

Building analytical data stores using cloud data lakes

Challenges with cloud data lakes

Overcoming data lake challenges with Delta Lake

Consolidating data using data integration

Data consolidation via ETL and data warehousing

Integrating data using data virtualization techniques

Data integration through data federation

Making raw data analytics-ready using data cleansing

Data selection to eliminate redundancies

De-duplicating data

Standardizing data

Optimizing ELT processing performance with data partitioning

Summary

Chapter 4: Real-Time Data Analytics

Technical requirements

Real-time analytics systems architecture

Streaming data sources

Streaming data sinks

Stream processing engines

Real-time data consumers

Real-time analytics industry use cases

Real-time predictive analytics in manufacturing

Connected vehicles in the automotive sector

Financial fraud detection

IT security threat detection

Simplifying the Lambda Architecture using Delta Lake

Change Data Capture

Handling late-arriving data

Stateful stream processing using windowing and watermarking

Multi-hop pipelines

Summary

Section 2: Data Science

Chapter 5: Scalable Machine Learning with PySpark

Technical requirements

ML overview

Types of ML algorithms

Business use cases of ML

Scaling out machine learning

Techniques for scaling ML

Introduction to Apache Spark's ML library

Data wrangling with Apache Spark and MLlib

Data preprocessing

Data cleansing

Data manipulation

Summary

Chapter 6: Feature Engineering – Extraction, Transformation, and Selection

Technical requirements

The machine learning process

Feature extraction

Feature transformation

Transforming categorical variables

Transforming continuous variables

Transforming the date and time variables

Assembling individual features into a feature vector

Feature scaling

Feature selection

Feature store as a central feature repository

Batch inferencing using the offline feature store

Delta Lake as an offline feature store

Structure and metadata with Delta tables

Schema enforcement and evolution with Delta Lake

Support for simultaneous batch and streaming workloads

Delta Lake time travel

Integration with machine learning operations tools

Online feature store for real-time inferencing

Summary

Chapter 7: Supervised Machine Learning

Technical requirements

Introduction to supervised machine learning

Parametric machine learning

Non-parametric machine learning

Regression

Linear regression

Regression using decision trees

Classification

Logistic regression

Classification using decision trees

Naïve Bayes

Support vector machines

Tree ensembles

Regression using random forests

Classification using random forests

Regression using gradient boosted trees

Classification using GBTs

Real-world supervised learning applications

Regression applications

Classification applications

Summary

Chapter 8: Unsupervised Machine Learning

Technical requirements

Introduction to unsupervised machine learning

Clustering using machine learning

K-means clustering

Hierarchical clustering using bisecting K-means

Topic modeling using latent Dirichlet allocation

Gaussian mixture model

Building association rules using machine learning

Collaborative filtering using alternating least squares

Real-world applications of unsupervised learning

Clustering applications

Association rules and collaborative filtering applications

Summary

Chapter 9: Machine Learning Life Cycle Management

Technical requirements

Introduction to the ML life cycle

Introduction to MLflow

Tracking experiments with MLflow

ML model tuning

Tracking model versions using MLflow Model Registry

Model serving and inferencing

Offline model inferencing

Online model inferencing

Continuous delivery for ML

Summary

Chapter 10: Scaling Out Single-Node Machine Learning Using PySpark

Technical requirements

Scaling out EDA

EDA using pandas

EDA using PySpark

Scaling out model inferencing

Model training using embarrassingly parallel computing

Distributed hyperparameter tuning

Scaling out arbitrary Python code using pandas UDF

Upgrading pandas to PySpark using Koalas

Summary

Section 3: Data Analysis

Chapter 11: Data Visualization with PySpark

Technical requirements

Importance of data visualization

Types of data visualization tools

Techniques for visualizing data using PySpark

PySpark native data visualizations

Using Python data visualizations with PySpark

Considerations for PySpark to pandas conversion

Introduction to pandas

Converting from PySpark into pandas

Summary

Chapter 12: Spark SQL Primer

Technical requirements

Introduction to SQL

DDL

DML

Joins and sub-queries

Row-based versus columnar storage

Introduction to Spark SQL

Catalyst optimizer

Spark SQL data sources

Spark SQL language reference

Spark SQL DDL

Spark DML

Optimizing Spark SQL performance

Summary

Chapter 13: Integrating External Tools with Spark SQL

Technical requirements

Apache Spark as a distributed SQL engine

Introduction to Hive Thrift JDBC/ODBC Server

Spark connectivity to SQL analysis tools

Spark connectivity to BI tools

Connecting Python applications to Spark SQL using Pyodbc

Summary

Chapter 14: The Data Lakehouse

Moving from BI to AI

Challenges with data warehouses

Challenges with data lakes

The data lakehouse paradigm

Key requirements of a data lakehouse

Data lakehouse architecture

Examples of existing lakehouse architectures

Apache Spark-based data lakehouse architecture

Advantages of data lakehouses

Summary

Other Books You May Enjoy

Section 1: Data Engineering

This section introduces the uninitiated to the Distributed Computing paradigm and shows how Spark became the de facto standard for big data processing.

Upon completion of this section, you will be able to ingest data from various data sources, cleanse it, integrate it, and write it out to persistent storage such as a data lake in a scalable and distributed manner. You will also be able to build real-time analytics pipelines and perform change data capture in a data lake. You will understand the key differences between the ETL and ELT approaches to data processing, and how ELT evolved for the cloud-based data lake world. This section also introduces you to Delta Lake, which makes cloud-based data lakes more reliable and performant. You will understand the nuances of the Lambda Architecture as a means of performing simultaneous batch and real-time analytics, and how Apache Spark combined with Delta Lake greatly simplifies it.
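The following is a minimal sketch of the batch ingestion pattern this section builds toward: land raw files, apply light cleansing, and persist the result to a Delta Lake table. It is illustrative rather than taken from the book; it assumes a Spark session configured with the delta-spark package, and the storage paths and the event_id column are hypothetical.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("section1-ingestion-sketch").getOrCreate()

# Ingest raw JSON events from a cloud storage landing zone (hypothetical path).
raw_events = spark.read.json("/datalake/landing/events/")

# Light cleansing: drop duplicates and records missing the primary key,
# and stamp each row with its ingestion date.
clean_events = (raw_events
                .dropDuplicates(["event_id"])
                .where(F.col("event_id").isNotNull())
                .withColumn("ingest_date", F.current_date()))

# Persist to a Delta Lake table, partitioned for downstream query performance.
(clean_events.write
 .format("delta")
 .mode("append")
 .partitionBy("ingest_date")
 .save("/datalake/silver/events"))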

This section includes the following chapters:

Chapter 1, Distributed Computing Primer

Chapter 2, Data Ingestion

Chapter 3, Data Cleansing and Integration

Chapter 4, Real-Time Data Analytics