Data Engineering with Python

Work with massive datasets to design data models and automate data pipelines using Python

Paul Crickard

BIRMINGHAM—MUMBAI

Data Engineering with Python

Copyright © 2020 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Commissioning Editor: Sunith Shetty

Acquisition Editor: Reshma Raman

Senior Editor: Roshan Kumar

Content Development Editor: Athikho Sapuni Rishana

Technical Editor: Manikandan Kurup

Copy Editor: Safis Editing

Project Coordinator: Aishwarya Mohan

Proofreader: Safis Editing

Indexer: Tejal Daruwale Soni

Production Designer: Alishon Mendonca

First published: October 2020

Production reference: 1231020

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham

B3 2PB, UK.

ISBN 978-1-83921-418-9

www.packt.com

Packt.com

Subscribe to our online digital library for full access to over 7,000 books and videos, as well as industry-leading tools to help you plan your personal development and advance your career. For more information, please visit our website.

Why subscribe?

Spend less time learning and more time coding with practical eBooks and videos from over 4,000 industry professionals

Improve your learning with Skill Plans built especially for you

Get a free eBook or video every month

Fully searchable for easy access to vital information

Copy and paste, print, and bookmark content

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at packt.com and, as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.

At www.packt.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.

Contributors

About the author

Paul Crickard is the author of Leaflet.js Essentials and co-author of Mastering Geospatial Analysis with Python, and is also the Chief Information Officer at the Second Judicial District Attorney’s Office in Albuquerque, New Mexico.

With a master’s degree in political science and a background in community and regional planning, he brings rigorous social science theory and techniques to technology projects. He has presented at the New Mexico Big Data and Analytics Summit and the ExperienceIT NM Conference. He has given talks on data to the New Mexico Big Data Working Group, Sandia National Labs, and the New Mexico Geographic Information Council.

About the reviewers

Stefan Marwah has enjoyed programming for over ten years, which led him to pursue a bachelor’s degree in computer science at the reputable Monash University. During his time at university, he built a mobile application that detected whether an elderly person had Alzheimer’s disease with the help of natural language processing, speech recognition, and neural networks, which earned him an award from Microsoft. He has experience in both engineering and analytical roles, rooted in his passion for leveraging data and artificial intelligence to make impactful decisions within different organizations. He currently works as a data engineer and also teaches part-time on topics around data science at Step Function Coaching.

Andre Sionek is a data engineer at Gousto, in London. He started his career by founding Polyteck, a free science and technology magazine for university students, but first jumped into the world of data and analytics during an internship in the collections department of a Brazilian bank. He went on to work on credit modeling for a large cosmetics group and for start-ups before moving to London. He regularly teaches data engineering courses, focusing on infrastructure as code and productionization, writes about data on his blog, and competes on Kaggle from time to time.

Miles Obare is a software engineer at Microsoft in the Azure team. He is currently building tools that enable customers to migrate their server workloads to the cloud. He also builds real-time, scalable backend systems and data pipelines for enterprise customers. Formerly, he worked as a data engineer for a financial start-up, where his role involved developing and deploying data pipelines and machine learning models to production. His areas of expertise include distributed systems, computer architecture, and data engineering. He holds a bachelor’s degree in electrical and computer engineering from Jomo Kenyatta University and contributes to open source projects in his free time.

Packt is searching for authors like you

If you’re interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.

Table of Contents

Preface

Section 1: Building Data Pipelines – Extract, Transform, and Load

Chapter 1: What is Data Engineering?

What data engineers do

Required skills and knowledge to be a data engineer

Data engineering versus data science

Data engineering tools

Programming languages

Databases

Data processing engines

Data pipelines

Summary

Chapter 2: Building Our Data Engineering Infrastructure

Installing and configuring Apache NiFi

A quick tour of NiFi

PostgreSQL driver

Installing and configuring Apache Airflow

Installing and configuring Elasticsearch

Installing and configuring Kibana

Installing and configuring PostgreSQL

Installing pgAdmin 4

A tour of pgAdmin 4

Summary

Chapter 3: Reading and Writing Files

Writing and reading files in Python

Writing and reading CSVs

Reading and writing CSVs using pandas DataFrames

Writing JSON with Python

Building data pipelines in Apache Airflow

Handling files using NiFi processors

Working with CSV in NiFi

Working with JSON in NiFi

Summary

Chapter 4: Working with Databases

Inserting and extracting relational data in Python

Inserting data into PostgreSQL

Inserting and extracting NoSQL database data in Python

Installing Elasticsearch

Inserting data into Elasticsearch

Building data pipelines in Apache Airflow

Setting up the Airflow boilerplate

Running the DAG

Handling databases with NiFi processors

Extracting data from PostgreSQL

Running the data pipeline

Summary

Chapter 5: Cleaning, Transforming, and Enriching Data

Performing exploratory data analysis in Python

Downloading the data

Basic data exploration

Handling common data issues using pandas

Drop rows and columns

Creating and modifying columns

Enriching data

Cleaning data using Airflow

Summary

Chapter 6: Building a 311 Data Pipeline

Building the data pipeline

Mapping a data type

Triggering a pipeline

Querying SeeClickFix

Transforming the data for Elasticsearch

Getting every page

Backfilling data

Building a Kibana dashboard

Creating visualizations

Creating a dashboard

Summary

Section 2: Deploying Data Pipelines in Production

Chapter 7: Features of a Production Pipeline

Staging and validating data

Staging data

Validating data with Great Expectations

Building idempotent data pipelines

Building atomic data pipelines

Summary

Chapter 8: Version Control with the NiFi Registry

Installing and configuring the NiFi Registry

Installing the NiFi Registry

Configuring the NiFi Registry

Using the Registry in NiFi

Adding the Registry to NiFi

Versioning your data pipelines

Using git-persistence with the NiFi Registry

Summary

Chapter 9: Monitoring Data Pipelines

Monitoring NiFi using the GUI

Monitoring NiFi with the status bar

Monitoring NiFi with processors

Using Python with the NiFi REST API

Summary

Chapter 10: Deploying Data Pipelines

Finalizing your data pipelines for production

Backpressure

Improving processor groups

Using the NiFi variable registry

Deploying your data pipelines

Using the simplest strategy

Using the middle strategy

Using multiple registries

Summary

Chapter 11: Building a Production Data Pipeline

Creating a test and production environment

Creating the databases

Populating a data lake

Building a production data pipeline

Reading the data lake

Scanning the data lake

Inserting the data into staging

Querying the staging database

Validating the staging data

Insert Warehouse

Deploying a data pipeline in production

Summary

Section 3: Beyond Batch – Building Real-Time Data Pipelines

Chapter 12: Building a Kafka Cluster

Creating ZooKeeper and Kafka clusters

Downloading Kafka and setting up the environment

Configuring ZooKeeper and Kafka

Starting the ZooKeeper and Kafka clusters

Testing the Kafka cluster

Testing the cluster with messages

Summary

Chapter 13: Streaming Data with Apache Kafka

Understanding logs

Understanding how Kafka uses logs

Topics

Kafka producers and consumers

Building data pipelines with Kafka and NiFi

The Kafka producer

The Kafka consumer

Differentiating stream processing from batch processing

Producing and consuming with Python

Writing a Kafka producer in Python

Writing a Kafka consumer in Python

Summary

Chapter 14: Data Processing with Apache Spark

Installing and running Spark

Installing and configuring PySpark

Processing data with PySpark

Spark for data engineering

Summary

Chapter 15: Real-Time Edge Data with MiNiFi, Kafka, and Spark

Setting up MiNiFi

Building a MiNiFi task in NiFi

Summary

Appendix

Building a NiFi cluster

The basics of NiFi clustering

Building a NiFi cluster

Building a distributed data pipeline

Managing the distributed data pipeline

Summary

Other Books You May Enjoy

Preface

Data engineering provides the foundation for data science and analytics and constitutes an important aspect of all businesses. This book will help you to explore various tools and methods that are used to understand the data engineering process using Python.

The book will show you how to tackle challenges commonly faced in different aspects of data engineering. You’ll start with an introduction to the basics of data engineering, along with the technologies and frameworks required to build data pipelines to work with large datasets. You’ll learn how to transform and clean data and perform analytics to get the most out of your data. As you advance, you’ll discover how to work with big data of varying complexity and production databases and build data pipelines. Using real-world examples, you’ll build architectures on which you’ll learn how to deploy data pipelines.

By the end of this Python book, you’ll have gained a clear understanding of data modeling techniques, and will be able to confidently build data engineering pipelines for tracking data, running quality checks, and making necessary changes in production.

Who this book is for

This book is for data analysts, ETL developers, and anyone looking to get started with, or transition to, the field of data engineering, or to refresh their knowledge of data engineering using Python. It will also be useful for students planning to build a career in data engineering and for IT professionals preparing for a transition. No previous knowledge of data engineering is required.

What this book covers

Chapter 1, What is Data Engineering?, defines data engineering. It will introduce you to the skills, roles, and responsibilities of a data engineer. You will also learn how data engineering fits in with other disciplines, such as data science.

Chapter 2, Building Our Data Engineering Infrastructure, explains how to install and configure the tools used throughout this book. You will install two databases – Elasticsearch and PostgreSQL – as well as NiFi, Airflow, Kibana, and, of course, Python.

Chapter 3, Reading and Writing Files, provides an introduction to reading and writing files in Python as well as data pipelines in NiFi. It will focus on Comma-Separated Values (CSV) and JavaScript Object Notation (JSON) files.
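
To give you a feel for the level of the material, reading and writing a CSV with Python’s built-in csv module looks roughly like the following sketch; the filename and rows are illustrative, not the book’s exact code:

import csv

# Write a small CSV file (the filename and rows are placeholders).
with open('data.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['name', 'age'])
    writer.writerow(['Bob', 42])

# Read it back; each row comes out as a list of strings.
with open('data.csv', 'r') as f:
    for row in csv.reader(f):
        print(row)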

Chapter 4, Working with Databases, explains the basics of working with SQL and NoSQL databases. You will query both types of databases and view the results in Python and through the use of NiFi. You will also learn how to read a file and insert it into the databases.
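
As a preview of the relational side, a minimal insert into PostgreSQL using the psycopg2 library might look like the sketch below; the connection details and the users table are placeholders for illustration:

import psycopg2

# Connection details are placeholders; adjust to your PostgreSQL setup.
conn = psycopg2.connect(host='localhost', database='dataengineering',
                        user='postgres', password='postgres')
cur = conn.cursor()

# A parameterized insert; assumes a users(name, age) table exists.
cur.execute('INSERT INTO users (name, age) VALUES (%s, %s)', ('Bob', 42))
conn.commit()

cur.close()
conn.close()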

Chapter 5, Cleaning, Transforming, and Enriching Data, explains how to take the files or database queries and perform basic exploratory data analysis. This analysis will allow you to view common data problems. You will then use Python and NiFi to clean and transform the data with a view to solving those common data problems.
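
The pandas work includes dropping rows and columns and creating new ones, along these lines (a minimal sketch with illustrative column names):

import pandas as pd

df = pd.read_csv('data.csv')

# Drop rows with missing values and remove a column we do not need.
df = df.dropna()
df = df.drop(columns=['unused'])

# Create a new column derived from an existing one.
df['name_upper'] = df['name'].str.upper()

print(df.head())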

Chapter 6, Project – Building a 311 Data Pipeline, sets out a project in which you will build a complete data pipeline. You will learn how to read from an API and use all of the skills acquired from previous chapters. You will clean and transform the data as well as enrich it with additional data. Lastly, you will insert the data into a warehouse and build a dashboard to visualize it.

Chapter 7, Features of a Production Data Pipeline, covers what is needed in a data pipeline to make it ready for production. You will learn about atomic transactions and how to make data pipelines idempotent.

Chapter 8, Version Control Using the NiFi Registry, explains how to version control your data pipelines. You will install and configure the NiFi Registry. You will also learn how to configure the Registry to use GitHub as the source of your NiFi processors.

Chapter 9, Monitoring and Logging Data Pipelines, teaches you the basics of monitoring and logging data pipelines. You will learn about the features of the NiFi GUI for monitoring. You will also learn how to use NiFi processors to log and monitor performance from within your data pipelines. Lastly, you will learn the basics of the NiFi API.
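
As a hint of the API work, the sketch below asks a local NiFi instance for its system diagnostics using the requests library; it assumes NiFi is listening on port 9300, as configured in this book, and the JSON keys may differ slightly between NiFi versions:

import requests

# Assumes NiFi is running locally on port 9300 (see the Conventions section).
r = requests.get('http://localhost:9300/nifi-api/system-diagnostics')

# The response is JSON describing heap usage, threads, and more.
# Key names may vary slightly between NiFi versions.
print(r.json()['systemDiagnostics']['aggregateSnapshot']['usedHeap'])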

Chapter 10, Deploying Your Data Pipelines, proposes a method for building test and production environments for NiFi. You will learn how to move your completed and version-controlled data pipelines into a production environment.

Chapter 11, Project – Building a Production Data Pipeline, explains how to build a production data pipeline. You will use the project from Chapter 6 and add a number of features. You will version control the data pipeline as well as add monitoring and logging features.

Chapter 12, Building an Apache Kafka Cluster, explains how to install and configure a three-node Apache Kafka cluster. You will learn the basics of Kafka – streams, topics, and consumers.

Chapter 13, Streaming Data with Kafka, explains how to write to Kafka topics and consume that data using Python. You will write Python code for both consumers and producers using a third-party Python library.
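
As a preview, the sketch below produces and then consumes a single message using kafka-python, one common third-party library (the broker address and topic name are illustrative):

from kafka import KafkaProducer, KafkaConsumer

# Send one message; assumes a broker at localhost:9092 and a 'users' topic.
producer = KafkaProducer(bootstrap_servers='localhost:9092')
producer.send('users', b'{"name": "Bob"}')
producer.flush()

# Read the topic from the beginning; this loop runs until interrupted.
consumer = KafkaConsumer('users', bootstrap_servers='localhost:9092',
                         auto_offset_reset='earliest')
for message in consumer:
    print(message.value)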

Chapter 14, Data Processing with Apache Spark, walks you through the installation and configuration of a three-node Apache Spark cluster. You will learn how to use Python to manipulate data in Spark. This will be reminiscent of working with pandas DataFrames from Section 1 of this book.
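
A minimal PySpark sketch of the kind of DataFrame work involved (the file and column names are illustrative):

from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession, the entry point for DataFrame work.
spark = SparkSession.builder.appName('DataEngineering').getOrCreate()

# Read a CSV into a Spark DataFrame, much like pandas.read_csv.
df = spark.read.csv('data.csv', header=True, inferSchema=True)
df.select('name').where(df['age'] > 40).show()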

Chapter 15, Project – Real-Time Edge Data – Kafka, Spark, and MiNiFi, introduces MiNiFi, which is a separate project to make NiFi available on low-resource devices such as Internet of Things devices. You will build a data pipeline that sends data from MiNiFi to your NiFi instance.

The Appendix teaches you the basics of clustering with Apache NiFi. You will learn how to distribute data pipelines and some caveats in doing so. You will also learn how to allow data pipelines to run on a single, specified node rather than distributed across the cluster.

To get the most out of this book

You should have a basic understanding of Python. You will not be required to know any existing libraries, just a fundamental understanding of variables, functions, and how to run a program. You should also know the basics of Linux. If you can run a command in the terminal and open new terminal windows, that should be sufficient.

If you are using the digital version of this book, we advise you to type the code yourself or access the code via the GitHub repository (link available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.

Download the example code files

You can download the example code files for this book at https://github.com/PacktPublishing/Data-Engineering-with-Python. In case there’s an update to the code, it will be updated on the existing GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Download the color images

We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: http://www.packtpub.com/sites/default/files/downloads/9781839214189_ColorImages.pdf.

Conventions used

There are a number of text conventions used throughout this book.

Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: “Next, pass the arguments dictionary to DAG().”

A block of code is set as follows:

import datetime as dt
from datetime import timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator

import pandas as pd
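
Building on that block, the arguments dictionary mentioned in the earlier example might then be passed to DAG() along these lines (a minimal sketch with illustrative values):

default_args = {
    'owner': 'paulcrickard',
    'start_date': dt.datetime(2020, 3, 18),
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

# Pass the arguments dictionary to DAG().
dag = DAG('MyDAG', default_args=default_args,
          schedule_interval=timedelta(minutes=5))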

Any command-line input or output is written as follows:

# web properties #

nifi.web.http.port=9300

Bold: Indicates a new term, an important word, or words that you see on screen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: “Click on DAG and select Tree View.”

Tips or important notes

Appear like this.

Get in touch

Feedback from our readers is always welcome.

General feedback: If you have questions about any aspect of this book, mention the book title in the subject of your message and email us at [email protected].

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata, selecting your book, clicking on the Errata Submission Form link, and entering the details.

Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in, and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Reviews

Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!

For more information about Packt, please visit packt.com.

Section 1: Building Data Pipelines – Extract, Transform, and Load

This section will introduce you to the basics of data engineering. In this section, you will learn what data engineering is and how it relates to other similar fields, such as data science. You will cover the basics of working with files and databases in Python and using Apache NiFi. Once you are comfortable with moving data, you will be introduced to the skills required to clean and transform data. The section culminates with the building of a data pipeline to extract 311 data from SeeClickFix, transform it, and load it into another database. Lastly, you will learn the basics of building dashboards with Kibana to visualize the data you have loaded into your database.

This section comprises the following chapters:

Chapter 1, What is Data Engineering?

Chapter 2, Building Our Data Engineering Infrastructure

Chapter 3, Reading and Writing Files

Chapter 4, Working with Databases

Chapter 5, Cleaning, Transforming, and Enriching Data

Chapter 6, Building a 311 Data Pipeline