Data Engineering with Python

Work with massive datasets to design data models and automate data pipelines using Python

Paul Crickard

BIRMINGHAM—MUMBAI

Data Engineering with Python

Copyright © 2020 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Commissioning Editor: Sunith Shetty

Acquisition Editor: Reshma Raman

Senior Editor: Roshan Kumar

Content Development Editor: Athikho Sapuni Rishana

Technical Editor: Manikandan Kurup

Copy Editor: Safis Editing

Project Coordinator: Aishwarya Mohan

Proofreader: Safis Editing

Indexer: Tejal Daruwale Soni

Production Designer: Alishon Mendonca

First published: October 2020

Production reference: 1231020

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham

B3 2PB, UK.

ISBN 978-1-83921-418-9

www.packt.com

Packt.com

Subscribe to our online digital library for full access to over 7,000 books and videos, as well as industry-leading tools to help you plan your personal development and advance your career. For more information, please visit our website.

Why subscribe?

Spend less time learning and more time coding with practical eBooks and videos from over 4,000 industry professionals

Improve your learning with Skill Plans built especially for you

Get a free eBook or video every month

Fully searchable for easy access to vital information

Copy and paste, print, and bookmark content

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at packt.com and, as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.

At www.packt.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.

Contributors

About the author

Paul Crickard is the author of Leaflet.js Essentials and co-author of Mastering Geospatial Analysis with Python, and is also the Chief Information Officer at the Second Judicial District Attorney’s Office in Albuquerque, New Mexico.

With a master’s degree in political science and a background in community and regional planning, he brings rigorous social science theory and techniques to technology projects. He has presented at the New Mexico Big Data and Analytics Summit and the ExperienceIT NM Conference. He has given talks on data to the New Mexico Big Data Working Group, Sandia National Labs, and the New Mexico Geographic Information Council.

About the reviewers

Stefan Marwah has enjoyed programming for over ten years, which led him to pursue a bachelor’s degree in computer science at the reputable Monash University. During his time at university, he built a mobile application that detected whether an elderly person had Alzheimer’s disease with the help of natural language processing, speech recognition, and neural networks, which earned him an award from Microsoft. He has experience in both engineering and analytical roles, rooted in his passion for leveraging data and artificial intelligence to make impactful decisions within different organizations. He currently works as a data engineer and also teaches part-time on topics around data science at Step Function Coaching.

Andre Sionek is a data engineer at Gousto, in London. He started his career by founding Polyteck, a free science and technology magazine for university students, but first jumped into the world of data and analytics during an internship in the collections department of a Brazilian bank. He went on to work on credit modeling for a large cosmetics group and for start-ups before moving to London. He regularly teaches data engineering courses, focusing on infrastructure as code and productionization, writes about data on his blog, and competes on Kaggle from time to time.

Miles Obare is a software engineer at Microsoft in the Azure team. He is currently building tools that enable customers to migrate their server workloads to the cloud. He also builds real-time, scalable backend systems and data pipelines for enterprise customers. Formerly, he worked as a data engineer for a financial start-up, where his role involved developing and deploying data pipelines and machine learning models to production. His areas of expertise include distributed systems, computer architecture, and data engineering. He holds a bachelor’s degree in electrical and computer engineering from Jomo Kenyatta University and contributes to open source projects in his free time.

Packt is searching for authors like you

If you’re interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.

Table of Contents

Preface

Section 1: Building Data Pipelines – Extract, Transform, and Load

Chapter 1: What is Data Engineering?

What data engineers do

Required skills and knowledge to be a data engineer

Data engineering versus data science

Data engineering tools

Programming languages

Databases

Data processing engines

Data pipelines

Summary

Chapter 2: Building Our Data Engineering Infrastructure

Installing and configuring Apache NiFi

A quick tour of NiFi

PostgreSQL driver

Installing and configuring Apache Airflow

Installing and configuring Elasticsearch

Installing and configuring Kibana

Installing and configuring PostgreSQL

Installing pgAdmin 4

A tour of pgAdmin 4

Summary

Chapter 3: Reading and Writing Files

Writing and reading files in Python

Writing and reading CSVs

Reading and writing CSVs using pandas DataFrames

Writing JSON with Python

Building data pipelines in Apache Airflow

Handling files using NiFi processors

Working with CSV in NiFi

Working with JSON in NiFi

Summary

Chapter 4: Working with Databases

Inserting and extracting relational data in Python

Inserting data into PostgreSQL

Inserting and extracting NoSQL database data in Python

Installing Elasticsearch

Inserting data into Elasticsearch

Building data pipelines in Apache Airflow

Setting up the Airflow boilerplate

Running the DAG

Handling databases with NiFi processors

Extracting data from PostgreSQL

Running the data pipeline

Summary

Chapter 5: Cleaning, Transforming, and Enriching Data

Performing exploratory data analysis in Python

Downloading the data

Basic data exploration

Handling common data issues using pandas

Drop rows and columns

Creating and modifying columns

Enriching data

Cleaning data using Airflow

Summary

Chapter 6: Building a 311 Data Pipeline

Building the data pipeline

Mapping a data type

Triggering a pipeline

Querying SeeClickFix

Transforming the data for Elasticsearch

Getting every page

Backfilling data

Building a Kibana dashboard

Creating visualizations

Creating a dashboard

Summary

Section 2: Deploying Data Pipelines in Production

Chapter 7: Features of a Production Pipeline

Staging and validating data

Staging data

Validating data with Great Expectations

Building idempotent data pipelines

Building atomic data pipelines

Summary

Chapter 8: Version Control with the NiFi Registry

Installing and configuring the NiFi Registry

Installing the NiFi Registry

Configuring the NiFi Registry

Using the Registry in NiFi

Adding the Registry to NiFi

Versioning your data pipelines

Using git-persistence with the NiFi Registry

Summary

Chapter 9: Monitoring Data Pipelines

Monitoring NiFi using the GUI

Monitoring NiFi with the status bar

Monitoring NiFi with processors

Using Python with the NiFi REST API

Summary

Chapter 10: Deploying Data Pipelines

Finalizing your data pipelines for production

Backpressure

Improving processor groups

Using the NiFi variable registry

Deploying your data pipelines

Using the simplest strategy

Using the middle strategy

Using multiple registries

Summary

Chapter 11: Building a Production Data Pipeline

Creating a test and production environment

Creating the databases

Populating a data lake

Building a production data pipeline

Reading the data lake

Scanning the data lake

Inserting the data into staging

Querying the staging database

Validating the staging data

Insert Warehouse

Deploying a data pipeline in production

Summary

Section 3: Beyond Batch – Building Real-Time Data Pipelines

Chapter 12: Building a Kafka Cluster

Creating ZooKeeper and Kafka clusters

Downloading Kafka and setting up the environment

Configuring ZooKeeper and Kafka

Starting the ZooKeeper and Kafka clusters

Testing the Kafka cluster

Testing the cluster with messages

Summary

Chapter 13: Streaming Data with Apache Kafka

Understanding logs

Understanding how Kafka uses logs

Topics

Kafka producers and consumers

Building data pipelines with Kafka and NiFi

The Kafka producer

The Kafka consumer

Differentiating stream processing from batch processing

Producing and consuming with Python

Writing a Kafka producer in Python

Writing a Kafka consumer in Python

Summary

Chapter 14: Data Processing with Apache Spark

Installing and running Spark

Installing and configuring PySpark

Processing data with PySpark

Spark for data engineering

Summary

Chapter 15: Real-Time Edge Data with MiNiFi, Kafka, and Spark

Setting up MiNiFi

Building a MiNiFi task in NiFi

Summary

Appendix

Building a NiFi cluster

The basics of NiFi clustering

Building a NiFi cluster

Building a distributed data pipeline

Managing the distributed data pipeline

Summary

Other Books You May Enjoy

Preface

Data engineering provides the foundation for data science and analytics and constitutes an important aspect of all businesses. This book will help you to explore various tools and methods that are used to understand the data engineering process using Python.

The book will show you how to tackle challenges commonly faced in different aspects of data engineering. You’ll start with an introduction to the basics of data engineering, along with the technologies and frameworks required to build data pipelines to work with large datasets. You’ll learn how to transform and clean data and perform analytics to get the most out of your data. As you advance, you’ll discover how to work with big data of varying complexity and production databases and build data pipelines. Using real-world examples, you’ll build architectures on which you’ll learn how to deploy data pipelines.

By the end of this Python book, you’ll have gained a clear understanding of data modeling techniques, and will be able to confidently build data engineering pipelines for tracking data, running quality checks, and making necessary changes in production.

Who this book is for

This book is for data analysts, ETL developers, and anyone looking to get started with, or transition to, the field of data engineering, or to refresh their knowledge of data engineering using Python. It will also be useful for students planning to build a career in data engineering and for IT professionals preparing for a transition. No previous knowledge of data engineering is required.

What this book covers

Chapter 1, What is Data Engineering?, defines data engineering. It will introduce you to the skills, roles, and responsibilities of a data engineer. You will also learn how data engineering fits in with other disciplines, such as data science.

Chapter 2, Building Our Data Engineering Infrastructure, explains how to install and configure the tools used throughout this book. You will install two databases – Elasticsearch and PostgreSQL – as well as NiFi, Airflow, Kibana, and, of course, Python.

Chapter 3, Reading and Writing Files, provides an introduction to reading and writing files in Python as well as data pipelines in NiFi. It will focus on Comma-Separated Values (CSV) and JavaScript Object Notation (JSON) files.
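
To give you a feel for the level of the material, reading and writing a CSV with Python’s built-in csv module looks roughly like the following sketch; the filename and rows are illustrative, not the book’s exact code:

import csv

# Write a small CSV file (the filename and rows are placeholders).
with open('data.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['name', 'age'])
    writer.writerow(['Bob', 42])

# Read it back; each row comes out as a list of strings.
with open('data.csv', 'r') as f:
    for row in csv.reader(f):
        print(row)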

Chapter 4, Working with Databases, explains the basics of working with SQL and NoSQL databases. You will query both types of databases and view the results in Python and through the use of NiFi. You will also learn how to read a file and insert it into the databases.
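
As a preview of the relational side, a minimal insert into PostgreSQL using the psycopg2 library might look like the sketch below; the connection details and the users table are placeholders for illustration:

import psycopg2

# Connection details are placeholders; adjust to your PostgreSQL setup.
conn = psycopg2.connect(host='localhost', database='dataengineering',
                        user='postgres', password='postgres')
cur = conn.cursor()

# A parameterized insert; assumes a users(name, age) table exists.
cur.execute('INSERT INTO users (name, age) VALUES (%s, %s)', ('Bob', 42))
conn.commit()

cur.close()
conn.close()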

Chapter 5, Cleaning, Transforming, and Enriching Data, explains how to take the files or database queries and perform basic exploratory data analysis. This analysis will allow you to view common data problems. You will then use Python and NiFi to clean and transform the data with a view to solving those common data problems.
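
The pandas work includes dropping rows and columns and creating new ones, along these lines (a minimal sketch with illustrative column names):

import pandas as pd

df = pd.read_csv('data.csv')

# Drop rows with missing values and remove a column we do not need.
df = df.dropna()
df = df.drop(columns=['unused'])

# Create a new column derived from an existing one.
df['name_upper'] = df['name'].str.upper()

print(df.head())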

Chapter 6, Project – Building a 311 Data Pipeline, sets out a project in which you will build a complete data pipeline. You will learn how to read from an API and use all of the skills acquired from previous chapters. You will clean and transform the data as well as enrich it with additional data. Lastly, you will insert the data into a warehouse and build a dashboard to visualize it.

Chapter 7, Features of a Production Data Pipeline, covers what is needed in a data pipeline to make it ready for production. You will learn about atomic transactions and how to make data pipelines idempotent.

Chapter 8, Version Control Using the NiFi Registry, explains how to version control your data pipelines. You will install and configure the NiFi Registry. You will also learn how to configure the Registry to use GitHub as the source of your NiFi processors.

Chapter 9, Monitoring and Logging Data Pipelines, teaches you the basics of monitoring and logging data pipelines. You will learn about the features of the NiFi GUI for monitoring. You will also learn how to use NiFi processors to log and monitor performance from within your data pipelines. Lastly, you will learn the basics of the NiFi API.
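
As a hint of the API work, the sketch below asks a local NiFi instance for its system diagnostics using the requests library; it assumes NiFi is listening on port 9300, as configured in this book, and the JSON keys may differ slightly between NiFi versions:

import requests

# Assumes NiFi is running locally on port 9300 (see the Conventions section).
r = requests.get('http://localhost:9300/nifi-api/system-diagnostics')

# The response is JSON describing heap usage, threads, and more.
# Key names may vary slightly between NiFi versions.
print(r.json()['systemDiagnostics']['aggregateSnapshot']['usedHeap'])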

Chapter 10, Deploying Your Data Pipelines, proposes a method for building test and production environments for NiFi. You will learn how to move your completed and version-controlled data pipelines into a production environment.

Chapter 11, Project – Building a Production Data Pipeline, explains how to build a production data pipeline. You will use the project from Chapter 6 and add a number of features. You will version control the data pipeline as well as add monitoring and logging features.

Chapter 12, Building an Apache Kafka Cluster, explains how to install and configure a three-node Apache Kafka cluster. You will learn the basics of Kafka – streams, topics, and consumers.

Chapter 13, Streaming Data with Kafka, explains how to write to Kafka topics and consume that data using Python. You will write Python code for both consumers and producers using a third-party Python library.
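
As a preview, the sketch below produces and then consumes a single message using kafka-python, one common third-party library (the broker address and topic name are illustrative):

from kafka import KafkaProducer, KafkaConsumer

# Send one message; assumes a broker at localhost:9092 and a 'users' topic.
producer = KafkaProducer(bootstrap_servers='localhost:9092')
producer.send('users', b'{"name": "Bob"}')
producer.flush()

# Read the topic from the beginning; this loop runs until interrupted.
consumer = KafkaConsumer('users', bootstrap_servers='localhost:9092',
                         auto_offset_reset='earliest')
for message in consumer:
    print(message.value)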

Chapter 14, Data Processing with Apache Spark, walks you through the installation and configuration of a three-node Apache Spark cluster. You will learn how to use Python to manipulate data in Spark. This will be reminiscent of working with pandas DataFrames from Section 1 of this book.
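
A minimal PySpark sketch of the kind of DataFrame work involved (the file and column names are illustrative):

from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession, the entry point for DataFrame work.
spark = SparkSession.builder.appName('DataEngineering').getOrCreate()

# Read a CSV into a Spark DataFrame, much like pandas.read_csv.
df = spark.read.csv('data.csv', header=True, inferSchema=True)
df.select('name').where(df['age'] > 40).show()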

Chapter 15, Project – Real-Time Edge Data – Kafka, Spark, and MiNiFi, introduces MiNiFi, which is a separate project to make NiFi available on low-resource devices such as Internet of Things devices. You will build a data pipeline that sends data from MiNiFi to your NiFi instance.

The Appendix teaches you the basics of clustering with Apache NiFi. You will learn how to distribute data pipelines and some caveats in doing so. You will also learn how to allow data pipelines to run on a single, specified node rather than distributed across the cluster.

To get the most out of this book

You should have a basic understanding of Python. You will not be required to know any existing libraries, just a fundamental understanding of variables, functions, and how to run a program. You should also know the basics of Linux. If you can run a command in the terminal and open new terminal windows, that should be sufficient.

If you are using the digital version of this book, we advise you to type the code yourself or access the code via the GitHub repository (link available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.

Download the example code files

You can download the example code files for this book at https://github.com/PacktPublishing/Data-Engineering-with-Python. In case there’s an update to the code, it will be updated on the existing GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Download the color images

We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: http://www.packtpub.com/sites/default/files/downloads/9781839214189_ColorImages.pdf.

Conventions used

There are a number of text conventions used throughout this book.

Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: “Next, pass the arguments dictionary to DAG().”

A block of code is set as follows:

import datetime as dt
from datetime import timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator

import pandas as pd
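
Building on that block, the arguments dictionary mentioned in the earlier example might then be passed to DAG() along these lines (a minimal sketch with illustrative values):

default_args = {
    'owner': 'paulcrickard',
    'start_date': dt.datetime(2020, 3, 18),
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

# Pass the arguments dictionary to DAG().
dag = DAG('MyDAG', default_args=default_args,
          schedule_interval=timedelta(minutes=5))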

Any command-line input or output is written as follows:

# web properties #

nifi.web.http.port=9300

Bold: Indicates a new term, an important word, or words that you see on screen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: “Click on DAG and select Tree View.”

Tips or important notes

Appear like this.

Get in touch

Feedback from our readers is always welcome.

General feedback: If you have questions about any aspect of this book, mention the book title in the subject of your message and email us at [email protected].

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata, selecting your book, clicking on the Errata Submission Form link, and entering the details.

Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in, and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Reviews

Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!

For more information about Packt, please visit packt.com.

Section 1: Building Data Pipelines – Extract, Transform, and Load

This section will introduce you to the basics of data engineering. In this section, you will learn what data engineering is and how it relates to other similar fields, such as data science. You will cover the basics of working with files and databases in Python and using Apache NiFi. Once you are comfortable with moving data, you will be introduced to the skills required to clean and transform data. The section culminates with the building of a data pipeline to extract 311 data from SeeClickFix, transform it, and load it into another database. Lastly, you will learn the basics of building dashboards with Kibana to visualize the data you have loaded into your database.

This section comprises the following chapters:

Chapter 1, What is Data Engineering?

Chapter 2, Building Our Data Engineering Infrastructure

Chapter 3, Reading and Writing Files

Chapter 4, Working with Databases

Chapter 5, Cleaning, Transforming, and Enriching Data

Chapter 6, Building a 311 Data Pipeline