Description

Data wrangling is the process of cleaning, transforming, and organizing raw, messy, or unstructured data into a structured format. It involves processes such as data cleaning, data integration, data transformation, and data enrichment to ensure that the data is accurate, consistent, and suitable for analysis. Data Wrangling on AWS equips you with the knowledge to reap the full potential of AWS data wrangling tools.
First, you’ll be introduced to data wrangling on AWS and will be familiarized with the data wrangling services available in AWS. You’ll understand how to work with AWS Glue DataBrew, AWS SDK for pandas, and Amazon SageMaker. Next, you’ll discover other AWS services such as Amazon S3, Redshift, Athena, and QuickSight. Additionally, you’ll explore advanced topics such as performing pandas data operations with AWS SDK for pandas, optimizing ML data with SageMaker Data Wrangler, and building a data warehouse with Glue DataBrew, along with security and monitoring aspects.
By the end of this book, you’ll be well-equipped to perform data wrangling using AWS services.




Data Wrangling on AWS

Clean and organize complex data for analysis

Navnit Shukla

Sankar M

Sam Palani

BIRMINGHAM—MUMBAI

Data Wrangling on AWS

Copyright © 2023 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Publishing Product Manager: Heramb Bhavsar

Book Project Manager: Kirti Pisat

Content Development Editor: Joseph Sunil

Technical Editor: Sweety Pagaria

Copy Editor: Safis Editing

Proofreader: Safis Editing

Indexer: Sejal Dsilva

Production Designer: Shyam Sundar Korumilli

DevRel Marketing Coordinator: Nivedita Singh

First published: July 2023

Production reference: 1280723

Published by Packt Publishing Ltd.

Grosvenor House

11 St Paul’s Square

Birmingham

B3 1RB, UK.

ISBN 978-1-80181-090-6

www.packtpub.com

I am grateful to my grandmother, Radhika Shukla, my mother, Asha Shukla, my grandfather, I.D. Shukla, and my father, N.K. Shukla, for their unwavering support and the sacrifices they have made. They have been a constant source of inspiration and have shown me the true power of determination. I would also like to express my heartfelt gratitude to my loving wife, Anchal Dubey, and my son, Anav Shukla, for being my steadfast companions on our shared life journey. Their love and encouragement have fueled my ambitions and brought immense joy to my life.

- Navnit Shukla

I am thankful to my wife, Krithiha Kumar, for her unwavering support and motivation in all my endeavors. I would also like to thank my son and daughter for their love, curiosity, creativity, and boundless energy, which have inspired me every day to explore further and dream big.

- Sankar Sundaram

Contributors

About the authors

Navnit Shukla is a passionate individual with a love for technology. He pursued engineering studies in electronics and communication at Dr. A.P.J. Abdul Kalam Technical University in Lucknow (formerly U.P.T.U), setting the stage for his future endeavors. At the age of 25, he ventured across the oceans to the United States, where he now resides in the beautiful Orange County, California, alongside his loving wife and son. Immersed in the world of IT since 2009, he has gained extensive experience and knowledge within the industry. For the past four years, he has worked with Amazon Web Services (AWS) as a specialist solution architect – analytics. In this role, he is entrusted with the responsibility of assisting organizations of all sizes in architecting, building, and seamlessly operating solutions on the dynamic AWS Cloud platform. Beyond his professional pursuits, he finds immense joy in spending quality time with his family and friends, exploring new places, creating precious memories, and embarking on exciting adventures.

I am deeply grateful to those who have been by my side, offering unwavering support throughout my journey. I extend a heartfelt thank you to my beloved wife, Anchal Dubey, and my devoted parents, whose encouragement and belief in me have been invaluable.

Sankar M has been working in the IT industry since 2007, specializing in databases, data warehouses, and the analytics space for many years. As a specialized data architect, he helps customers build and modernize data architectures and deliver secure, scalable, and performant data lake, database, and data warehouse solutions. Prior to joining AWS, he worked with multiple customers on implementing complex data architectures.

Sam Palani has over 18 years of experience as a developer, data engineer, data scientist, startup co-founder, and IT leader. He holds a master’s in Business Administration with a dual specialization in Information Technology. His professional career spans five countries across the financial services, management consulting, and technology industries. He is currently a senior leader for Machine Learning and AI at Amazon Web Services, where he is responsible for multiple lines of business, product strategy, and thought leadership. Sam is also a practicing data scientist, a writer with multiple publications, a speaker at key industry conferences, and an active open source contributor. Outside work, he loves hiking, photography, experimenting with food, and reading.

About the reviewers

Naresh Rohra is an accomplished data modeler and lead analyst with a passion for harnessing the potential of cutting-edge technologies to unlock valuable insights from complex datasets. With over a decade of experience in the field of data modeling, he has successfully navigated the dynamic landscape of data wrangling and analysis, earning him a reputation as a leading expert. He has expertly designed various data models related to OLTP and OLAP systems. He has worked with renowned organizations like TCS, Cognizant, and Tech Mahindra.

I extend heartfelt thanks to my colleagues and family for their constant encouragement throughout my book review journey. To my beloved daughter, Ginisha, you are the luminescent star that adorns my night sky.

Roja Boina is an engineering senior advisor. She provides data-driven solutions to enterprises. She takes pride in her get-it-done attitude. She enjoys defining requirements, designing, developing, testing, and delivering backend applications, and communicating data through visualizations. She is very passionate about being a Women in Tech (WIT) advocate and being a part of WIT communities. Outside of her 9-5 job, she loves to volunteer for non-profits and mentor fellow women in STEM. She believes in having a growth mindset.

Table of Contents

Preface

Part 1: Unleashing Data Wrangling with AWS

1

Getting Started with Data Wrangling

Introducing data wrangling

The 80-20 rule of data analysis

Advantages of data wrangling

The steps involved in data wrangling

Data discovery

Data structuring

Data cleaning

Data enrichment

Data validation

Data publishing

Best practices for data wrangling

Identifying the business use case

Identifying the data source and bringing the right data

Identifying your audience

Options available for data wrangling on AWS

AWS Glue DataBrew

SageMaker Data Wrangler

AWS SDK for pandas

Summary

Part 2: Data Wrangling with AWS Tools

2

Introduction to AWS Glue DataBrew

Why AWS Glue DataBrew?

AWS Glue DataBrew’s basic building blocks

Getting started with AWS Glue DataBrew

Understanding the pricing of AWS Glue DataBrew

Using AWS Glue DataBrew for data wrangling

Identifying the dataset

Downloading the sample dataset

Data discovery – creating an AWS Glue DataBrew profile for a dataset

Data cleaning and enrichment – AWS Glue DataBrew transforms

Data validation – performing data quality checks using AWS Glue DataBrew

Data publication – fixing data quality issues

Event-driven data quality check using Glue DataBrew

Data protection with AWS Glue DataBrew

Encryption at rest

Encryption in transit

Identifying and handling PII

Data lineage and data publication

Summary

3

Introducing AWS SDK for pandas

AWS SDK for pandas

Building blocks of AWS SDK for pandas

Arrow

pandas

Boto3

Customizing, building, and installing AWS SDK for pandas for different use cases

Standard and custom installation on your local machine or Amazon EC2

Standard and custom installation with Lambda functions

Standard and custom installation for AWS Glue jobs

Standard and custom installation on Amazon SageMaker notebooks

Configuration options for AWS SDK for pandas

Setting up global variables

Common use cases for configuring

The features of AWS SDK for pandas with different AWS services

Amazon S3

Amazon Athena

RDS databases

Redshift

Summary

4

Introduction to SageMaker Data Wrangler

Data import

Data orchestration

Data transformation

Insights and data quality

Data analysis

Data export

SageMaker Studio setup prerequisites

Prerequisites

Studio domain

Studio onboarding steps

Summary

Part 3: AWS Data Management and Analysis

5

Working with Amazon S3

What is big data?

5 Vs of big data

What is a data lake?

Building a data lake on Amazon S3

Advantages of building a data lake on Amazon S3

Design principles to design a data lake on Amazon S3

Data lake layouts

Organizing and structuring data within an Amazon S3 data lake

Process of building a data lake on Amazon S3

Selecting the right file format for a data lake

Selecting the right compression method for a data lake

Choosing the right partitioning strategy for a data lake

Configuring Amazon S3 Lifecycle for a data lake

Optimizing the number of files and the size of each file

Challenges and considerations when building a data lake on Amazon S3

Summary

6

Working with AWS Glue

What is Apache Spark?

Apache Spark architecture

Apache Spark framework

Resilient Distributed Datasets

Datasets and DataFrames

Data discovery with AWS Glue

AWS Glue Data Catalog

Glue Connections

AWS Glue crawlers

Table stats

Data ingestion using AWS Glue ETL

AWS GlueContext

DynamicFrame

AWS Glue Job bookmarks

AWS Glue Triggers

AWS Glue interactive sessions

AWS Glue Studio

Ingesting data from object stores

Summary

7

Working with Athena

Understanding Amazon Athena

When to use SQL/Spark analysis options?

Advanced data discovery and data structuring with Athena

SQL-based data discovery with Athena

Using CTAS for data structuring

Enriching data from multiple sources using Athena

Enriching data using Athena SQL joins

Setting up data federation for source databases

Enriching data with data federation

Setting up a serverless data quality pipeline with Athena

Implementing data quality rules in Athena

Amazon DynamoDB as a metadata store for data quality pipelines

Serverless data quality pipeline

Automating the data quality pipeline

Summary

8

Working with QuickSight

Introducing Amazon QuickSight and its concepts

Data discovery with QuickSight

QuickSight-supported data sources and setup

Data discovery with QuickSight analysis

QuickSight Q and AI-based data analysis/discovery

Data visualization with QuickSight

Visualization and charts with QuickSight

Embedded analytics

Summary

Part 4: Advanced Data Manipulation and ML Data Optimization

9

Building an End-to-End Data-Wrangling Pipeline with AWS SDK for Pandas

A solution walkthrough for sportstickets.com

Prerequisites for data ingestion

When would you use them?

Loading sample data into a source database

Data discovery

Exploring data using S3 Select commands

Access through Amazon Athena and the Glue Catalog

Data structuring

Different file formats and when to use them

Restructuring data using Pandas

Flattening nested data with Pandas

Data cleaning

Data cleansing with Pandas

Data enrichment

Pandas operations for data transformation

Data quality validation

Data quality validation with Pandas

Data quality validation integration with a data pipeline

Data visualization

Visualization with Python libraries

Summary

10

Data Processing for Machine Learning with SageMaker Data Wrangler

Technical requirements

Step 1 – logging in to SageMaker Studio

Step 2 – importing data

Exploratory data analysis

Built-in data insights

Step 3 – creating data analysis

Step 4 – adding transformations

Categorical encoding

Custom transformation

Numeric scaling

Dropping columns

Step 5 – exporting data

Training a machine learning model

Summary

Part 5: Ensuring Data Lake Security and Monitoring

11

Data Lake Security and Monitoring

Data lake security

Data lake access control

Additional options to control data lake access

AWS Lake Formation integration

Data protection

Securing your data in AWS Glue

Monitoring and auditing

Amazon CloudWatch

Monitoring an AWS Glue job using AWS Glue ETL job monitoring

Amazon CloudTrail

Summary

Index

Other Books You May Enjoy

Part 1: Unleashing Data Wrangling with AWS

This section marks the beginning of an exciting journey, where we explore the realm of data wrangling and unveil the remarkable capabilities of AWS for data manipulation and preparation. This section lays a solid foundation that provides you with insights into key concepts and essential tools that will pave the way for your data-wrangling endeavors throughout the book.

This part has the following chapter:

Chapter 1, Getting Started with Data Wrangling

1

Getting Started with Data Wrangling

In the introductory section of this book, we listed use cases regarding how organizations use data to bring value to customers. Apart from that, organizations collect a lot of other data, such as financial data that helps them understand their customers and share results with stakeholders, log data for security and system health checks, and customer data that is required for use cases such as Customer 360s.

We talked about all these use cases and how collecting data from different data sources is required to solve them. However, between collecting the data and solving these business use cases lies one very important step: cleaning the data. That is where data wrangling comes into the picture.

In this chapter, we are going to learn the basics of data wrangling and cover the following topics:

Introducing data wrangling

The steps involved in data wrangling

Best practices for data wrangling

Options available within Amazon Web Services (AWS) to perform data wrangling

Introducing data wrangling

For organizations to become data-driven, whether to provide value to customers or to make more informed business decisions, they need to collect a lot of data from different data sources, such as clickstreams, log data, transactional systems, and flat files, and store it as raw data in different data stores, such as data lakes, databases, and data warehouses. Once this data is stored, it needs to be cleansed, transformed, organized, and joined across data sources to provide more meaningful information to downstream applications, such as machine learning models that provide product recommendations or look for traffic conditions. Alternatively, it can be used by business or data analysts to extract meaningful business information:

Figure 1.1: Data pipeline

The 80-20 rule of data analysis

When organizations collect data from different data sources, it is not of much use initially. It is estimated that data scientists spend about 80% of their time cleaning data, which means that only about 20% of their time is spent analyzing the data and creating insights, the actual data science work:

Figure 1.2: Work distribution of a data scientist

Now that we understand the basic concept of data wrangling, we’ll learn why it is essential, and the various benefits we get from it.

Advantages of data wrangling

If we go back to the analogy of oil, when we first extract it, it is in the form of crude oil, which is not of much use. To make it useful, it has to go through a refinery, where the crude oil is put in a distillation unit. In this distillation process, the liquids and vapors are separated into petroleum components called fractions according to their boiling points. Heavy fractions are on the bottom while light fractions are on the top, as seen here:

Figure 1.3: Crude oil processing

The following figure showcases how oil processing correlates to the data wrangling process:

Figure 1.4: The data wrangling process

Data wrangling brings many advantages:

Enhanced data quality: Data wrangling helps improve the overall quality of the data. It involves identifying and handling missing values, outliers, inconsistencies, and errors. By addressing these issues, data wrangling ensures that the data used for analysis is accurate and reliable, leading to more robust and trustworthy results.

Improved data consistency: Raw data often comes from various sources or in different formats, resulting in inconsistencies in naming conventions, units of measurement, or data structure. Data wrangling allows you to standardize and harmonize the data, ensuring consistency across the dataset. Consistent data enables easier integration and comparison of information, facilitating effective analysis and interpretation.

Increased data completeness: Incomplete data can pose challenges during analysis and modeling. Data wrangling methods allow you to handle missing data by applying techniques such as imputation, where missing values are estimated or filled in based on existing information. By dealing with missing data appropriately, data wrangling helps ensure a more complete dataset, reducing potential biases and improving the accuracy of analyses.

Facilitates data integration: Organizations often have data spread across multiple systems and sources, making integration a complex task. Data wrangling helps in merging and integrating data from various sources, allowing analysts to work with a unified dataset. This integration facilitates a holistic view of the data, enabling comprehensive analyses and insights that might not be possible when working with fragmented data.

Streamlined data transformation: Data wrangling provides the tools and techniques to transform raw data into a format suitable for analysis. This transformation includes tasks such as data normalization, aggregation, filtering, and reformatting. By streamlining these processes, data wrangling simplifies the data preparation stage, saving time and effort for analysts and enabling them to focus more on the actual analysis and interpretation of the results.

Enables effective feature engineering: Feature engineering involves creating new derived variables or transforming existing variables to improve the performance of machine learning models. Data wrangling provides a foundation for feature engineering by preparing the data in a way that allows for meaningful transformations. By performing tasks such as scaling, encoding categorical variables, or creating interaction terms, data wrangling helps derive informative features that enhance the predictive power of models.

Supports data exploration and visualization: Data wrangling often involves exploratory data analysis (EDA), where analysts gain insights and understand patterns in the data before formal modeling. By cleaning and preparing the data, data wrangling enables effective data exploration, helping analysts uncover relationships, identify trends, and visualize the data using charts, graphs, or other visual representations. These exploratory steps are crucial for forming hypotheses, making data-driven decisions, and communicating insights effectively.
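As a small, self-contained illustration of a few of the techniques mentioned above (mean imputation, one-hot encoding, and min-max scaling), here is a hypothetical pandas sketch; the DataFrame and column names are made up for the example:

```python
import pandas as pd

# Hypothetical sales records with a missing numeric value and a categorical column
df = pd.DataFrame({
    "region": ["South", "North", "South", "West"],
    "units_sold": [120.0, None, 95.0, 140.0],
})

# Imputation: fill the missing value with the column mean
df["units_sold"] = df["units_sold"].fillna(df["units_sold"].mean())

# One-hot encoding: turn the categorical column into indicator variables
df = pd.get_dummies(df, columns=["region"])

# Min-max scaling: bring the numeric column into the 0-1 range
col = df["units_sold"]
df["units_sold_scaled"] = (col - col.min()) / (col.max() - col.min())

print(df)
```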

Now that we have learned about the advantages of data wrangling, let’s understand the steps involved in the data wrangling process.

The steps involved in data wrangling

Similar to crude oil, raw data has to go through multiple data wrangling steps to become meaningful. In this section, we are going to learn the six-step process involved in data wrangling:

Data discovery

Data structuring

Data cleaning

Data enrichment

Data validation

Data publishing

Before we begin, it’s important to understand that these activities don’t always need to be followed sequentially; in some cases, you may skip some of these steps.

Also, keep in mind that these steps are iterative and differ for different personas, such as data analysts, data scientists, and data engineers.

As an example, data discovery for data engineers may vary from what data discovery means for a data analyst or data scientist:

Figure 1.5: The steps of the data-wrangling process

Let’s start learning about these steps in detail.

Data discovery

The first step of the data wrangling process is data discovery. This is one of the most important steps of data wrangling. In data discovery, we familiarize ourselves with the raw data we have: which use case we are looking to solve with it, what kind of relationships exist within it, what the data format looks like (such as CSV or Parquet), what kind of tools are available for storing, transforming, and querying this data, and how we wish to organize it, such as by folder structure, file size, and partitions, to make it easy to access.

Let’s understand this by looking at an example.

In this example, we will try to understand how data discovery varies based on the persona. Let’s assume we have two colleagues, James and Jean. James is a data engineer while Jean is a data analyst, and they both work for a car-selling company.

Jean is new to the organization and she is required to analyze car sales numbers for Southern California. She has reached out to James and asked him for data from the sales table from the production system.

Here is the data discovery process for Jean (a data analyst):

Jean has to identify the data she needs to generate the sales report (for example, sales transaction data, vehicle details data, customer data, and so on).

Jean has to find where the sales data resides (a database, file share, CRM, and so on).

Jean has to identify how much data she needs (from the last 12 months, the last month, and so on).

Jean has to identify what kind of tool she is going to use (Amazon QuickSight, Power BI, and so on).

Jean has to identify the format she needs the data to be in so that it works with the tools she has.

Jean has to identify where she is looking to store this data – in a data lake (Amazon S3), on her desktop, a file share, a sandbox environment, and so on.

Here is the data discovery process for James (a data engineer):

Which system has the requested data? For example, Amazon RDS, Salesforce CRM, a production SFTP location, and so on.

How will the data be extracted? For example, using services such as Amazon DMS or AWS Glue, or by writing a script.

What will the schedule look like? Daily, weekly, or monthly?

What will the file format look like? For example, CSV, Parquet, ORC, and so on.

How will the data be stored in the provided store?

Data structuring

To support existing and future business use cases and serve their customers better, organizations must collect unprecedented amounts of data from different data sources and in different varieties. In modern data architecture, most of the time, the data is stored in data lakes, since a data lake allows you to store all kinds of data files, whether structured data, unstructured data, images, audio, video, or something else, and the data will be of different shapes and sizes in its raw form. When data is in its raw form, it lacks the definitive structure required for it to be stored in databases or data warehouses or used to build analytics or machine learning models. At this point, it is not optimized for cost and performance.

In addition, when you work with streaming data such as clickstreams and log analytics, not all the data fields (columns) are used in analytics.

At this stage of data wrangling, we try to optimize the raw dataset for cost and performance benefits by performing partitioning and converting file types (for example, CSV into Parquet).

Once again, let’s consider our friends James and Jean to understand this.

For Jean, the data analyst, data structuring means that she is looking to run direct queries or store data in a BI tool’s in-memory store, which in the case of Amazon QuickSight is called the SPICE layer and provides faster access to data.

For James, the data engineer, when he is extracting data from a production system and looking to store it in a data lake such as Amazon S3, he must consider what the file format will look like. He can partition it by geographical regions, such as county, state, or region, or by date – for example, year=YYYY, month=MM, and day=DD.
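As a minimal, hypothetical sketch of this kind of structuring with AWS SDK for pandas (awswrangler), the following converts a small extract to Parquet and partitions it by date; the bucket, prefix, and column names are placeholders:

```python
import awswrangler as wr
import pandas as pd

# Hypothetical daily sales extract; the S3 location and columns are placeholders
df = pd.DataFrame({
    "order_id": [1, 2, 3],
    "region": ["Southern California", "Northern California", "Southern California"],
    "amount": [250.0, 125.5, 310.0],
    "year": ["2023", "2023", "2023"],
    "month": ["07", "07", "07"],
    "day": ["15", "15", "16"],
})

# Convert the extract to Parquet and store it in the data lake, partitioned by date
wr.s3.to_parquet(
    df=df,
    path="s3://example-data-lake/raw/sales/",  # placeholder S3 location
    dataset=True,                              # write as a partition-aware dataset
    partition_cols=["year", "month", "day"],
)
```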

Data cleaning

The next step of the data wrangling process is data cleaning. The previous two steps give us an idea of how the data looks and how it is stored. In the data cleaning step, we start working with the raw data to make it meaningful so that it can support future use cases.

In the data cleaning step, we try to make data meaningful by doing the following:

Removing unwanted columns and duplicate values, and filling null-value columns to improve the data’s readiness

Performing data validation, such as identifying missing values for mandatory columns such as First Name, Last Name, SSN, Phone No., and so on

Validating or fixing data types for better optimization of storage and performance

Identifying and fixing outliers

Removing garbage data or unwanted values, such as special characters

Both James and Jean can perform similar data cleaning tasks; however, their scale might vary. For James, these tasks must be done for the entire dataset, while Jean may only have to perform them on the data from Southern California. The granularity might vary as well: for James, it may be limited to regions such as Southern California and Northern California, while for Jean, it might be city level or even ZIP code.
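To make these steps concrete, here is a minimal pandas sketch of a few of them; the DataFrame, columns, and thresholds are hypothetical:

```python
import pandas as pd

# Hypothetical raw sales extract with typical quality problems
df = pd.DataFrame({
    "first_name": ["Amy", "Amy", None, "Raj#"],
    "phone_no": ["555-0100", "555-0100", "555-0199", None],
    "sale_amount": ["250", "250", "1000000", "310"],
    "notes": ["ok", "ok", "n/a", "n/a"],
})

# Remove unwanted columns and duplicate rows
df = df.drop(columns=["notes"]).drop_duplicates()

# Identify missing values in mandatory columns
missing_mandatory = df[["first_name", "phone_no"]].isna().sum()

# Fix data types for better storage and performance
df["sale_amount"] = pd.to_numeric(df["sale_amount"], errors="coerce")

# Identify outliers with a simple interquartile-range rule
q1, q3 = df["sale_amount"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["sale_amount"] < q1 - 1.5 * iqr) | (df["sale_amount"] > q3 + 1.5 * iqr)]

# Remove garbage values such as special characters from text columns
df["first_name"] = df["first_name"].str.replace(r"[^A-Za-z ]", "", regex=True)
```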

Data enrichment

Up until the data cleaning step, we were primarily working on single data sources and making them meaningful for future use. However, in the real world, most of the time, data is fragmented and stored in multiple disparate data stores, and to support use cases such as building personalization or recommendation solutions or building Customer 360s or log forensics, we need to join the data from different data stores.

For example, to build a Customer 360 solution, you need data from Customer Relationship Management (CRM) systems, clickstream logs, relational databases, and so on.

So, in the data enrichment step, we build the process that will enhance the raw data with relevant data obtained from different sources.
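As a simplified illustration of such enrichment, the following pandas sketch joins hypothetical CRM records with aggregated clickstream activity; all names and values are made up:

```python
import pandas as pd

# Hypothetical fragments of a Customer 360: CRM records and clickstream events
crm = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "name": ["Amy", "Raj", "Li"],
    "segment": ["gold", "silver", "gold"],
})
clicks = pd.DataFrame({
    "customer_id": [101, 101, 103],
    "page": ["/pricing", "/checkout", "/support"],
})

# Aggregate the behavioural data, then enrich the CRM records with it
page_views = clicks.groupby("customer_id").size().rename("page_views").reset_index()
customer_360 = crm.merge(page_views, on="customer_id", how="left").fillna({"page_views": 0})
```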

Data validation

There is a very interesting term in computer science called garbage in, garbage out (GIGO). GIGO is the concept that flawed or defective (garbage) input data produces defective output.

In other words, the quality of the output is determined by the quality of the input. So, if we provide bad data as input, we will get inaccurate results.

In the data validation step, we address this issue by performing various data quality checks:

Business validation of data accuracy

Validate data security

Validate result consistency across the entire dataset

Validate data quality by running checks such as the following: number of records, duplicate values, missing values, outliers, distinct values, unique values, and correlation

There is a lot of overlap between data cleaning and data validation, and the two processes share many similarities. However, data validation is performed on the resulting dataset, while data cleaning is primarily performed on the raw dataset.
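As a rough illustration, the following pandas sketch computes a few of the quality checks listed above; the dataset is hypothetical:

```python
import pandas as pd

# Hypothetical wrangled output to validate before publishing
df = pd.DataFrame({
    "order_id": [1, 2, 2, 4],
    "region": ["South", "North", "North", None],
    "amount": [250.0, 125.5, 125.5, 9999.0],
})

checks = {
    "record_count": len(df),
    "duplicate_rows": int(df.duplicated().sum()),
    "missing_regions": int(df["region"].isna().sum()),
    "distinct_regions": int(df["region"].nunique()),
    "order_id_is_unique": bool(df["order_id"].is_unique),
}

# Correlation across numeric columns can reveal broken joins or constant columns
correlations = df.corr(numeric_only=True)

print(checks)
```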

Data publishing

After completing all the data wrangling steps, the data is ready to be used for analytics so that it can solve business problems.

So, the final step is to publish the data to the end user with the required access and permission.

In this step, we primarily concentrate on how the data is being exposed to the end user and where the final data gets stored – that is, in a relational database, a data warehouse, curated or user zones in a data lake, or through the Secure File Transfer Protocol (SFTP).

The choice of data storage depends on the tool through which the end user is looking to access the data. For example, if the end user is looking to access data through BI tools such as Amazon QuickSight, Power BI, Informatica, and so on, a relational data store will be an ideal choice. If it is accessed by a data scientist, ideally, it should be stored in an object store.
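As an illustrative sketch, and not a pipeline prescribed by the book, AWS SDK for pandas can publish a wrangled dataset to a curated zone in Amazon S3 and load it into a warehouse table for BI consumers; the bucket, schema, table, and Glue connection names below are placeholders:

```python
import awswrangler as wr
import pandas as pd

# Hypothetical curated output; bucket, schema, table, and connection names are placeholders
df = pd.DataFrame({"customer_id": [101, 102], "segment": ["gold", "silver"]})

# Publish to the curated zone of the data lake for data science consumers
wr.s3.to_parquet(df=df, path="s3://example-data-lake/curated/customers/", dataset=True)

# Load the same data into a warehouse table for BI consumers
con = wr.redshift.connect("example-glue-connection")  # placeholder Glue connection name
wr.redshift.copy(
    df=df,
    path="s3://example-data-lake/stage/customers/",   # staging prefix used by the COPY command
    con=con,
    schema="analytics",
    table="customers",
    mode="overwrite",
)
con.close()
```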

We will learn about the different kinds of data stores we can use to store raw and wrangled data later in this book.

In this section, we learned about the various steps of the data wrangling process through our friends James and Jean and how these steps may or may not vary based on personas. Now, let’s understand the best practices for data wrangling.

Best practices for data wrangling

There are many ways and tools available to perform data wrangling, depending on how data wrangling is performed and by whom. For example, if you are working on real-time use cases such as providing product recommendations or fraud detection, your choice of tool and process for performing data wrangling will be a lot different compared to when you are looking to build a business intelligence (BI) dashboard to show sales numbers.

Regardless of the kind of use cases you are looking to solve, some standard best practices can be applied in each case that will help make your job easier as a data wrangler.

Identifying the business use case

It’s recommended that you decide which service or tool you are looking to use for data wrangling before you write a single line of code. It is super important to identify the business use case as this will set the stage for data wrangling processes and make the job of identifying the services you are looking to use easier. For example, if you have a business use case such as analyzing HR data for small organizations where you just need to concatenate a few columns, remove a few columns, remove duplicates, remove NULL values, and so on from a small dataset that contains 10,000 records, and only a few users will be looking to access the wrangled data, then you don’t need to invest a ton of money to find a fancy data wrangling tool available on the market – you can simply use Excel sheets for your work.

However, when you have a business use case such as processing claims data received from different partners, where you need to work with semi-structured files such as JSON or unstructured datasets such as XML files to extract only a few fields, such as the claim ID and customer information, and you are looking to perform complex data wrangling operations such as joins and finding patterns using regex, then you should look to write scripts or subscribe to an enterprise-grade tool for your work.

Identifying the data source and bringing the right data

After identifying the business use case, it is important to identify which data sources are required to solve it. Identifying these sources will help you choose which services are required to ingest the data, at what frequency, and where it will ultimately be stored. For example, if you are looking to build a credit card fraud detection solution, you need to bring in credit card transaction data in real time; even cleaning and processing the data should be done in real time. Machine learning inference also needs to be run on real-time data.

Similarly, if you are building a sales dashboard, you may need to bring in data from a CRM system such as Salesforce or a transactional datastore such as Oracle, Microsoft SQL Server, and so on.

After identifying the right data sources, it is important to bring in the right data from these data sources as it will help you solve the business use cases and make the data wrangling process easy.

Identifying your audience

When you perform data wrangling, one important aspect is to identify your audience. Knowing your audience will help you identify what kind of data they are looking to consume. For example, marketing teams may have different data wrangling requirements compared to data science teams or business executives.

This will also give you an idea of where you are looking to present the data – for example, a data science team may need data in an object store such as Amazon S3, business analysts may need data in flat files such as CSV, BI developers may need data in a transactional data store, and business users may need data in applications.

With that, we have covered the best practices of data wrangling. Next, we will explore the different options that are available within AWS to perform data wrangling.

Options available for data wrangling on AWS

Depending on customer needs, data sources, and team expertise, AWS provides multiple options for data wrangling. In this section, we will cover the most common options that are available with AWS.

AWS Glue DataBrew

Released in 2020, AWS Glue DataBrew is a visual data preparation tool that makes it easy for you to clean and normalize data so that you can prepare it for analytics and machine learning. The visual UI provided by this service allows data analysts with no coding or scripting experience to accomplish all aspects of data wrangling. It comes with a rich set of common pre-built data transformation actions that can simplify these data wrangling activities. Similar to any Software as a Service (SaaS) (https://en.wikipedia.org/wiki/Software_as_a_service) offering, customers can start using the web UI without the need to provision any servers and only need to pay for the resources they use.

SageMaker Data Wrangler

Similar to AWS Glue DataBrew, AWS also provides SageMaker Data Wrangler, a web UI-based data wrangling service catered more toward data scientists. If the primary use case is building a machine learning pipeline, SageMaker Data Wrangler should be the preferred choice. It integrates directly with SageMaker Studio, where data that’s been prepared using SageMaker Data Wrangler can be fed into a data pipeline to build, train, and deploy machine learning models. It comes with pre-configured data transformations to impute missing data with means or medians, perform one-hot encoding, and apply the time series-specific transformations required for preparing data for machine learning.

AWS SDK for pandas

For customers with a strong data integration team with coding and scripting experience, AWS SDK for pandas (https://github.com/aws/aws-sdk-pandas) is a great option. Built on top of other open source projects, it offers abstracted functions for executing typical data wrangling tasks such as loading/unloading data from various databases, data warehouses, and object data stores such as Amazon S3. AWS SDK for pandas simplifies integration with common AWS services such as Athena, Glue, Redshift, Timestream, OpenSearch, Neptune, DynamoDB, and S3. It also supports common databases such as MySQL and SQL Server.
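For example, a minimal sketch of typical AWS SDK for pandas calls, assuming a hypothetical data lake path and Glue database:

```python
import awswrangler as wr

# Read Parquet objects from the data lake directly into a pandas DataFrame (placeholder path)
sales = wr.s3.read_parquet(path="s3://example-data-lake/raw/sales/", dataset=True)

# Run a SQL query through Amazon Athena and get the result back as a DataFrame
top_regions = wr.athena.read_sql_query(
    sql="SELECT region, SUM(amount) AS total FROM sales GROUP BY region",
    database="example_database",  # placeholder Glue database name
)
```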

Summary

In this chapter, we learned about the basics of data wrangling, why it is important, the steps and best practices of data wrangling, and how the data wrangling steps vary based on persona. We also talked about the different data wrangling options available in AWS.

In the upcoming chapters, we will dive deep into each of these options and learn how to use these services to perform data wrangling.

Part 2: Data Wrangling with AWS Tools

In this section, we will explore three powerful AWS tools designed to streamline data-wrangling and preparation tasks. First, we’ll delve into AWS Glue DataBrew, learning how to cleanse, transform, and enrich data to ensure high-quality and usable datasets. After that, we’ll uncover the versatility of the AWS SDK for pandas, gaining a comprehensive understanding of efficient data manipulation and preparation on the AWS platform. Finally, we’ll explore Amazon SageMaker Data Wrangler, equipping you with the skills to seamlessly preprocess and prepare data for impactful machine learning projects.

This part has the following chapters:

Chapter 2, Introduction to AWS Glue DataBrew

Chapter 3, Introducing AWS SDK for pandas

Chapter 4, Introduction to SageMaker Data Wrangler

2

Introduction to AWS Glue DataBrew

The previous chapter discussed how organizations are transitioning toward modern data architecture. A crucial aspect of modern data architecture is enabling access to data for every individual in the organization. This promotes data-driven decision-making, leading to empowered users, increased productivity, and, ultimately, higher profits and greater customer and employee satisfaction.

This chapter focuses on Amazon Web Services (AWS) Glue DataBrew and its application in data wrangling. The following topics will be covered:

The benefits of using AWS Glue DataBrew

Starting to use AWS Glue DataBrew

Working with AWS Glue DataBrew

Why AWS Glue DataBrew?

Having the right tools is a crucial factor in enabling organizations to become data-driven, and AWS Glue DataBrew is one such tool. It was introduced at re:Invent 2020 as part of the AWS Glue family.

Initially, when AWS Glue was launched in August 2017, it was targeted at developers and data engineers who were writing Apache Spark code. The goal was to provide them with a platform that offered both the compute and storage resources to run their Spark code. This allowed them to take advantage of the speed and ease of use of Apache Spark, which can be up to 100 times faster than Hadoop MapReduce for large-scale in-memory data processing, while also leveraging the benefits of the cloud, such as elasticity, performance, and cost-effectiveness.

As the adoption of the public cloud increased and became more mainstream over time, AWS Glue evolved to meet the changing needs of enterprises. Initially, it was primarily used as an ETL tool, but it has since expanded to become a more comprehensive data governance tool. This expansion includes the addition of features such as a Python engine, Glue Data Catalog, Glue crawlers, and AWS Glue Studio. Additionally, the persona of users has evolved beyond just developers to now include business personnel such as data analysts, business analysts, product managers, and data scientists.

Despite the addition of new features, including a UI experience to author Glue ETL jobs and the introduction of Glue Studio in 2021, there has been a common request for greater ease of use. While these features cater primarily to performing ETL using Apache Spark, they still require some level of coding or scripting knowledge.

Business users require a tool that eliminates the delay between data producers and data consumers, between developers and business users. They seek an easy-to-use tool, such as Microsoft Excel, without the need to learn programming or scripting languages such as Scala, PySpark, Python, or SQL. Typically, in enterprises, business users rely heavily on BI developers to build dashboards based on different KPIs that enable them to make critical business decisions and identify any anomalies. However, making minor changes to KPIs, data sources, or even data can take a week or more.

Apart from ease of use, customers also need a solution that scales. While Microsoft Excel is one of the best data-wrangling tools, it lacks scalability. Although it is ideal for small datasets with thousands of records, it struggles when the data grows to tens of thousands or millions of records.

Based on interactions with organizations of all sizes, ranging from small start-ups to large enterprises looking to establish a data-driven culture or enable data as a service (DaaS), the biggest challenge is how to provide secure and high-quality data access. Customers want a tool that enables them to perform data wrangling without downloading data to their local machines, which poses significant data security risks and can cause data quality issues, particularly in verticals such as life sciences, government, and financial institutions. Although AWS Glue enables the building of ETL pipelines, performing data analysis has typically required either downloading data from data lakes (Amazon S3) or data warehouses such as Amazon Redshift or Snowflake to a local desktop, leading to data security and quality risks, or using third-party tools, which pose their own data security risks because data has to move from the AWS cloud to those tools via the internet, adding to the cost.

For example, many data privacy laws, such as the EU General Data Protection Regulation (GDPR), the California Consumer Privacy Act (CCPA), and the Health Insurance Portability and Accountability Act (HIPAA) of 1996, now restrict companies’ use of customer data, where to store it, and which tools they can use. For instance, traveling with a laptop with data that should not go beyond national boundaries or using non-HIPAA-compliant tools may pose legal and financial challenges for organizations.

One of the challenges in working with big data is maintaining data quality. For instance, tools such as Excel may not provide the same level of precision as Spark, especially when working with large datasets in the cloud. Moreover, few tools on the market offer an easy-to-use UI built on open source technologies such as Apache Spark.

This is where AWS Glue DataBrew comes in. It is a no-code/low-code, serverless visual data preparation tool from AWS that enables data professionals, such as data analysts, business analysts, BI developers, and data scientists, to perform data wrangling without writing any code. It provides over 300 ready-to-use transformations to automate the data-wrangling process, including data discovery, structuring, quality checks, cleaning, validation, anomaly detection, and personally identifiable information (PII) identification. This reduces the time it takes to prepare data for analytics and machine learning by up to 80%.

One key feature of AWS Glue DataBrew is its serverless nature, which provides agility by eliminating the need to manage infrastructure tasks such as resource provisioning, scaling, high availability, and server maintenance. AWS Glue DataBrew is built on the open source Apache Spark, providing the scale and speed of Spark without licensing costs. It is particularly well suited for big data projects where data is stored in a data lake such as Amazon S3.

In addition to being serverless, AWS Glue DataBrew offers advanced data profiling, visual data lineage, integration with data pipelines, and the ability to prepare data for advanced machine learning use cases.

The following figure illustrates how DataBrew works at a high level.

Figure 2.1: How DataBrew works at a high level

To lay the groundwork for building data-wrangling solutions on AWS Glue DataBrew, let us first delve into its basic building blocks.

AWS Glue DataBrew’s basic building blocks

Here are the core building blocks of DataBrew:

Projects: In DataBrew, a project encompasses all the necessary components for you to explore, comprehend, merge, clean, and standardize your data visually. Creating a project establishes a workspace that stores information about your data, the transformations applied to it, and the scheduled jobs that transform it.

Figure 2.2: Projects list in the AWS Glue DataBrew console

DataBrew datasets: With DataBrew datasets, you can create new datasets from various data sources. These sources include files uploaded from your local machine, data stored in the Amazon S3 data lake, metadata available in the AWS Glue Data Catalog crawled by the AWS Glue crawler, and data from different SaaS data sources, such as Datadog, Salesforce, and Marketo, through Amazon AppFlow. Additionally, AWS Data Exchange provides access to a broad range of third-party data products and makes them available to be used in DataBrew.

Figure 2.3: Datasets list in the AWS Glue DataBrew console

Figure 2.4: Data source list in the AWS Glue DataBrew console

Recipe: In AWS Glue DataBrew, a recipe is a set of transformation steps that you create or modify while working on a project. You can publish a recipe as a stand-alone entity. A published recipe consists only of the transformation steps, without any reference to the original data. You can share the published recipe by downloading it in YAML or JSON format. Recipes can have multiple published versions, which you publish from a project. You can also have multiple working copies of the same recipe, as you progress through multiple drafts.

Figure 2.5: Recipe lists in the AWS Glue DataBrew console

Job: In AWS Glue DataBrew, a job is a set of instructions that can be performed on a dataset or project. There are two types of jobs available:

Recipe jobs: A recipe job runs all the transformations in a recipe against the dataset or project and produces an output in a different file format, which can be saved to an Amazon S3 bucket in the same or an external AWS account.

Profile jobs: A profile job examines the dataset to provide information about correlation, value distribution, column statistics, and more.

For recipe jobs, you can define job output settings such as the file type and Amazon S3 path to the output object. You can also compress the output. Additionally, you can associate a schedule with a job and choose how often and when you want the job to run. AWS Glue DataBrew offers a variety of scheduling options, including cron settings.

Figure 2.6: Job lists in the AWS Glue DataBrew console
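Jobs and schedules can also be created programmatically. The following boto3 sketch creates a recipe job that writes Parquet output and schedules it to run daily; the job, dataset, recipe, role, and bucket names are placeholders, and the recipe and IAM role are assumed to already exist:

```python
import boto3

databrew = boto3.client("databrew")

# Create a recipe job that writes Parquet output to S3
# (job, dataset, recipe, role, and bucket names are placeholders)
databrew.create_recipe_job(
    Name="sales-cleanup-job",
    RoleArn="arn:aws:iam::123456789012:role/example-databrew-role",
    DatasetName="sales-dataset",
    RecipeReference={"Name": "sales-cleanup-recipe", "RecipeVersion": "1.0"},
    Outputs=[{
        "Location": {"Bucket": "example-output-bucket", "Key": "databrew/sales/"},
        "Format": "PARQUET",
    }],
)

# Schedule the job to run every day at 02:00 UTC
databrew.create_schedule(
    Name="sales-cleanup-daily",
    JobNames=["sales-cleanup-job"],
    CronExpression="cron(0 2 * * ? *)",
)
```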

DataBrew data lineage: Data lineage allows you to easily track and visualize the origin, events, and target of your datasets, providing a clear understanding of the path your data has taken.

Figure 2.7: Data lineage visual in the AWS Glue DataBrew console

Data profiling: Data profiling is one of the most important features available as part of Glue DataBrew. At its core, AWS Glue DataBrew enables users to create comprehensive data profiles for their datasets effortlessly. By running a data profile job through AWS Glue DataBrew, users can obtain valuable information about the current shape of their data. This includes a deep understanding of the context of the content, the underlying structure of the data, and the intricate relationships between various data elements.

Through advanced algorithms and intelligent analysis, AWS Glue DataBrew profiles meticulously examine each data point, providing descriptive statistics, data distribution insights, and data quality assessments. They help data professionals identify potential anomalies, missing values, and inconsistencies, enabling them to make informed decisions and take necessary actions to improve data quality.

AWS Glue DataBrew profiles also offer visualizations and intuitive dashboards, presenting the data profile in an easily understandable format. This enables users to grasp complex patterns, spot trends, and extract meaningful insights from their datasets more efficiently.

With its user-friendly interface and powerful capabilities, AWS Glue DataBrew profiles empower data professionals to unlock the hidden potential of their datasets. By harnessing the insights derived from the data profile, they can make informed decisions, enhance the data quality, and drive impactful outcomes for their organizations.

According to a Harvard Business Review report, only 3% of companies’ data meets basic quality standards, and on average, 47% of newly created data records have at least one critical error (that is, an error that impacts their work).

Figure 2.8: Data profile visual in the AWS Glue DataBrew console

Now that we have checked out the basics of AWS Glue DataBrew, let’s get started with using it for data wrangling.