Applied Machine Learning and High-Performance Computing on AWS - Mani Khanuja - E-Book

Description

Machine learning (ML) and high-performance computing (HPC) on AWS power compute-intensive workloads across industries and emerging applications, with use cases spanning verticals such as computational fluid dynamics (CFD), genomics, and autonomous vehicles.
This book provides end-to-end guidance, starting with HPC concepts for storage and networking. It then progresses to working examples of how to process large datasets using SageMaker Studio and EMR. Next, you'll learn how to build, train, and deploy large models using distributed training. Later chapters guide you through deploying models to edge devices using SageMaker and IoT Greengrass, and through performance optimization of ML models for low-latency use cases.
By the end of this book, you'll be able to build, train, and deploy your own large-scale ML application using HPC on AWS, following industry best practices and addressing the key pain points encountered in the application life cycle.

You can read this e-book in Legimi apps or in any app that supports the following formats:

EPUB
MOBI

Page count: 471

Year of publication: 2022




Applied Machine Learning and High-Performance Computing on AWS

Accelerate the development of machine learning applications following architectural best practices

Mani Khanuja

Farooq Sabir

Shreyas Subramanian

Trenton Potgieter

BIRMINGHAM—MUMBAI

Applied Machine Learning and High-Performance Computing on AWS

Copyright © 2022 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Publishing Product Manager: Dhruv Jagdish Kataria

Senior Editor: Nathanya Dias

Content Development Editor: Manikandan Kurup

Technical Editor: Devanshi Ayare

Copy Editor: Safis Editing

Project Coordinator: Farheen Fathima

Proofreader: Safis Editing

Indexer: Rekha Nair

Production Designer: Shankar Kalbhor

Marketing Coordinator: Shifa Ansari

First published: December 2022

Production reference: 2060123

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham

B3 2PB, UK.

ISBN 978-1-80323-701-5

www.packt.com

Contributors

About the authors

Mani Khanuja is a seasoned IT professional with over 17 years of software engineering experience. She has successfully led machine learning and artificial intelligence projects in various domains, such as forecasting, computer vision, and natural language processing. At AWS, she helps customers to build, train, and deploy large machine learning models at scale. She also specializes in data preparation, distributed model training, performance optimization, machine learning at the edge, and automating the complete machine learning life cycle to build repeatable and scalable applications.

Farooq Sabir is a research and development expert in machine learning, data science, big data, predictive analytics, computer vision, and image and video processing. He has over 10 years of professional experience.

Shreyas Subramanian helps AWS customers build and fine-tune large-scale machine learning and deep learning models, and rearchitect solutions to help improve the security, scalability, and efficiency of machine learning platforms. He also specializes in setting up massively parallel distributed training, hyperparameter optimization, and reinforcement learning solutions, and provides reusable architecture templates to solve AI and optimization use cases.

Trenton Potgieter is an expert technologist with 25 years of both local and international experience across multiple aspects of an organization, from IT to sales, engineering, and consulting, both on the cloud and on-premises. He has a proven ability to analyze, assess, recommend, and design appropriate solutions that meet key business criteria, as well as to present and teach them at levels ranging from engineering to executive.

About the reviewers

Ram Vittal has over 20 years of experience engineering software solutions for solving complex challenges across various business domains. Ram started his career with mainframes and then moved on to building distributed systems using Java technologies. Ram started his cloud journey in 2015 and has helped enterprise customers migrate, optimize, and scale their workloads on AWS. As a principal machine learning solutions architect, Ram has helped customers solve challenges across areas such as security, governance, and big data for building machine learning platforms. Ram has delivered thought leadership on big data, machine learning, and cloud strategies. Ram holds 11 AWS certifications and has a master’s degree in computer engineering.

Chakravarthy Nagarajan is a technical evangelist with 21 years of industry experience in machine learning, big data, and high-performance computing. He is currently working as a Principal AI/ML Specialist Solutions Architect at AWS in the Bay Area, USA. He helps customers solve complex real-world business problems by building prototypes with end-to-end AI/ML solutions on the cloud and on edge devices. His specializations include computer vision, natural language processing, time series forecasting, and personalization. He is also a public speaker and has published multiple blogs and white papers on HPC and AI/ML. On the academic front, he completed his MBA at United Institute, Brussels, and holds multiple certifications in AI and ML.

Anna Astori holds a master’s degree in computational linguistics and artificial intelligence from Brandeis University. Over the years, Anna has worked on multiple large-scale machine learning and data science applications for companies such as Amazon and Decathlon. Anna is an AWS Certified Developer and Solutions Architect. She speaks at conferences and podcasts, reviews talk proposals for tech conferences, and writes about Python and machine learning for curated publications on Medium. She is currently a co-director of the Women Who Code Boston network.

Kevin Sayers is a research scientist based in Colorado with expertise in high-performance computing (HPC). He holds a master’s degree in bioinformatics. His work has primarily focused on scientific workflows and he is an open source contributor to a number of workflow tools. He has previously worked with university HPC centers and a national lab supporting the HPC user community. He currently works in cloud HPC bringing customer HPC and machine learning workloads to the cloud.

Table of Contents

Preface

Part 1: Introducing High-Performance Computing

1

High-Performance Computing Fundamentals

Why do we need HPC?

Limitations of on-premises HPC

Barrier to innovation

Reduced efficiency

Lost opportunities

Limited scalability and elasticity

Benefits of doing HPC on the cloud

Drives innovation

Enables secure collaboration among distributed teams

Amplifies operational efficiency

Optimizes performance

Optimizes cost

Driving innovation across industries with HPC

Life sciences and healthcare

AVs

Supply chain optimization

Summary

Further reading

2

Data Management and Transfer

Importance of data management

Challenges of moving data into the cloud

How to securely transfer large amounts of data into the cloud

AWS online data transfer services

AWS DataSync

AWS Transfer Family

Amazon S3 Transfer Acceleration

Amazon Kinesis

AWS Snowcone

AWS offline data transfer services

Process for ordering a device from AWS Snow Family

Summary

Further reading

3

Compute and Networking

Introducing the AWS compute ecosystem

General purpose instances

Compute optimized instances

Accelerated compute instances

Memory optimized instances

Storage optimized instances

Amazon Machine Images (AMIs)

Containers on AWS

Serverless compute on AWS

Networking on AWS

CIDR blocks and routing

Networking for HPC workloads

Selecting the right compute for HPC workloads

Pattern 1 – a standalone instance

Pattern 2 – using AWS ParallelCluster

Pattern 3 – using AWS Batch

Pattern 4 – hybrid architecture

Pattern 5 – Container-based distributed processing

Pattern 6 – serverless architecture

Best practices for HPC workloads

Summary

References

4

Data Storage

Technical requirements

AWS services for storing data

Amazon Simple Storage Service (S3)

Amazon Elastic File System (EFS)

Amazon EBS

Amazon FSx

Data security and governance

IAM

Data protection

Data encryption

Logging and monitoring

Resilience

Tiered storage for cost optimization

Amazon S3 storage classes

Amazon EFS storage classes

Choosing the right storage option for HPC workloads

Summary

Further reading

Part 2: Applied Modeling

5

Data Analysis

Technical requirements

Exploring data analysis methods

Gathering the data

Understanding the data structure

Describing the data

Visualizing the data

Reviewing the data analytics life cycle

Reviewing the AWS services for data analysis

Unifying the data into a common store

Creating a data structure for analysis

Visualizing the data at scale

Choosing the right AWS service

Analyzing large amounts of structured and unstructured data

Setting up EMR and SageMaker Studio

Analyzing large amounts of structured data

Analyzing large amounts of unstructured data

Processing data at scale on AWS

Cleaning up

Summary

6

Distributed Training of Machine Learning Models

Technical requirements

Building ML systems using AWS

Introducing the fundamentals of distributed training

Reviewing the SageMaker distributed data parallel strategy

Reviewing the SageMaker model data parallel strategy

Reviewing a hybrid data parallel and model parallel strategy

Executing a distributed training workload on AWS

Executing distributed data parallel training on Amazon SageMaker

Executing distributed model parallel training on Amazon SageMaker

Summary

7

Deploying Machine Learning Models at Scale

Managed deployment on AWS

Amazon SageMaker managed model deployment options

The variety of compute resources available

Cost-effective model deployment

Blue/green deployments

Inference recommender

MLOps integration

Model registry

Elastic inference

Deployment on edge devices

Choosing the right deployment option

Using batch inference

Using real-time endpoints

Using asynchronous inference

Batch inference

Creating a transformer object

Creating a batch transform job for carrying out inference

Optimizing a batch transform job

Real-time inference

Hosting a machine learning model as a real-time endpoint

Asynchronous inference

The high availability of model endpoints

Deployment on multiple instances

Endpoints autoscaling

Endpoint modification without disruption

Blue/green deployments

All at once

Canary

Linear

Summary

References

8

Optimizing and Managing Machine Learning Models for Edge Deployment

Technical requirements

Understanding edge computing

Reviewing the key considerations for optimal edge deployments

Efficiency

Performance

Reliability

Security

Designing an architecture for optimal edge deployments

Building the edge components

Building the ML model

Deploying the model package

Summary

9

Performance Optimization for Real-Time Inference

Technical requirements

Reducing the memory footprint of DL models

Pruning

Quantization

Model compilation

Key metrics for optimizing models

Choosing the instance type, load testing, and performance tuning for models

Observing the results

Summary

10

Data Visualization

Data visualization using Amazon SageMaker Data Wrangler

SageMaker Data Wrangler visualization options

Adding visualizations to the data flow in SageMaker Data Wrangler

Data flow

Amazon’s graphics-optimized instances

Benefits and key features of Amazon’s graphics-optimized instances

Summary

Further reading

Part 3: Driving Innovation Across Industries

11

Computational Fluid Dynamics

Technical requirements

Introducing CFD

Reviewing best practices for running CFD on AWS

Using AWS ParallelCluster

Using CFD Direct

Discussing how ML can be applied to CFD

Summary

References

12

Genomics

Technical requirements

Managing large genomics data on AWS

Designing architecture for genomics

Applying ML to genomics

Protein secondary structure prediction for protein sequences

Summary

13

Autonomous Vehicles

Technical requirements

Introducing AV systems

AWS services supporting AV systems

Designing an architecture for AV systems

ML applied to AV systems

Model development

Step 1 – build and push the CARLA container to Amazon ECR

Step 2 – configure and run CARLA on RoboMaker

Summary

References

14

Numerical Optimization

Introduction to optimization

Goal or objective function

Variables

Constraints

Modeling an optimization problem

Optimization algorithm

Local and global optima

Common numerical optimization algorithms

Random restart hill climbing

Simulated annealing

Tabu search

Evolutionary methods

Example use cases of large-scale numerical optimization problems

Traveling salesperson optimization problem

Worker dispatch optimization

Assembly line optimization

Numerical optimization using high-performance compute on AWS

Commercial optimization solvers

Open source optimization solvers

Numerical optimization patterns on AWS

Machine learning and numerical optimization

Summary

Further reading

Index

Other Books You May Enjoy

Part 1: Introducing High-Performance Computing

The objective of Part 1 is to introduce the concepts of high-performance computing (HPC) and the art of possibility, and, most importantly, to make you understand that what was once confined to large enterprises, government bodies, or academic institutions is now within reach, as well as how industries are leveraging it to drive innovation.

This part comprises the following chapters:

Chapter 1, High-Performance Computing Fundamentals
Chapter 2, Data Management and Transfer
Chapter 3, Compute and Networking
Chapter 4, Data Storage

1

High-Performance Computing Fundamentals

High-Performance Computing (HPC) impacts every aspect of your life, from your morning coffee to driving a car to get to the office, knowing the weather forecast, your vaccinations, the movies that you watch, the flights that you take, the games that you play, and many other aspects. Many of our actions leave a digital footprint, leading to the generation of massive amounts of data. In order to process such data, we need a large amount of processing power. HPC, also known as accelerated computing, aggregates the computing power from a cluster of nodes and divides the work among various interconnected processors to achieve much higher performance than a single computer or machine could deliver, as shown in Figure 1.1. This helps in solving complex scientific and engineering problems in critical business applications such as drug discovery, flight simulation, supply chain optimization, financial risk analysis, and so on:

Figure 1.1 – HPC
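To make the divide-and-aggregate idea concrete, the following minimal sketch (not taken from any AWS service) splits a computation across the processor cores of a single machine using Python's multiprocessing module; an HPC cluster applies the same principle across many interconnected nodes. The chunk sizes and the simulated kernel are illustrative assumptions only.

```python
from multiprocessing import Pool

def simulate_chunk(chunk):
    # Stand-in for a compute-intensive kernel, such as one slice of a simulation.
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    total_items = 10_000_000
    step = 1_000_000
    # Divide the work into independent chunks.
    chunks = [range(start, start + step) for start in range(0, total_items, step)]
    # Fan the chunks out to worker processes and run them in parallel.
    with Pool(processes=4) as pool:
        partial_results = pool.map(simulate_chunk, chunks)
    # Aggregate the partial results, much as a cluster gathers node outputs.
    print(sum(partial_results))
```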

For example, drug discovery is a data-intensive process that involves computationally heavy calculations to simulate how a virus protein binds with a human protein. This is an extremely expensive process and may take weeks or months to finish. By combining Machine Learning (ML) with accelerated computing, researchers can simulate drug-protein interactions with greater speed and accuracy. This leads to faster experimentation and significantly reduces the time to market.

In this chapter, we will learn the fundamentals and importance of HPC, followed by technological advancements in the area. We will understand the constraints of on-premises infrastructure and how developers can benefit from the elasticity of the cloud while still optimizing costs, innovating faster, and gaining a competitive business advantage.

In this chapter, we will cover the following topics:

Why do we need HPC?
Limitations of on-premises HPC
Benefits of doing HPC on the cloud
Driving innovation across industries with HPC

Why do we need HPC?

According to Statista, the volume of data created globally is forecast to keep increasing rapidly; it reached 64.2 zettabytes in 2020 and is estimated to grow to more than 180 zettabytes by 2025. Due to the COVID-19 pandemic, data growth in 2020 reached a new high as more people were learning online and working remotely from home. As data continuously increases, so does the need to analyze and process it. This is where HPC is a useful mechanism. It helps organizations think beyond their existing capabilities and explore possibilities with advanced computing technologies. Today, HPC applications, which were once confined to large enterprises and academia, are spreading across a wide range of industries, including material sciences, manufacturing, product quality improvement, genomics, numerical optimization, computational fluid dynamics, and many more. The list of applications for HPC will continue to grow, as cloud infrastructure makes it accessible to organizations irrespective of their size, while still optimizing cost, helping them innovate faster and gain a competitive advantage.

Before we take a deeper look into doing HPC on the cloud, let’s understand the limitations of running HPC applications on-premises, and how we can overcome them by using specialized HPC services provided by the cloud.

Limitations of on-premises HPC

HPC applications are often based on complex models trained on large amounts of data, which require high-performing hardware such as Graphics Processing Units (GPUs) and software for distributing the workload among different machines. Some applications may need parallel processing, while others may require low-latency and high-throughput networking. Similarly, applications such as gaming and video analysis may need performance acceleration using a fast input/output subsystem and GPUs. Catering to all of these different types of HPC applications on-premises can be daunting in terms of cost and maintenance.

Some of the well-known challenges include, but are not limited to, the following:

High upfront capital investment
Long procurement cycles
Maintaining the infrastructure over its life cycle
Technology refreshes
Forecasting the annual budget and capacity requirements

Due to the above-mentioned constraints, planning for an HPC system can be a grueling process, and its Return On Investment (ROI) might be difficult to justify. This can be a barrier to innovation, with slow growth, reduced efficiency, lost opportunities, and limited scalability and elasticity. Let's understand the impact of each of these in detail.

Barrier to innovation

The constraints of on-premises infrastructure can limit the system design, which will be more focused on the availability of the hardware instead of the business use case. You might not consider some new ideas if they are not supported by the existing infrastructure, thus obstructing your creativity and hindering innovation within the organization.

Reduced efficiency

Once you finish developing the various components of the system, you might have to wait in long, prioritized queues to test your jobs, which can take weeks, even if the job itself takes only a few hours to run. On-premises infrastructure is designed to maximize the utilization of expensive hardware, often resulting in very convoluted policies for prioritizing the execution of jobs, thus decreasing your productivity and ability to innovate.

Lost opportunities

In order to take full advantage of the latest technology, organizations have to refresh their hardware. In the past, a typical refresh cycle of three years was enough to stay current and meet the demands of HPC workloads. However, due to fast technological advancements and a faster pace of innovation, organizations need to refresh their infrastructure more often; otherwise, they risk a larger downstream business impact in terms of revenue. For example, technologies such as Artificial Intelligence (AI), ML, data visualization, and risk analysis of financial markets are pushing the limits of on-premises infrastructure. Moreover, due to the advent of the cloud, a lot of these technologies are cloud native and deliver higher performance on large datasets when running in the cloud, especially with workloads that use transient data.

Limited scalability and elasticity

HPC applications rely heavily on infrastructure elements such as containers, GPUs, and serverless technologies, which are not readily available in an on-premises environment and often have a long procurement and budget approval process. Moreover, maintaining these environments, making sure they are fully utilized, and even upgrading the OS or software packages require skills and dedicated resources. Deploying different types of HPC applications on the same hardware is very limiting in terms of scalability and flexibility and does not provide you with the right tools for the job.

Now that we understand the limitations of doing HPC on-premises, let’s see how we can overcome them by running HPC workloads on the cloud.

Benefits of doing HPC on the cloud

With virtually unlimited capacity on the cloud, you can move beyond the constraints of on-premises HPC. You can reimagine new approaches based on the business use case, experiment faster, and gain insights from large amounts of data, without the need for costly on-premises upgrades and long procurement cycles. You can run complex simulations and deep learning models in the cloud and quickly move from idea to market using scalable compute capacity, high-performance storage, and high-throughput networking. In summary, it enables you to drive innovation, collaborate among distributed teams, improve operational efficiency, and optimize performance and cost. Let’s take a deeper look into each of these benefits.

Drives innovation

Moving HPC workloads to the cloud helps you break barriers to innovation and opens the door to unlimited possibilities. You can quickly fail forward, try out thousands of experiments, and make business decisions based on data. The benefit that I really like is that, once you solve the problem, it remains solved, and you don't have to revisit it after a system upgrade or a technology refresh. It eliminates rework and the maintenance of hardware, lets you focus on the business use case, and enables you to quickly design, develop, and test new products. The elasticity offered by the cloud allows you to grow and shrink the infrastructure as per your requirements. Additionally, cloud-based services offer native features, which remove the heavy lifting and let you adopt tested and verified HPC applications without having to write and manage all the utility libraries on your own.

Enables secure collaboration among distributed teams

HPC workloads on the cloud allow you to share designs, data, visualizations, and other artifacts globally with your teams, without the need to duplicate or proliferate sensitive data. For example, building a digital twin (a real-time digital counterpart of a physical object) can help with predictive maintenance. It captures the state of the object in real time and monitors and diagnoses the object (asset) to optimize its performance and utilization. Building a digital twin requires a cross-team skill set, often spread across remote locations, to capture data from various IoT sensors, perform extensive what-if analysis, and meticulously build a simulation model that accurately represents the physical object. The cloud provides a collaboration platform where different teams can interact with a simulation model in near real time, without moving or copying data to different locations, and ensures compliance with rapidly changing industry regulations. Moreover, you can use native features and services offered by the cloud, for example, AWS IoT TwinMaker, which can use existing data from multiple sources, create virtual replicas of physical systems, and combine 3D models to give you a holistic view of your operations faster and with less effort. The broad global presence of HPC technologies on the cloud allows you to work together with your remote teams across different geographies without trading off security and cost.

Amplifies operational efficiency

Operational efficiency means that you are able to support the development and execution of workloads, gain insights, and continuously improve the processes that are supporting your applications. The design principles and best practices include automating processes, making frequent and reversible changes, refining your operations frequently, and being able to anticipate and recover from failures. Having your HPC applications on the cloud enables you to do that, as you can version control your infrastructure as code, similar to your application code, and integrate it with your Continuous Integration and Continuous Delivery (CI/CD) pipelines. Additionally, with on-demand access to unlimited compute capacity, you will no longer have to wait in long queues for your jobs to run. You can skip the wait and focus on solving business critical problems, providing you with the right tools for the right job.
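As a small illustration of treating infrastructure as code, the following hedged sketch uses the AWS SDK for Python (Boto3) to create a CloudFormation stack from a version-controlled template; the stack name and the single-bucket template are hypothetical placeholders, and a CI/CD pipeline step could apply the same template on every change.

```python
import boto3

# Hypothetical template, stored in version control alongside the application code.
TEMPLATE_BODY = """
AWSTemplateFormatVersion: '2010-09-09'
Resources:
  ResultsBucket:
    Type: AWS::S3::Bucket
    Properties:
      VersioningConfiguration:
        Status: Enabled
"""

cloudformation = boto3.client("cloudformation")

# Creating the stack provisions the declared resources; a later pipeline run
# with an updated template would use update_stack or a change set instead.
cloudformation.create_stack(
    StackName="hpc-results-storage",  # hypothetical stack name
    TemplateBody=TEMPLATE_BODY,
)
```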

Optimizes performance

Performance optimization involves the ability to use resources efficiently and to maintain them as the application changes or evolves. Some of the best practices include making the implementation easier for your team, using serverless architectures where possible, and being able to experiment faster. For example, developing ML models and integrating them into your application requires special expertise, which can be alleviated by using out-of-the-box models provided by cloud vendors, such as the services in the AI and ML stack from AWS. Moreover, you can leverage compute, storage, and networking services specially designed for HPC and eliminate long procurement cycles for specialized hardware. You can quickly carry out benchmarking or load testing and use that data to optimize your workloads without worrying about cost, as you only pay for the time you use the resources on the cloud. We will explore this concept further in Chapter 5, Data Analysis, and Chapter 6, Distributed Training of Machine Learning Models.

Optimizes cost

Cost optimization is a continuous process of monitoring and improving resource utilization over an application's life cycle. By adopting the pay-as-you-go consumption model and increasing or decreasing usage depending on business needs, you can achieve potential cost savings. You can commission and decommission HPC clusters in minutes, instead of days or weeks, which lets you gain access to resources rapidly, as and when needed. You can measure overall efficiency by comparing the business value achieved with the cost of delivery. With this data, you can make informed decisions, as well as understand the gains from increasing the application's functionality while reducing cost.

Running HPC in the cloud helps you overcome the limitations associated with traditional on-premises infrastructure: fixed capacity, long procurement cycles, technology obsolescence, high upfront capital investment, maintaining the hardware, and applying regular Operating System (OS) and software updates. The cloud gives you virtually unlimited HPC capacity with the latest technology to promote innovation, which helps you design your architecture based on business needs instead of available hardware, minimizes the need for job queues, and improves operational and performance efficiency while still optimizing cost.

Next, let’s see how different industries such as Autonomous Vehicles (AVs), manufacturing, media and entertainment, life sciences, and financial services are driving innovation with HPC workloads.

Driving innovation across industries with HPC

Every industry and type of HPC application poses different kinds of challenges. The HPC solutions provided by cloud vendors such as AWS help all companies, irrespective of their size, which leads to emerging HPC applications such as reinforcement learning, digital twins, supply chain optimization, and AVs.

Let’s take a look at some of the use cases in life sciences and healthcare, AV, and supply chain optimization.

Life sciences and healthcare

In the life sciences and healthcare domain, a large amount of sensitive and meaningful data is captured almost every minute of the day. Using HPC technology, we can harness this data to gain meaningful insights into critical diseases and save lives by reducing the time taken to test lab samples, accelerating drug discovery, and much more, while meeting core security and compliance requirements.

The following are some of the emerging applications in the healthcare and life sciences domain.

Genomics

You can use cloud services provided by AWS to store and share genomic data securely, which helps you build and run predictive or real-time applications to accelerate the journey from genomic data to genomic insights. This helps to reduce data processing times significantly and perform causal analysis of critical diseases such as cancer and Alzheimer's.

Imaging

Using computer vision and data integration services, you can elevate image analysis and facilitate long-term data retention. For example, by using ML to analyze MRI or X-ray scans, radiology companies can improve operational efficiency and quickly generate lab reports for their patients. Some of the technologies provided by AWS for imaging include Amazon EC2 GPU instances, AWS Batch, AWS ParallelCluster, AWS DataSync, and Amazon SageMaker, which we will discuss in detail in subsequent chapters.

Computational chemistry and structure-based drug design

Combining state-of-the-art deep learning models for protein classification, advancements in protein structure solutions, and algorithms for describing 3D molecular models with HPC computing resources allows you to reduce the time to market drastically. For example, in a project performed by Novartis on the AWS cloud, they were able to screen 10 million compounds against a common cancer target in less than a week. Based on their internal calculations, performing a similar experiment in-house would have required an investment of about $40 million. By running the experiment on the cloud using AWS services and features, they were able to draw on their 39 years of computational chemistry data and knowledge. Moreover, it took only 9 hours and $4,232 to conduct the experiment, hence increasing their pace of innovation and experimentation. Of the 10 million compounds screened, they successfully identified three candidates.

Now that we understand some of the applications in the life sciences and healthcare domain, let us discuss how the automobile and transport industry is using HPC for building AVs.

AVs

The advancement of deep learning models such as reinforcement learning, object detection, and image segmentation, as well as technological advancements in compute and in deploying models on edge devices, have paved the way for AVs. In order to design and build an AV, all the components of the system have to work in tandem, including the planning, perception, and control systems. It also requires collecting and processing massive amounts of data and using it to create a feedback loop, so that vehicles can adjust their state based on the changing condition of the traffic on roads in real time. This entails having high I/O performance, networking, specialized hardware coprocessors such as GPUs or Field Programmable Gate Arrays (FPGAs), as well as analytics and deep learning frameworks. Moreover, before an AV can even start testing on actual roads, it has to undergo millions of miles of simulation to demonstrate safety performance, due to the high dimensionality of the environment, which is complex and time-consuming. By using the AWS cloud's virtually unlimited compute and storage capacity, support for advanced deep learning frameworks, and purpose-built services, you can achieve a faster time to market. For example, in 2017, Lyft, an American transportation company, launched its AV division. To enhance the performance and safety of its system, it uses petabytes of data collected from its AV fleet to execute millions of simulations every year, which requires a lot of compute power. To run these simulations at a lower cost, they decided to take advantage of unused compute capacity on the AWS cloud by using Amazon EC2 Spot Instances, which also helped them increase their capacity to run simulations at this magnitude.
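As a hedged sketch of how simulation capacity could be requested from spare EC2 capacity, the following uses the AWS SDK for Python (Boto3) to launch Spot Instances; the AMI ID, instance type, and instance counts are hypothetical placeholders, and the workload is assumed to tolerate interruption, as batch simulation jobs typically do.

```python
import boto3

ec2 = boto3.client("ec2")

# Launch up to 10 instances from spare EC2 capacity at a discounted price.
response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # hypothetical simulation AMI
    InstanceType="c5.4xlarge",
    MinCount=1,
    MaxCount=10,
    InstanceMarketOptions={
        "MarketType": "spot",
        "SpotOptions": {"SpotInstanceType": "one-time"},
    },
)

print([instance["InstanceId"] for instance in response["Instances"]])
```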

Next, let us understand supply chain optimization and its processes!

Supply chain optimization

Supply chains are worldwide networks of manufacturers, distributors, suppliers, logistics providers, and e-commerce retailers that work together to get products from the factory to the customer's door without delays or damage. By enabling information flow through these networks, you can automate decisions without any human intervention. The key attributes to consider are real-time inventory forecasts, end-to-end visibility, and the ability to track and trace the entire production process with unparalleled efficiency. Your teams will no longer have to handle the minuscule details associated with supply chain decisions. With automation and ML, you can resolve bottlenecks in product movement. For example, in the event of a pandemic or natural disaster, you can quickly divert goods to alternative shipping routes without affecting their on-time delivery.

Here are some examples of using ML to improve supply chain processes:

Demand Forecasting: You can combine a time series with additional correlated data, such as holidays, weather, and demographic events, and use deep learning models such as DeepAR to get more accurate results. This will help you meet variable demand and avoid over-provisioning (a minimal training sketch follows this list).
Inventory Management: You can automate inventory management using ML models to determine stock levels and reduce costs by preventing excess inventory. Moreover, you can use ML models for anomaly detection in your supply chain processes, which can help you optimize inventory management and deflect potential issues more proactively, for example, by transferring stock to the right location using optimized routing ahead of time.
Boost Efficiency with Automated Product Quality Inspection: By using computer vision models, you can identify product defects faster, with improved consistency and accuracy, at an early stage, so that customers receive high-quality products in a timely fashion. This reduces the number of customer returns and insurance claims filed due to issues in product quality, thus saving costs and time.
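As a hedged sketch of the demand forecasting idea above, the following trains SageMaker's built-in DeepAR algorithm with the SageMaker Python SDK; the IAM role, S3 bucket, instance type, and hyperparameter values are illustrative assumptions, not prescriptions from this book.

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

session = sagemaker.Session()
region = session.boto_region_name
role = "arn:aws:iam::111122223333:role/SageMakerExecutionRole"  # hypothetical role

# Retrieve the built-in DeepAR container for time series forecasting.
image_uri = image_uris.retrieve("forecasting-deepar", region)

estimator = Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=1,
    instance_type="ml.c5.2xlarge",
    output_path="s3://demand-forecast-bucket/output/",  # hypothetical bucket
    sagemaker_session=session,
)

# Daily series, forecasting two weeks ahead; values are illustrative only.
estimator.set_hyperparameters(
    time_freq="D",
    prediction_length=14,
    context_length=14,
    epochs=50,
)

# Channels point to JSON Lines time series data already prepared in S3.
estimator.fit({
    "train": "s3://demand-forecast-bucket/train/",
    "test": "s3://demand-forecast-bucket/test/",
})
```

Correlated inputs such as holidays or weather would be supplied as additional features in the training records rather than as separate channels.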

All the components of supply chain optimization discussed above need to work together as part of the workflow and therefore require low latency and high throughput in order to meet the goal of delivering an optimal quality product to a customer’s doorstep in a timely fashion. Using cloud services to build the workflow provides you with greater elasticity and scalability at an optimized cost. Moreover, with native purpose-built services, you can eliminate the heavy lifting and reduce the time to market.

Summary

In this chapter, we started by understanding HPC fundamentals and their importance in processing massive amounts of data to gain meaningful insights. We then discussed the limitations of running HPC workloads on-premises, as different types of HPC applications have different hardware and software requirements, which are time-consuming and costly to procure in-house. Moreover, this hinders innovation, as developers and engineers are limited by the availability of resources rather than by the application requirements. Then, we talked about how having HPC workloads on the cloud can help in overcoming these limitations and foster collaboration across global teams, break barriers to innovation, improve architecture design, and optimize performance and cost. Cloud infrastructure has made the specialized hardware needed for HPC applications more accessible, which has led to innovation in this space across a wide range of industries. Therefore, in the last section, we discussed some emerging workloads in HPC, such as those in life sciences and healthcare, supply chain optimization, and AVs, along with real-world examples.

In the next chapter, we will dive into data management and transfer, which is the first step to running HPC workloads on the cloud.

Further reading

You can check out the following links for additional information regarding this chapter’s topics:

https://www.statista.com/statistics/871513/worldwide-data-created/
https://d1.awsstatic.com/whitepapers/Intro_to_HPC_on_AWS.pdf
https://aws.amazon.com/solutions/case-studies/novartis/
https://aws.amazon.com/solutions/case-studies/Lyft-level-5-spot/
https://d1.awsstatic.com/psc-digital/2021/gc-700/supply-chain-ebook/Supply-Chain-eBook.pdf
https://d1.awsstatic.com/HCLS%20Whitepaper%20.pdf

2

Data Management and Transfer

In Chapter 1, High-Performance Computing Fundamentals, we introduced the concepts of HPC applications, why we need HPC, and its use cases across different industries. Before we begin developing HPC applications, we need to migrate the required data into the cloud. In this chapter, we will uncover some of the challenges in managing and transferring data to the cloud and the ways to mitigate them. We will dive deeper into AWS online and offline data transfer services, which you can use to securely transfer data to the AWS cloud while maintaining data integrity and consistency. We will cover different data transfer scenarios and provide guidance on how to select the right service for each one.

We will cover the following topics in this chapter:

Importance of data management
Challenges of moving data into the cloud
How to securely transfer large amounts of data into the cloud
AWS online data transfer services
AWS offline data transfer services

These topics will help you understand how you can transfer Gigabytes (GB), Terabytes (TB), or Petabytes (PB) of data to the cloud with minimal disruption, cost, and time.

Let’s get started with data management and its role in HPC applications.

Importance of data management

Data management is the process of effectively capturing, storing, and collating data created by different applications in your company to make sure it's accurate, consistent, and available when needed. It includes developing policies and procedures for managing your end-to-end data life cycle. The following are some of the elements of the data life cycle specific to HPC applications, which make it important to have data management policies in place:

Cleaning and transforming raw data to perform detailed, faultless analysis.
Designing and building data pipelines to automatically transfer data from one system to another.
Extracting, Transforming, and Loading (ETL) data from disparate data sources into appropriate data storage systems such as databases, data warehouses, and object storage or filesystems.
Building data catalogs for storing metadata to make it easier to find and track the data lineage.
Following policies and procedures as outlined by your data governance model. This also involves conforming to the compliance requirements of the federal and regional authorities of the country where data is being captured and stored. For example, if you are a healthcare organization in California, United States, you would need to follow both federal and state data privacy laws, including the Health Insurance Portability and Accountability Act (HIPAA) and California's health data privacy law, the Confidentiality of Medical Information Act (CMIA). Additionally, you would need to follow the California Consumer Privacy Act (CCPA), which came into effect on January 1, 2020, as it relates to healthcare data. If you are in Europe, you would have to follow the data guidelines governed by the European Union's General Data Protection Regulation (GDPR).
Protecting your data from unauthorized access, while at rest or in transit.

Now that we have understood the significance of data management in HPC applications, let’s see some of the challenges of transferring large amounts of data into the cloud.

Challenges of moving data into the cloud

In order to start building HPC applications on the cloud, you need to have data on the cloud, and also think about the various elements of your data life cycle in order to be able to manage data effectively. One way is to write custom code for transferring data, which will be time-consuming and might involve the following challenges:

Preserving the permissions and metadata of files.
Making sure that the data transfer does not impact other existing applications in terms of performance, availability, and scalability, especially in the case of online data transfer (transferring data over the network).
Scheduling data transfer for non-business hours to ensure other applications are not impeded.
In terms of structured data, you might have to think about schema conversion and database migration.
Maintaining data integrity and validating the transfer.
Monitoring the status of the data transfer, having the ability to look up the history of previous transfers, and having a retry mechanism in place to ensure successful transfers.
Making sure there are no duplicates – once the data has been transferred, the system should not trigger the transfer again.
Protecting data during the transfer, which includes encrypting data both in transit and at rest.
Ensuring data arrives intact and is not corrupted. You would need a mechanism to check that the data arriving at the destination matches the data read from the source to validate data consistency (see the checksum sketch after this list).
Last but not least, you would have to manage, version-control, and optimize your data-copying scripts.
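One common way to address the data integrity and consistency points above is to compare checksums computed at the source and at the destination. The following minimal sketch, with hypothetical file paths, streams a file through SHA-256 so that even very large files can be verified without loading them into memory.

```python
import hashlib
from pathlib import Path

def sha256_of_file(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Stream the file in chunks so large files do not exhaust memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Hypothetical paths: the original file and a copy retrieved from the destination.
source_checksum = sha256_of_file(Path("/data/images/frame_0001.jpg"))
copied_checksum = sha256_of_file(Path("/tmp/downloaded/frame_0001.jpg"))

print("intact" if source_checksum == copied_checksum else "corrupted in transit")
```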

The data transfer and migration services offered by AWS can assist you in securely transferring data to the cloud without you having to write and manage code, helping you overcome these aforementioned challenges. In order to select the right service based on your business requirement, you first need to build a data transfer strategy. We will discuss the AWS data transfer services in a subsequent section of this chapter. Let’s first understand the items that you need to consider while building your strategy.

In a nutshell, your data transfer strategy needs to take the following into account in order to move data with minimal disruption, time, and cost:

What kind of data do you need for developing your HPC application – for example, structured data, unstructured data (such as images and PDF documents), or a combination of both?
For unstructured data, which filesystem do you currently use for storing your files? Is it on Network Attached Storage (NAS) or a Storage Area Network (SAN)?
How much purchased storage is available right now, and how long will it last, based on the rate of growth of your data, before you plan to buy more storage?
For structured data, which database do you use? Are you tied to database licenses? If yes, when are they due for renewal, and what are the costs of the licenses?
What is the volume of data that you need to transfer to the cloud?
What other applications are using this data? Do these applications require local access to the data?
Will there be any performance impact on the existing applications if the data is moved to the cloud?
What is your network bandwidth? Is it good enough to transfer data over the network?
How quickly do you need to move your data to the cloud?

Based on the answers to these questions, you can create your data strategy and select appropriate AWS services that will help you to transfer data with ease and mitigate the challenges mentioned in the preceding list. To understand it better, let’s move to the next topic and see how to securely transfer large amounts of data into the cloud with a simple example.

How to securely transfer large amounts of data into the cloud

To understand this topic, let’s start with a simple example where you want to build and train a computer vision deep learning model to detect product defects in your manufacturing production line. You have cameras installed on each production line, which capture hundreds of images each day. Each image can be up to 5 MB in size, and you have about 1 TB of data, which is currently stored on-premises in a NAS filesystem that you want to use to train your machine learning model. You have about 1 Gbps of network bandwidth and need to start training your model in 2-4 weeks. There is no impact on other applications if the data is moved to the cloud and no structured data is needed for building the computer vision model. Let’s rearrange this information into the following structure, which will become part of your data strategy document:

Objective: To transfer 1 TB of image data to the cloud, where each file can be up to 5 MB in size. Need to automate the data transfer to copy about 10 GB of images every night to the cloud. Additionally, need to preserve the metadata and file permissions while copying data to the cloud.
Timeline: 2-4 weeks
Data type: Unstructured data – JPG or PNG format image files
Dependency: None
Impact on existing applications: None
Network bandwidth: 1 Gbps
Existing storage type: Network attached storage
Purpose of data transfer: To perform distributed training of a computer vision deep learning model using multiple GPUs
Data destination: Amazon S3, which is secure, durable, and the most cost-effective object storage on AWS for storing large amounts of data
Sensitive data: None, but data should not be available for public access
Local data access: Not required

Since you have 1 TB of data with a maximum file size of 5 MB to transfer securely to Amazon S3, you can use the AWS DataSync service. It is an AWS online data transfer service that migrates data securely using a Virtual Private Cloud (VPC) endpoint to avoid your data going through the open internet. We will discuss all the AWS data transfer services in detail in the later sections of this chapter.

The following architecture visually depicts how the transfer will take place:

Figure 2.1 – Data transfer using AWS DataSync with a VPC endpoint

The AWS DataSync agent transfers the data between your local storage, NAS in this case, and AWS. You deploy the agent in a Virtual Machine (VM) in your on-premises network, where your data source resides. With this approach, you can minimize the network overhead while transferring data using the Network File System (NFS) and Server Message Block (SMB) protocols.

Let’s take a deeper look into AWS DataSync in the next section.

AWS online data transfer services

Online data transfer services are out-of-the-box solutions built by AWS for transferring data between on-premises systems and the AWS cloud via the internet. They include the following services:

AWS DataSync
AWS Transfer Family
Amazon S3 Transfer Acceleration
Amazon Kinesis
AWS Snowcone

Let’s look at each of these services in detail to understand the scenarios in which we can use the relevant services.

AWS DataSync

AWS DataSync helps you overcome the challenges of transferring data from on-premises to AWS storage services and between AWS storage services in a fast and secure fashion. It also enables you to automate or schedule the data transfer to optimize your use of network bandwidth, which might be shared with other applications. You can monitor the data transfer task, add data integrity checks to make sure that the data transfer was successful, and validate that data was not corrupted during the transfer, while preserving the file permissions and associated metadata. DataSync offers integration with multiple filesystems and enables you to transfer data between the following resources:

On-premises file servers and object storage:
  NFS file servers
  SMB file servers
  Hadoop Distributed File System (HDFS)
  Self-managed object storage
AWS storage services:
  Snow Family devices
  Amazon Simple Storage Service (Amazon S3) buckets
  Amazon Elastic File System (EFS)
  Amazon FSx for Windows File Server
  Amazon FSx for Lustre filesystems

Important note

We will discuss AWS storage services in detail in Chapter 4, Data Storage.

Use cases

As discussed, AWS DataSync is used for transferring data to the cloud over the network. Let’s now see some of the specific use cases for which you can use DataSync:

Hybrid cloud workloads, where data is generated by on-premises applications and needs to be moved to and from the AWS cloud for processing. This can include HPC applications in healthcare, manufacturing, and life sciences, big data analytics in financial services, and research purposes.
Migrating data rapidly over the network into AWS storage services such as Amazon S3, where you need to make sure that data arrives securely and completely. DataSync has encryption and data integrity verification during transfer enabled by default. You can also choose to enable additional data verification checks to compare the source and destination data.
Data archiving, where you want to archive infrequently accessed data (cold data) directly into durable and long-term storage in the AWS cloud, such as Amazon S3 Glacier or S3 Glacier Deep Archive. This helps you free up your on-premises storage capacity and reduce costs.
Scheduling a data transfer job to automatically start on a recurring basis at a particular time of the day to optimize usage of network bandwidth, which might be shared with other applications. For example, in the life sciences domain, you may want to upload genomic data generated by on-premises applications for processing and training machine learning models on a daily basis. You can both schedule data transfer tasks and monitor them as required using DataSync (a minimal sketch using the AWS SDK for Python follows this list).
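The following is a minimal sketch, using the AWS SDK for Python (Boto3), of creating a scheduled, verified DataSync task from an on-premises NFS share to Amazon S3; the agent ARN, hostname, bucket, IAM role, and cron expression are hypothetical placeholders.

```python
import boto3

datasync = boto3.client("datasync")

# Source: the on-premises NFS share exposed through the deployed DataSync agent.
nfs_location = datasync.create_location_nfs(
    ServerHostname="nas.example.internal",
    Subdirectory="/exports/camera-images",
    OnPremConfig={"AgentArns": ["arn:aws:datasync:us-east-1:111122223333:agent/agent-0abc123"]},
)

# Destination: an S3 bucket, accessed through an IAM role granted to DataSync.
s3_location = datasync.create_location_s3(
    S3BucketArn="arn:aws:s3:::training-image-data",
    Subdirectory="/raw",
    S3Config={"BucketAccessRoleArn": "arn:aws:iam::111122223333:role/DataSyncS3Role"},
)

# Verify all data after transfer, preserve POSIX permissions, and run nightly.
task = datasync.create_task(
    SourceLocationArn=nfs_location["LocationArn"],
    DestinationLocationArn=s3_location["LocationArn"],
    Name="nightly-image-sync",
    Options={"VerifyMode": "POINT_IN_TIME_CONSISTENT", "PosixPermissions": "PRESERVE"},
    Schedule={"ScheduleExpression": "cron(0 1 * * ? *)"},
)

# An execution can also be started on demand, outside the schedule.
datasync.start_task_execution(TaskArn=task["TaskArn"])
```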

Workings of AWS DataSync

We will use an architecture diagram to show how DataSync can transfer data between on-premises self-managed storage systems to AWS storage services and between AWS storage resources.

We will start with on-premises storage to AWS storage services.

Data transfer from on-premises to AWS storage services

The architecture in Figure 2.2 depicts the data transfer from on-premises to AWS storage resources:

Figure 2.2 – Data transfer from on-premises to AWS storage services using AWS DataSync

The DataSync agent