Pretrain Vision and Large Language Models in Python

Emily Webber

Description

Master the art of training vision and large language models with conceptual fundamentals and industry-expert guidance. Learn about AWS services and design patterns, with relevant coding examples.


Key Features


Learn to develop, train, tune, and apply foundation models with optimized end-to-end pipelines


Explore large-scale distributed training for models and datasets with AWS and SageMaker examples


Evaluate, deploy, and operationalize your custom models with bias detection and pipeline monitoring


Book Description


Foundation models have forever changed machine learning. From BERT to ChatGPT, CLIP to Stable Diffusion, when billions of parameters are combined with large datasets and hundreds to thousands of GPUs, the result is nothing short of record-breaking. The recommendations, advice, and code samples in this book will help you pretrain and fine-tune your own foundation models from scratch on AWS and Amazon SageMaker, while applying them to hundreds of use cases across your organization.


With advice from seasoned AWS and machine learning expert Emily Webber, this book helps you learn everything you need to go from project ideation to dataset preparation, training, evaluation, and deployment for large language, vision, and multimodal models. With step-by-step explanations of essential concepts and practical examples, you’ll go from mastering the concept of pretraining to preparing your dataset and model, configuring your environment, training, fine-tuning, evaluating, deploying, and optimizing your foundation models.


You will learn how to apply the scaling laws, distribute your model and dataset over multiple GPUs, remove bias, achieve high throughput, and build deployment pipelines.


By the end of this book, you’ll be well equipped to embark on your own project to pretrain and fine-tune the foundation models of the future.


What you will learn


Find the right use cases and datasets for pretraining and fine-tuning


Prepare for large-scale training with custom accelerators and GPUs


Configure environments on AWS and SageMaker to maximize performance


Select hyperparameters based on your model and constraints


Distribute your model and dataset using many types of parallelism


Avoid pitfalls with job restarts, intermittent health checks, and more


Evaluate your model with quantitative and qualitative insights


Deploy your models with runtime improvements and monitoring pipelines


Who this book is for


If you’re a machine learning researcher or enthusiast who wants to start a foundation modelling project, this book is for you. Applied scientists, data scientists, machine learning engineers, solution architects, product managers, and students will all benefit from this book. Intermediate Python is a must, along with introductory concepts of cloud computing. A strong understanding of deep learning fundamentals is needed, while advanced topics will be explained. The content covers advanced machine learning and cloud techniques, explaining them in an actionable, easy-to-understand way.

You can read the e-book in Legimi apps or any app that supports the following format:

EPUB

Page count: 425

Publication year: 2023




Pretrain Vision and Large Language Models in Python

End-to-end techniques for building and deploying foundation models on AWS

Emily Webber

BIRMINGHAM—MUMBAI

Pretrain Vision and Large Language Models in Python

Copyright © 2023 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Group Product Manager: Ali Abidi

Publishing Product Manager: Dhruv Jagdish Kataria

Content Development Editor: Priyanka Soam

Technical Editor: Rahul Limbachiya

Copy Editor: Safis Editing

Project Coordinator: Farheen Fathima

Proofreader: Safis Editing

Indexer: Subalakshmi Govindhan

Production Designer: Ponraj Dhandapani

Marketing Coordinator: Shifa Ansari and Vinishka Kalra 

First published: May 2023

Production reference: 1250523

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham

B3 2PB, UK.

ISBN 978-1-80461-825-7

www.packtpub.com

To the beginner's mind in us all. May we all have the humility and the courage to face the unknown with curious minds.

- Emily Webber

Foreword

Welcome to the remarkable world of Machine Learning and foundation models! It has been rare that a new technology has taken the world by storm like these models have. These extraordinary creations have revolutionized the way we interact with technology and have opened up unprecedented opportunities for innovation and discovery.

In this book, penned with clarity and a genuine passion for the subject, Emily invites you on a journey. Whether you are a first-time user or someone seeking a fresh perspective on these fascinating tools and practical applications, you will find this book a valuable companion.

The field of ML can be daunting - complex algorithms, intricate mathematical formulas, and technical jargon. Yet Emily skillfully navigates through this intricacy, distilling the fundamentals into accessible and easy-to-understand language. Emily’s writing style takes you not only through the "how" but also the "why" behind these advancements. The clear examples, and the grounding in the very practical platform offered by Amazon SageMaker, offer you the ability to follow along and learn by doing. Fields such as dataset preparation, pretraining, fine-tuning, deployment, bias detection, and ML operations are covered in a rare soup-to-nuts, deep-yet-readable description.

Ultimately, this book is a testament to Emily's passion for sharing knowledge and empowering others. By the time you reach the final page, you will have gained a solid foundation in the world of large language models, armed with the confidence to embark on your own exciting experiments and projects.

So, without further ado, let us embark on this captivating journey through the world of large language models. Join Emily as she illuminates the way, inspiring you to embrace the power of these remarkable creations and harness their potential to reshape the future.

Thank you, my friend, for the time, effort and - ultimately - love you put into this.

Andrea Olgiati

Chief Engineer, Amazon SageMaker

Santa Clara, May 2023

Contributors

About the author

Emily Webber is a principal machine learning (ML) specialist solutions architect at Amazon Web Services (AWS). In her more than five years at AWS, she has assisted hundreds of customers on their journey to ML in the cloud, specializing in distributed training for large language and vision models. She mentors ML solution architects, authors countless feature designs for SageMaker and AWS, and guides the Amazon SageMaker product and engineering teams on best practices in regard to ML and customers. Emily is widely known in the AWS community for a 16-video YouTube series (https://www.youtube.com/playlist?list=PLhr1KZpdzukcOr_6j_zmSrvYnLUtgqsZz) featuring SageMaker with 211,000 views, and for giving a keynote speech at O’Reilly AI London 2019 on a novel reinforcement learning approach she developed for public policy.

Acknowledgment

It seems almost impossible to thank all of the talented, passionate, hardworking, and kind people who have helped me in my journey to this very moment. And yet I would be remiss if I didn't try!

To the Packt team, including Dhruv Kataria, Priyanka Soam, Aparna Ravikumar Nair, and Hemangi Lotlikar, thank you so much for all of your enthusiasm, careful checks, and for believing in me from the start. Ten years ago I would have never thought I'd be writing a book on artificial intelligence, and it truly took the whole village of us to pull it off.

There are quite literally hundreds of Amazonians I would like to thank for supporting me and the content of this book. This is due to services they've scoped and developed, customers they've onboarded and businesses they've built, design patterns they've developed, techniques they've perfected, the content they've built, and so much more. In particular, I'd like to call out Nav Bhasin, Mani Khanuja, Mark Roy, Shelbee Eigenbrode, Dhawal Patel, Kanwaljit Khurmi, Farooq Sabir, Sean Morgan, and all of the rest of the wonderful ML SAs I love working with every day. There are so many other teams at Amazon whose passion has helped create this work: from engineering to product, business development to marketing, all of my friends in the field, documentation, start-ups, public sector, and across the world, what an honor it is to make this dream come true with you.

To my customers, you truly bring everything to life! Whether we shared an immersion day, a workshop, a call, a talk, a blog post, or a Slack channel, it is such an honor to serve as your technology partner in so many ways. We truly do wake up every day thinking about how to optimize for your outcomes!

I spent quite a few days working on this book during the Winter Conference on Applications of Computer Vision (WACV) in Hawaii, in January 2023. This was during a workshop I'd hosted (https://sites.google.com/view/wacv2023-workshop) with Fouad Bousetouane, Corey Barrett, Mani Khanuja, Larry Davis, and more. To all of my friends at WACV, it was such a perfect place to hash out ideas and misconceptions with you; thanks for all of your support!

Finally, I'd love to thank my loving husband and the rest of my family for all of their support throughout the winding road that's been my life. Who would have guessed that some crazy girl from Pennsylvania would be writing books on AI from Amazon? Not me!

To my readers, you make this all possible. I hope you'll consider reaching out to me and connecting! I'm live on Twitch every Monday, so you can always hop on for a question or a chat. For the future, stay tuned. I have a lot more words in me yet to write.

About the reviewer

Falk Pollok is a senior software engineer in IBM Research Europe and a senior RSE for the MIT-IBM Watson AI Lab, specializing in foundation models and multimodal question answering. Falk was a member of the MIT-Harvard-Stanford team on the Defense Advanced Research Projects Agency (DARPA) Machine Common Sense (MCS) project, contributed to IBM Watson Core, Orchestrate, and ML, was the lead developer for IBM Sapphire, and founded IBM’s Engineering Excellence program. He holds a master’s degree in computer science from RWTH Aachen, leadership certificates from Cornell, and IBM’s highest developer profession rank. Moreover, he has published eight papers in top conferences such as NeurIPS, AAAI, and Middleware, has two patents, was named a Face of IBM Research, and received multiple awards, including IBM’s OTA and InfoWorld’s Open Source Software Awards (BOSSIE) award.

Table of Contents

Preface

Part 1: Before Pretraining

1

An Introduction to Pretraining Foundation Models

The art of pretraining and fine-tuning

The Transformer model architecture and self-attention

State-of-the-art vision and language models

Top vision models as of April 2023

Contrastive pretraining and natural language supervision

Top language models as of April 2023

Language technique spotlight – causal modeling and the scaling laws

Encoders and decoders

Summary

References

2

Dataset Preparation: Part One

Finding a dataset and use case for foundation modeling

Top pretraining use cases by industry

Delta – how different is your dataset?

Use the scaling laws to size your datasets

Fundamentals – scaling laws of neural language models

Bias detection and mitigation

Enhancing your dataset – multilingual, multimodal, and augmentations

Summary

References

3

Model Preparation

Finding your best base model

Starting with the smallest base model you can

Trade-off – simplicity versus complexity

Finding your pretraining loss function

Pretraining loss functions in vision – ViT and CoCa

Pretraining loss functions in language – Alexa Teacher Model

Changing your pretraining loss function

Solving for your model size

Practical approaches to solving for your model size

Not all scaling laws are created equal

Planning future experiments

Summary

References

Part 2: Configure Your Environment

4

Containers and Accelerators on the Cloud

What are accelerators and why do they matter?

Getting ready to use your accelerators

How to use accelerators on AWS – Amazon SageMaker

Optimizing accelerator performance

Hyperparameters

Infrastructure optimizations for accelerators on AWS

Troubleshooting accelerator performance

Summary

References

5

Distribution Fundamentals

Understanding key concepts – data and model parallelism

What data parallel is all about

What model parallel is all about

Combining model and data parallel

Distributed training on Amazon SageMaker

Distributed training software

SM DDP

SMP library

Advanced techniques to reduce GPU memory

Tensor parallelism

Optimizer state sharding

Activation checkpointing

Sharded data parallelism

Bringing it all home with examples from models today

Stable Diffusion – data parallelism at scale

GPT-3 – model and data parallelism at scale

Summary

References

6

Dataset Preparation: Part Two, the Data Loader

Introducing the data loader in Python

Building and testing your own data loader – a case study from Stable Diffusion

Creating embeddings – tokenizers and other key steps for smart features

Optimizing your data pipeline on Amazon SageMaker

Transforming deep learning datasets at scale on AWS

Summary

References

Part 3: Train Your Model

7

Finding the Right Hyperparameters

Hyperparameters – batch size, learning rate, and more

Key hyperparameters in vision and language

Tuning strategies

Hyperparameter tuning for foundation models

Scaling up as a function of world size with SageMaker

Tuning on a sample of your data and updating based on world size

Summary

References

8

Large-Scale Training on SageMaker

Optimizing your script for SageMaker training

Importing packages

Argument parsing

Top usability features for SageMaker training

Warm pools for rapid experimentation

SSM and SSH into training instances

Track jobs and experiments to replicate results

Summary

References

9

Advanced Training Concepts

Evaluating and improving throughput

Calculating model TFLOPS

Using Flash Attention to speed up your training runs

Speeding up your jobs with compilation

Integrating compilation into your PyTorch scripts

Amazon SageMaker Training Compiler and Neo

Best practices for compilation

Running compiled models on Amazon’s Trainium and Inferentia custom hardware

Solving for an optimal training time

Summary

References

Part 4: Evaluate Your Model

10

Fine-Tuning and Evaluating

Fine-tuning for language, text, and everything in between

Fine-tuning a language-only model

Fine-tuning vision-only models

Fine-tuning vision-language models

Evaluating foundation models

Model evaluation metrics for vision

Model evaluation metrics in language

Model evaluation metrics in joint vision-language tasks

Incorporating the human perspective with labeling through SageMaker Ground Truth

Reinforcement learning from human feedback

Summary

References

11

Detecting, Mitigating, and Monitoring Bias

Detecting bias in ML models

Detecting bias in large vision and language models

Mitigating bias in vision and language models

Bias mitigation in language – counterfactual data augmentation and fair loss functions

Bias mitigation in vision – reducing correlation dependencies and solving sampling issues

Monitoring bias in ML models

Detecting, mitigating, and monitoring bias with SageMaker Clarify

Summary

References

12

How to Deploy Your Model

What is model deployment?

What is the best way to host my model?

Model deployment options on AWS with SageMaker

Why should I shrink my model, and how?

Model compilation

Knowledge distillation

Quantization

Hosting distributed models on SageMaker

Model servers and end-to-end hosting optimizations

Summary

References

Part 5: Deploy Your Model

13

Prompt Engineering

Prompt engineering – the art of getting more with less

From few- to zero-shot learning

Text-to-image prompt engineering tips

Image-to-image prompt engineering tips

Upscaling

Masking

Prompting for object-to-image with DreamBooth

Prompting large language models

Instruction fine-tuning

Chain-of-thought prompting

Summarization

Defending against prompt injections and jailbreaking

Advanced techniques – prefix and prompt tuning

Prefix tuning

Prompt tuning

Summary

References

14

MLOps for Vision and Language

What is MLOps?

Common MLOps pipelines

Continuous integration and continuous deployment

Model monitoring and human-in-the-loop

MLOps for foundation models

MLOps for vision

AWS offerings for MLOps

A quick introduction to SageMaker Pipelines

Summary

References

15

Future Trends in Pretraining Foundation Models

Techniques for building applications for LLMs

Building interactive dialogue apps with open-source stacks

Using RAG to ensure high accuracy in LLM applications

Is generation the new classification?

Human-centered design for building applications with LLMs

Other generative modalities

AWS offerings in foundation models

The future of foundation models

The future of pretraining

Summary

References

Index

Other Books You May Enjoy

Preface

So, you want to work with foundation models? That is an excellent place to begin! Many of us in the machine learning community have followed these curious creatures for years, from their earliest onset in the first days of the Transformer models, to their expansion in computer vision, to the near ubiquitous presence of text generation and interactive dialogue we see in the world today.

But where do foundation models come from? How do they work? What makes them tick, and when should you pretrain and fine-tune them? How can you eke out performance gains on your datasets and applications? How many accelerators do you need? What does an end-to-end application look like, and how can you use foundation models to master this new surge of interest in generative AI?

These pages hope to provide answers to these very important questions. As you are no doubt aware, the pace of innovation in this space is truly breathtaking, with more foundation models coming online every day from both open-source and proprietary model vendors. To grapple with this reality, I’ve tried to focus on the most important conceptual fundamentals throughout the book. This means your careful study here should pay off for at least a few more years ahead.

In terms of practical applications and guidance, I’ve overwhelmingly focused on cloud computing options available through AWS and especially Amazon SageMaker. I’ve very happily spent more than five years at AWS and enjoy sharing all of my knowledge and experience with you! Please do note that all thoughts and opinions shared in this book are my own and do not represent those of Amazon.

The following chapters focus on concepts, not code. This is because software changes rapidly, while fundamentals change very slowly. In the repository that accompanies the book, you’ll find links to my go-to resources for all of the key topics mentioned throughout these fifteen chapters, which you can use right away to get hands-on with everything you’re learning here. Starting July 1, 2023, you’ll also find in the repository a set of new pretraining and fine-tuning examples from yours truly to complete all of the topics.

You might find this hard to believe, but in my early twenties I wasn’t actually coding: I was exploring the life of a Buddhist monastic. I spent five years living at a meditation retreat center in Arizona, the Garchen Institute. During this time, I learned how to meditate, focus my mind, watch my emotions and develop virtuous habits. After my master’s degree at the University of Chicago years later, and now at Amazon, I can see that these traits are extremely useful in today’s world as well!

I mention this so that you can take heart. Machine learning, artificial intelligence, cloud computing, economics, application development, none of these topics are straightforward. But if you apply yourself, if you really stretch your mind to consider the core foundations of the topics at hand, if you keep yourself coming back to the challenge again and again, there’s truly nothing you can’t do. That is the beauty of humanity! And if a meditating yogi straight from the deep silence of a retreat hut can eventually learn what it takes to pretrain and fine-tune foundation models, then so can you!

With that in mind, let’s learn more about the book itself!

Note

Most of the concepts mentioned here will be accompanied by scripting examples in the repository starting July 1, 2023. However, to get you started even earlier, you can find a list of resources in the repository today with links to useful hands-on examples elsewhere for demonstration.

Who is this book for?

If you’re a machine learning researcher or enthusiast who wants to start a foundation modelling project, this book is for you. Applied scientists, data scientists, machine learning engineers, solution architects, product managers, and students will all benefit from this book. Intermediate Python is a must, along with introductory concepts of cloud computing. A strong understanding of deep learning fundamentals is needed, while advanced topics will be explained. The content covers advanced machine learning and cloud techniques, explaining them in an actionable, easy-to-understand way.

What this book covers

Chapter 1, An Introduction to Pretraining Foundation Models. In this chapter, you’ll be introduced to foundation models, the backbone of many artificial intelligence and machine learning systems today. We will dive into their creation process, also called pretraining, and understand where it’s competitive to improve the accuracy of your models. We will discuss the core transformer architecture underpinning state-of-the-art models like Stable Diffusion, BERT, Vision Transformers, CLIP, Flan-T5, and more. You will learn about the encoder and decoder frameworks that work to solve a variety of use cases.

Chapter 2, Dataset Preparation: Part One. In this chapter, we begin to discuss what you’ll need in your dataset to start a meaningful pretraining project. This is the first of two parts on dataset preparation. It opens with some business guidance on finding a good use case for foundation modeling, where the data become instrumental. Then, focusing on the content of your dataset, we use qualitative and quantitative measures to compare it with datasets used in pretraining other top models. You’ll learn how to use the scaling laws to determine if your datasets are “large enough” and “good enough” to boost accuracy while pretraining. We discuss bias identification and mitigation, along with multilingual and multimodal solutions.

Chapter 3, Model Preparation. In this chapter, you’ll learn how to pick which model will be most useful to serve as a basis for your pretraining regime. You’ll learn how to think about the size of the model in parameters, along with the key loss functions and how they determine performance in production. You’ll combine the scaling laws with your expected dataset size to select ceiling and basement model sizes, which you’ll use to guide your experiments.

Chapter 4, Containers and Accelerators on the Cloud. In this chapter, you’ll learn how to containerize your scripts and optimize them for accelerators on the cloud. We’ll learn about a range of accelerators for foundation models, including trade-offs around cost and performance across the entire machine learning lifecycle. You’ll learn key aspects of Amazon SageMaker and AWS to train models on accelerators, optimize performance, and troubleshoot common issues. If you’re already familiar with using accelerators on AWS, feel free to skip this chapter.

Chapter 5, Distribution Fundamentals. In this chapter, you’ll learn conceptual fundamentals for the distribution techniques you need to employ for large-scale pretraining and fine-tuning. First, you’ll master top distribution concepts for machine learning, notably model and data parallel. Then, you’ll learn how Amazon SageMaker integrates with distribution software to run your job on as many GPUs as you need. You’ll learn how to optimize model and data parallel for large-scale training, especially with techniques like sharded data parallelism. Then, you’ll learn how to reduce your memory consumption with advanced techniques like optimizer state sharding, activation checkpointing, compilation, and more. Lastly, we’ll look at a few examples across language, vision, and more to bring all of these concepts together.

Chapter 6, Dataset Preparation: Part Two, the Data Loader. In this chapter, you’ll learn how to prepare your dataset for immediate use with your chosen models. You’ll master the concept of a data loader, knowing why it’s a common source of error in training large models. You’ll learn about creating embeddings, using tokenizers and other methods to featurize your raw data for your preferred neural network. Following these steps, you’ll be able to prepare your entire dataset, using methods for both vision and language. Finally, you’ll learn about data optimizations on AWS and Amazon SageMaker to efficiently send datasets large and small to your training cluster. Throughout this chapter, we’ll work backward from the training loop, incrementally giving you all the steps you need to have functional deep neural networks training at scale. You’ll also follow a case study of how I trained on 10 TB of data for Stable Diffusion on SageMaker!

Chapter 7, Finding the Right Hyperparameters. In this chapter, you’ll dive into the key hyperparameters that govern performance for top vision and language models, such as batch size, learning rate, and more. First, we’ll start with a quick overview of hyperparameter tuning for those who are new or need a light refresher, including key examples in vision and language. Then we’ll explore hyperparameter tuning in foundation models, both what is possible today and where trends might emerge. Finally, we’ll learn how to do this on Amazon SageMaker, taking incremental steps up in cluster size and changing each hyperparameter as we do.

Chapter 8, Large-Scale Training on SageMaker. In this chapter, we cover key features and functionality available with Amazon SageMaker for running highly optimized distributed training. You’ll learn how to optimize your script for SageMaker training, along with key usability features. You’ll also learn about backend optimizations for distributed training with SageMaker, like GPU health checks, resilient training, checkpointing, script mode, and more.

Chapter 9, Advanced Training Concepts. In this chapter, we will cover advanced training concepts at scale, like evaluating throughput, calculating model TFLOPS per device, compilation, and using the scaling laws to determine the right length of training time. In the previous chapter, you learned how to do large-scale training on SageMaker generally speaking. In this chapter, you’ll learn about particularly complex and sophisticated techniques you can use to drive down the overall cost of your job. This lower cost directly translates to higher model performance, because it means you can train for longer on the same budget.

Chapter 10, Fine-Tuning and Evaluating. In this chapter, you’ll learn how to fine-tune your model on use case-specific datasets, comparing its performance to that of off-the-shelf public models. You should be able to see a quantitative and qualitative boost from your pretraining regime. You’ll dive into some examples from language, text, and everything in between. You’ll also learn how to think about and design a human-in-the-loop evaluation system, including the same RLHF that makes ChatGPT tick! This chapter focuses on updating the trainable weights of the model. For techniques that mimic learning but don’t update the weights, such as prompt tuning and standard retrieval augmented generation, see Chapter 13 on prompt engineering or Chapter 15 on future trends.

Chapter 11, Detecting, Mitigating, and Monitoring Bias. In this chapter, we’ll analyze leading bias identification and mitigation strategies for large vision, language, and multimodal models. You’ll learn about the concept of bias, both in a statistical sense and how it impacts human beings in critical ways. You’ll understand key ways to quantify and remedy this in vision and language models, eventually landing on monitoring strategies that enable you to reduce any and all forms of harm when applying your foundation models.

Chapter 12, How to Deploy Your Model. In this chapter, we’ll introduce you to a variety of techniques for deploying your model, including real-time endpoints, serverless, batch options, and more. These concepts apply to many compute environments, but we’ll focus on capabilities available on AWS within Amazon SageMaker. We’ll talk about why you should try to shrink the size of your model before deploying, along with techniques for this across vision and language. We’ll also cover distributed hosting techniques, for scenarios when you can’t or don’t need to shrink your model. Lastly, we’ll explore model serving techniques and concepts that can help you optimize the end-to-end performance of your model.

Chapter 13, Prompt Engineering. In this chapter, we’ll dive into a special set of techniques called prompt engineering. You’ll learn about this technique at a high level, including how it is similar to and different from other learning-based topics throughout this book. We’ll explore examples across vision and language, and dive into key terms and success metrics. In particular, this chapter covers all of the tips and tricks for improving performance without updating the model weights. This means we’ll be mimicking the learning process, without necessarily changing any of the model parameters. This includes some advanced techniques like prompt and prefix tuning.

Chapter 14, MLOps for Vision and Language. In this chapter, we’ll introduce core concepts of operations and orchestration for machine learning, also known as MLOps. This includes building pipelines, continuous integration and deployment, promotion through environments, and more. We’ll explore options for monitoring and human-in-the-loop auditing of model predictions. We’ll also identify unique ways to support large vision and language models in your MLOps pipelines.

Chapter 15, Future Trends in Pretraining Foundation Models. In this chapter, we’ll close out the book by pointing to where trends are headed for all relevant topics presented in this book. We’ll explore trends in foundation model application development, like using LangChain to build interactive dialogue applications, along with techniques like retrieval augmented generation to reduce LLM hallucination. We’ll explore ways to use generative models to solve classification tasks, human-centered design, and other generative modalities like code, music, product documentation, PowerPoints, and more! We’ll talk through AWS offerings like SageMaker JumpStart Foundation Models, Amazon Bedrock, Amazon Titan, and Amazon CodeWhisperer, and top trends in the future of foundation models and pretraining itself.

To get the most out of this book

As mentioned earlier, you’ll want to be very comfortable with Python development to absolutely maximize your time with this book. The pages don’t spend a lot of time focusing on the software, but again, everything in the GitHub repository is Python. If you’re already using a few key AWS services, like Amazon SageMaker, S3 buckets, ECR images, and FSx for Lustre, that will speed you up tremendously in applying what you’ve learned here. If you’re new to these, that’s OK; we’ll include introductions to each of them.

AWS Service or Open-source software framework | What we’re using it for
Amazon SageMaker | Studio, notebook instances, training jobs, endpoints, pipelines
S3 buckets | Storing objects and retrieving metadata
Elastic Container Registry | Storing Docker images
FSx for Lustre | Storing large-scale data for model training loops
Python | General scripting: managing and interacting with services, importing other packages, cleaning your data, defining your model training and evaluation loops, etc.
PyTorch and TensorFlow | Deep learning frameworks to define your neural networks
Hugging Face | Hub with more than 100,000 open-source pretrained models and countless extremely useful and reliable methods for NLP and, increasingly, CV
Pandas | Go-to library for data analysis
Docker | Open-source framework for building and managing containers
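
If it helps to see how these pieces connect before we begin, here is a minimal, hedged sketch of launching a training job with the SageMaker Python SDK. Treat it purely as an illustration: the script name, instance type, framework versions, and S3 path are placeholders, not the book’s repository code.

import sagemaker
from sagemaker.pytorch import PyTorch

role = sagemaker.get_execution_role()   # assumes you're running inside SageMaker

estimator = PyTorch(
    entry_point="train.py",             # your own training script
    role=role,
    instance_count=1,
    instance_type="ml.g5.2xlarge",      # a single GPU instance; scale up as needed
    framework_version="2.0",
    py_version="py310",
)

# Training data staged in an S3 bucket; SageMaker makes it available to the training container
estimator.fit({"train": "s3://your-bucket/path/to/training-data/"})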

If you are using the digital version of this book, we advise you to access the code from the book’s GitHub repository (a link is available in the next section), step through the examples, and type the code yourself. Doing so will help you avoid any potential errors related to the copying and pasting of code.

Download the example code files

You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Pretrain-Vision-and-Large-Language-Models-in-Python. If there’s an update to the code, it will be updated in the GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Conventions used

There are a number of text conventions used throughout this book.

Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: “Mount the downloaded WebStorm-10*.dmg disk image file as another disk in your system.”

A block of code is set as follows:

html, body, #map { height: 100%; margin: 0; padding: 0 }

Bold: Indicates a new term, an important word, or words that you see onscreen. For instance, words in menus or dialog boxes appear in bold. Here is an example: “Select System info from the Administration panel.”

Tips or important notes

Appear like this.

Get in touch

Feedback from our readers is always welcome.

General feedback: If you have questions about any aspect of this book, email us at [email protected] and mention the book title in the subject of your message.

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata and fill in the form.

Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Share Your Thoughts

Once you’ve read Pretrain Vision and Large Language Models in Python, we’d love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.

Your review is important to us and the tech community and will help us make sure we’re delivering excellent quality content.

Download a free PDF copy of this book

Thanks for purchasing this book!

Do you like to read on the go but are unable to carry your print books everywhere?

Is your eBook purchase not compatible with the device of your choice?

Don’t worry, now with every Packt book you get a DRM-free PDF version of that book at no cost.

Read anywhere, any place, on any device. Search, copy, and paste code from your favorite technical books directly into your application.

The perks don’t stop there; you can get exclusive access to discounts, newsletters, and great free content in your inbox daily.

Follow these simple steps to get the benefits:

Scan the QR code or visit the link below

https://packt.link/free-ebook/9781804618257

Submit your proof of purchase

That’s it! We’ll send your free PDF and other benefits to your email directly

Part 1: Before Pretraining

In part 1, you’ll learn how to get ready to pretrain a large vision and/or language model, including dataset and model preparation.

This section has the following chapters:

Chapter 1, An Introduction to Pretraining Foundation Models

Chapter 2, Dataset Preparation: Part One

Chapter 3, Model Preparation

1

An Introduction to Pretraining Foundation Models

The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin … The only thing that matters in the long run is the leveraging of computation.

– Richard Sutton, “The Bitter Lesson,” 2019 (1)

In this chapter, you’ll be introduced to foundation models, the backbone of many artificial intelligence and machine learning systems today. In particular, we will dive into their creation process, also called pretraining, and understand where it’s competitive to improve the accuracy of your models. We will discuss the core transformer architecture underpinning state-of-the-art models such as Stable Diffusion, BERT, Vision Transformers, OpenChatKit, CLIP, Flan-T5, and more. You will learn about the encoder and decoder frameworks, which work to solve a variety of use cases.

In this chapter, we will cover the following topics:

The art of pretraining and fine-tuning

The Transformer model architecture

State-of-the-art vision and language models

Encoders and decoders

The art of pretraining and fine-tuning

Humanity is one of Earth’s most interesting creatures. We are capable of producing the greatest of beauty and asking the most profound questions, and yet fundamental aspects about us are, in many cases, largely unknown. What exactly is consciousness? What is the human mind, and where does it reside? What does it mean to be human, and how do humans learn?

While scientists, artists, and thinkers from countless disciplines grapple with these complex questions, the field of computation marches forward to replicate (and in some cases, surpass) human intelligence. Today, applications from self-driving cars to writing screenplays, search engines, and question-answering systems have one thing in common – they all use a model, and sometimes many different kinds of models. Where do these models come from, how do they acquire intelligence, and what steps can we take to apply them for maximum impact?

Foundation models are essentially compact representations of massive sets of data. The representation comes about through applying a pretraining objective onto the dataset, from predicting masked tokens to completing sentences. Foundation models are useful because once they have been created, through the process called pretraining, they can either be deployed directly or fine-tuned for a downstream task. An example of a foundation model deployed directly is Stable Diffusion, which was pretrained on billions of image-text pairs and generates useful images from text immediately after pretraining. An example of a fine-tuned foundation model is BERT, which was pretrained on large language datasets, but is most useful when adapted for a downstream domain, such as classification.

When applied in natural language processing, these models can complete sentences, classify text into different categories, produce summarizations, answer questions, do basic math, and generate creative artifacts such as poems and titles. In computer vision, foundation models are useful everywhere from image classification to generation, pose estimation to object detection, pixel mapping, and more.

This comes about through defining a pretraining objective, which we’ll learn about in detail in this book. We’ll also cover its peer method, fine-tuning, which helps the model learn more about a specific domain. This more generally falls under the category of transfer learning, the practice of taking a pretrained neural network and supplying it with a novel dataset with the hope of enhancing its knowledge in a certain dimension. In both vision and language, these terms have some overlap and some clear distinctions, but don’t worry; we’ll cover them more throughout the chapters. I’m using the term fine-tuning to include the whole set of techniques to adapt a model to another domain, outside of the one where it was trained, not in the narrow, classic sense of the term.

Fundamentals – pretraining objectives

The heart of large-scale pretraining revolves around this core concept. A pretraining objective is a method that leverages information readily available in the dataset without requiring extensive human labeling. Some pretraining objectives involve masking, providing a unique [MASK] token in place of certain words, and training the model to fill in those words. Others take a different route, using the left-hand side of a given text string to attempt to generate the right-hand side.

The training process happens through a forward pass, sending your raw training data through the neural network to produce some output word. The loss function then computes the difference between this predicted word and the one found in the data. This difference between the predicted values and the actual values then serves as the basis for the backward pass. The backward pass itself usually leverages a type of stochastic gradient descent to update the parameters of the neural network with respect to that same loss function, ensuring that, next time around, the model is more likely to produce a lower loss.

In the case of BERT(2), the pretraining objective is called a masked token loss. For generative textual models of the GPT (3) variety, the pretraining objective is called causal language loss. Another way of thinking about this entire process is self-supervised learning, utilizing content already available in a dataset to serve as a signal to the model. In computer vision, you’ll also see this referred to as a pretext task. More on state-of-the-art models in the sections ahead!
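
To make the forward and backward passes concrete, the following is a minimal sketch of a single training step against a masked token objective, written with PyTorch and the Hugging Face transformers library. The model name, sample sentence, and learning rate are illustrative choices for demonstration only, not the book’s official example.

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Hide one word behind the [MASK] token; the original sentence supplies the label
masked = tokenizer("The quick brown [MASK] jumps over the lazy dog", return_tensors="pt")
labels = tokenizer("The quick brown fox jumps over the lazy dog", return_tensors="pt")["input_ids"]
# Score only the masked position; -100 tells the loss to ignore every other token
labels[masked["input_ids"] != tokenizer.mask_token_id] = -100

# Forward pass: predict the hidden word and compute the masked token loss
outputs = model(**masked, labels=labels)
loss = outputs.loss

# Backward pass: a stochastic gradient descent variant (AdamW here) updates the
# parameters so that the correct word becomes more likely next time around
loss.backward()
optimizer.step()
optimizer.zero_grad()
print(f"masked token loss: {loss.item():.3f}")

A causal language loss follows the same loop, except the labels are simply the input sequence itself, predicted one position to the left of each token.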

Personally, I think pretraining is one of the most exciting developments in machine learning research. Why? Because, as Richard Sutton suggests controversially at the start of the chapter, it’s computationally efficient. Using pretraining, you can build a model from massive troves of information available on the internet, then combine all of this knowledge using your own proprietary data and apply it to as many applications as you can dream of. On top of that, pretraining opens the door for tremendous collaboration across company, country, language, and domain lines. The industry is truly just getting started in developing, perfecting, and exploiting the pretraining paradigm.

We know that pretraining is interesting and effective, but where is it competitive in its own right? Pretraining your own model is useful when your own proprietary dataset is very large and different from common research datasets, and primarily unlabeled. Most of the models we will learn about in this book are trained on similar corpora – Wikipedia, social media, books, and popular internet sites. Many of them focus on the English language, and few of them consciously use the rich interaction between visual and textual data. Throughout the book, we will learn about the nuances and different advantages of selecting and perfecting your pretraining strategies.

If your business or research hypothesis hinges on non-standard natural languages, such as financial or legal terminology, non-English languages, or rich knowledge from another domain, you may want to consider pretraining your own model from scratch. The core question you want to ask yourself is, How valuable is an extra one percentage point of accuracy in my model? If you do not know the answer to this question, then I strongly recommend spending some time getting yourself to an answer. We will spend time discussing how to do this in Chapter 2. Once you can confidently say an increase in the accuracy of my model is worth at least a few hundred thousand dollars, and even possibly a few million, then you are ready to begin pretraining your own model.

Now that we have learned about foundation models, how they come about through a process called pretraining, and how to adapt them to a specific domain through fine-tuning, let’s learn more about the Transformer model architecture.

The Transformer model architecture and self-attention

The Transformer model, presented in the now-famous 2017 paper Attention is all you need, marked a turning point for the machine learning industry. This is primarily because it used an existing mathematical technique, self-attention, to solve problems in NLP related to sequences. The Transformer certainly wasn’t the first attempt at modeling sequences; previously, recurrent neural networks (RNNs) and even convolutional neural networks (CNNs) were popular in language.

However, the Transformer made headlines because its training cost was a small fraction of that of existing techniques. This is because the Transformer is fundamentally easier to parallelize, due to its core self-attention process, than previous techniques. It also set new world records in machine translation. The original Transformer used both an encoder and a decoder, techniques we will dive into later in this chapter. This joint encoder-decoder pattern was followed directly by other models focused on similar text-to-text tasks, such as T5.

In 2018, Alec Radford and his team presented Generative Pretrained Transformers, a method inspired by the 2017 Transformer, but using only the decoder. Called GPT, this model handled large-scale unsupervised pretraining well, and it was paired with supervised fine-tuning to perform well on downstream tasks. As we mentioned previously, this causal language modeling technique optimizes the log probability of tokens, giving us a left-to-right ability to find the most probable word in a sequence.

In 2019, Jacob Devlin and his team presented BERT: Pretraining of Deep Bidirectional Transformers. BERT also adopted the pretraining and fine-tuning paradigm, but implemented a masked language modeling loss function that helped the model learn the impact of the tokens both before and after a given token. This proved useful in disambiguating the meaning of words in different contexts and has aided encoder-only tasks such as classification ever since.

Despite their names, neither GPT nor BERT uses the full encoder-decoder as presented in the original Transformer paper but instead leverages the self-attention mechanism as a core step throughout the learning process. Thus, it is in fact the self-attention process we should understand.

First, remember that each word, or token, is represented as an embedding. This embedding is created simply by using a tokenizer, a pretrained data object for each model that maps the word to its appropriate dense vector. Once we have the embedding per token, we use learnable weights to generate three new vectors: key, query, and value. We then use matrix multiplication and a few steps to interact with the key and the query, using the value at the very end to determine what was most informative in the sequence overall. Throughout the training loop, we update these weights to get better and better interactions, as determined by your pretraining objective.
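
As a rough illustration of this description, here is a minimal single-head self-attention sketch in PyTorch. The embedding size and random inputs are placeholders; real implementations add multi-head splitting, dropout, masking, and other refinements.

import math
import torch
import torch.nn as nn

embed_dim = 64                      # size of each token embedding
x = torch.randn(1, 10, embed_dim)   # one sequence of 10 token embeddings

# Learnable weights that project each embedding into query, key, and value vectors
to_query = nn.Linear(embed_dim, embed_dim, bias=False)
to_key = nn.Linear(embed_dim, embed_dim, bias=False)
to_value = nn.Linear(embed_dim, embed_dim, bias=False)
q, k, v = to_query(x), to_key(x), to_value(x)

# Interact the queries with the keys through matrix multiplication; each score says
# how informative one token is to another, and scaling keeps the softmax stable
scores = q @ k.transpose(-2, -1) / math.sqrt(embed_dim)
attention = torch.softmax(scores, dim=-1)

# Use the values, weighted by attention, to build the output representation
output = attention @ v
print(output.shape)   # torch.Size([1, 10, 64])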

Your pretraining objective serves as a directional guide for how to update the model parameters. Said another way, your pretraining objective provides the primary signal to your stochastic gradient descent updating procedure, changing the weights of your model based on how incorrect your model predictions are. When you train for long periods of time, the parameters should reflect a decrease in loss, giving you an overall increase in accuracy.

Interestingly, the type of transformer head will change slightly based on the different types of pretraining objectives you’re using. For example, a normal self-attention block uses information from both the left- and right-hand sides of a token to predict it. This is to provide the most informative contextual information for the prediction and is useful in masked language modeling. In practice, the self-attention heads are stacked to operate on full matrices of embeddings, giving us multi-head attention. Causal language modeling, however, uses a different type of attention head: masked self-attention. This limits the scope of predictive information to only the left-hand side of the matrix, forcing the model to learn a left-to-right procedure. This is in contrast to the more traditional self-attention, which has access to both the left and right sides of the sequence to make predictions.
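
To see that difference in code, here is a small sketch of a causal mask applied to stand-in attention scores like those from the previous snippet; the sequence length and random scores are placeholders. Each position is simply blocked from seeing anything to its right before the softmax.

import torch

seq_len = 10
scores = torch.randn(1, seq_len, seq_len)   # stand-in attention scores (query x key)

# Upper-triangular mask: True wherever a position would peek at a later token
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

# Setting those scores to -inf gives them zero weight after the softmax,
# which forces the left-to-right behavior used in causal language modeling
masked_scores = scores.masked_fill(causal_mask, float("-inf"))
weights = torch.softmax(masked_scores, dim=-1)
print(weights[0, 3])   # token 3 attends only to positions 0 through 3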

Most of the time, in practice, and certainly throughout this book, you won’t need to code any transformers or self-attention heads from scratch. We will, however, be diving into many model architectures, so it’s helpful to have this conceptual knowledge as a base.

From an intuitive perspective, what you’ll need to understand about transformers and self-attention is threefold:

The transformer itself is a model entirely built upon a self-attention function: The self-attention function takes a set of inputs, such as embeddings, and performs mathematical operations to combine these. When combined with token (word or subword) masking, the model can effectively learn how significant certain parts of the embeddings, or the sequence, are to the other parts. This is the meaning of self-attention; the model is trying to understand which parts of the input dataset are most relevant to the other parts.

Transformers perform exceedingly well using sequences: Most of the benchmarks they’ve blown past in recent years are from NLP, for a good reason. The pretraining objectives for these include token masking and sequence completion, both of which rely on not just individual data points but the stringing of them together, and their combination. This is good news for those of you who already work with sequential data and an interesting challenge for those who don’t.

Transformers operate very well at large scales: The underlying attention head is easily parallelizable, which gives it a strong leg up relative to other candidate sequence-based neural network architectures such as RNNs, including Long Short-Term Memory (LSTM) based networks. The self-attention head can be set to trainable in the case of pretraining, or untrainable in the case of fine-tuning. When attempting to actually train the self-attention heads, as we’ll do throughout this book, the best performance you’ll see is when the transformers are applied on large datasets. How large they need to be, and what trade-offs you can make when electing to fine-tune or pretrain, is the subject of future chapters.

Transformers are not the only means of pretraining. As we’ll see throughout the next section, there are many different types of models, particularly in vision and multimodal cases, which can deliver state-of-the-art performance.

State-of-the-art vision and language models

If you’re new to machine learning, then there is a key concept you will eventually want to master: state of the art. As you are aware, there are many different types of machine learning tasks, such as object detection, semantic segmentation, pose detection, text classification, and question answering. For each of these, there are many different research datasets. Each of these datasets provides labels, frequently for train, test, and validation splits. The datasets tend to be hosted by academic institutions, and each of these is purpose-built to train machine learning models that solve each of these types of problems.

When releasing a new dataset, researchers will frequently also release a new model that has been trained on the train set, tuned on the validation set, and separately evaluated on the test set. Their evaluation score on a new test set establishes a new state of the art for this specific type of modeling problem. When publishing certain types of papers, researchers will frequently try to improve performance in this area – for example, by trying to increase accuracy by a few percentage points on a handful of datasets.

The reason state-of-the-art performance matters for you is that it is a strong indication of how well your model is likely to perform in the best possible scenario. It isn’t easy to replicate most research results, and frequently, labs will have developed special techniques to improve performance that may not be easily observed and replicated by others. This is especially true when datasets and code repositories aren’t shared publicly, as is the case with GPT-3. This is acutely true when training methods aren’t disclosed, as with GPT-4.

However, given sufficient resources, it is possible to achieve similar performance as reported in top papers. An excellent place to find state-of-the-art performance at any given point in time is an excellent website, Papers With Code, maintained by Meta and enhanced by the community. By using this free tool, you can easily find top papers, datasets, models, and GitHub sites with example code. Additionally, they have great historical views, so you can see how the top models in different datasets have evolved over time.

In later chapters on preparing datasets and picking models, we’ll go into more detail on how to find the right examples for you, including how to determine how similar to and different from your own goals they are. Later in the book, we’ll also help you determine the optimal models and sizes for them. Right now, let’s look at some models that, as of this writing, are currently sitting at the top of their respective leaderboards.

Top vision models as of April 2023

First, let’s take a quick look at the models performing the best today within image tasks such as classification and generation.

Dataset | Best model | From Transformer | Performance
ImageNet | Basic-L (Lion fine-tuned) | Yes | 91.10% top-1 accuracy
CIFAR-10 | ViT-H/14 (1) | Yes | 99.5% correct
COCO | InternImage-H (M3I Pre-training: https://paperswithcode.com/paper/internimage-exploring-large-scale-vision) | No | 65.0 Box AP
STL-10 | Diffusion ProjectedGAN | No | 6.91 FID (generation)
ObjectNet | CoCa | Yes | 82.7% top-1 accuracy
MNIST | Heterogeneous ensemble with simple CNN (1) | No | 99.91% accuracy (0.09% error)

Table 1.1 – Top image results

At first glance, these numbers may seem intimidating. After all, many of them are near or close to 99% accurate! Isn’t that too high of a bar for beginning or intermediate machine learning practitioners?

Before we get too carried away with doubt and fear, it’s helpful to understand that most of these accuracy scores came at least five years after the research dataset was published. If we analyze the historical graphs available on Papers With Code, it’s easy to see that when the first researchers published their datasets, initial accuracy scores were closer to 60%. Then, it took many years of hard work, across diverse organizations and teams, to finally produce models capable of hitting the 90s. So, don’t lose heart! If you put in the time, you too can train a model that establishes a new state-of-the-art performance in a given area. This part is science, not magic.

You’ll notice that while some of these models do in fact adopt a Transformer-inspired backend, some do not. Upon closer inspection, you’ll also see that some of these models rely on the pretrain and fine-tune paradigm we’ll be learning about in this book, but not all of them. If you’re new to machine learning, then this discrepancy is something to start getting comfortable with! Robust and diverse scientific debate, perspectives, insights, and observations are critical aspects of maintaining healthy communities and increasing the quality of outcomes across the field as a whole. This means that you can, and should, expect some divergence in methods you come across, and that’s a good thing.

Now that you have a better understanding of top models in computer vision these days, let’s explore one of the earliest methods combining techniques from large language models with vision: contrastive pretraining and natural language supervision.

Contrastive pretraining and natural language supervision

What’s interesting about both modern and classic image datasets, from Fei-Fei Li’s 2006 ImageNet to the LAION-5B as used in 2022 Stable Diffusion, is that the labels themselves are composed of natural language. Said another way, because the scope of the images includes objects from the physical world, the labels necessarily are more nuanced than single digits. Broadly speaking, this type of problem framing is called natural language supervision.

Imagine having a large dataset of tens of millions of images, each provided with captions. Beyond simply naming the objects, a caption gives you more information about the content of the images. A caption can be anything from Stella sits on a yellow couch to Pepper, the Australian pup