Transformers for Natural Language Processing and Computer Vision, Third Edition, explores Large Language Model (LLM) architectures, applications, and various platforms (Hugging Face, OpenAI, and Google Vertex AI) used for Natural Language Processing (NLP) and Computer Vision (CV).
The book guides you through different transformer architectures to the latest Foundation Models and Generative AI. You’ll pretrain and fine-tune LLMs and work through different use cases, from summarization to implementing question-answering systems with embedding-based search techniques. You will also learn the risks of LLMs, from hallucinations and memorization to privacy, and how to mitigate such risks using moderation models with rule and knowledge bases. You’ll implement Retrieval Augmented Generation (RAG) with LLMs to improve the accuracy of your models and gain greater control over LLM outputs.
Dive into generative vision transformers and multimodal model architectures and build applications, such as image and video-to-text classifiers. Go further by combining different models and platforms and learning about AI agent replication.
This book provides you with an understanding of transformer architectures, pretraining, fine-tuning, LLM use cases, and best practices.
Page count: 897
Year of publication: 2024
Transformers for Natural Language Processing and Computer Vision
Third Edition
Explore Generative AI and Large Language Models with Hugging Face, ChatGPT, GPT-4V, and DALL-E 3
Denis Rothman
BIRMINGHAM—MUMBAI
Copyright © 2024 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Publishing Product Manager: Bhavesh Amin
Acquisition Editor – Peer Reviews: Tejas Mhasvekar
Project Editor: Janice Gonsalves
Content Development Editor: Bhavesh Amin
Copy Editor: Safis Editing
Technical Editor: Karan Sonawane
Proofreader: Safis Editing
Indexer: Rekha Nair
Presentation Designer: Ajay Patule
First published: January 2021
Second edition: February 2022
Third edition: February 2024
Revised edition: September 2024
Production reference: 2060924
Published by Packt Publishing Ltd.
Grosvenor House
11 St Paul’s Square
Birmingham
B3 1RB, UK.
ISBN 978-1-80512-872-4
www.packt.com
Denis Rothman graduated from Sorbonne University and Paris Diderot University, designing one of the first patented encoding and embedding systems. He authored one of the first patented AI cognitive robots and bots. He began his career delivering Natural Language Processing (NLP) chatbots for Moët et Chandon and as an AI tactical defense optimizer for Airbus (formerly Aerospatiale).
Denis then authored an AI resource optimizer for IBM and luxury brands, leading to an Advanced Planning and Scheduling (APS) solution used worldwide.
I want to thank the corporations that trusted me from the start to deliver artificial intelligence solutions and shared the risks of continuous innovation. I also want to thank my family, who always believed I would make it.
George Mihaila has 7 years of research experience with transformer models, having worked with them since they came out in 2017. He is a final-year PhD student in computer science, researching transformer models in Natural Language Processing (NLP). His research covers both Generative and Predictive NLP modeling.
He has over 6 years of industry experience working in top companies with transformer models and machine learning, covering a broad area from NLP and Computer Vision to Explainability and Causality. George has worked in both science and engineering roles. He is an end-to-end Machine Learning expert leading Research and Development, as well as MLOps, optimization, and deployment.
He was a technical reviewer for the first and second editions of Transformers for Natural Language Processing by Denis Rothman.
Join our community’s Discord space for discussions with the authors and other readers:
https://www.packt.link/Transformers
Transformer-driven Generative AI models are a game-changer for Natural Language Processing (NLP) and computer vision. Large Language Generative AI transformer models have achieved superhuman performance through services such as ChatGPT with GPT-4V for text, image, data science, and hundreds of domains. We have gone from primitive Generative AI to superhuman AI performance in just a few years!
Language understanding has become the pillar of language modeling, chatbots, personal assistants, question answering, text summarizing, speech-to-text, sentiment analysis, machine translation, and more. The expansion from the early Large Language Models (LLMs) to multimodal (text, image, sound) algorithms has taken AI into a new era.
For the past few years, we have been witnessing the expansion of social networks versus physical encounters, e-commerce versus physical shopping, digital newspapers, streaming versus physical theaters, remote doctor consultations versus physical visits, remote work instead of on-site tasks, and similar trends in hundreds more domains. This digital activity is now increasingly driven by transformer copilots in hundreds of applications.
The transformer architecture began just a few years ago as revolutionary and disruptive. It broke with the past, leaving the dominance of RNNs and CNNs behind. BERT and GPT models abandoned recurrent network layers and replaced them with self-attention. But in 2023, OpenAI GPT-4 propelled AI into new realms with GPT-4V (vision transformer), which is paving the path for functional (everyday tasks) AGI. Google Vertex AI offered similar technology. 2024 is not a new year in AI; it’s a new decade! Meta (formerly Facebook) has released Llama 2, which we can deploy seamlessly on Hugging Face.
Transformer encoders and decoders contain attention heads that can be computed independently of one another, which lets them exploit the parallelism of cutting-edge hardware. Attention heads can run on separate GPUs, opening the door to billion-parameter models and soon-to-come trillion-parameter models.
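As a rough illustration of why attention heads lend themselves to parallel hardware, the following sketch computes scaled dot-product attention per head in plain Python. It is a simplified assumption, not the book's implementation: the learned query, key, value, and output projection matrices of a real transformer are omitted, and the function names (`attention_head`, `split_heads`, `multi_head_attention`) are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_head(q, k, v):
    """Scaled dot-product attention for one head (q, k, v: lists of vectors)."""
    d_k = len(k[0])
    output = []
    for q_i in q:
        # Score the current query against every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q_i, k_j)) / math.sqrt(d_k) for k_j in k]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        output.append(
            [sum(w * v_j[c] for w, v_j in zip(weights, v)) for c in range(len(v[0]))]
        )
    return output


def split_heads(x, num_heads):
    """Slice each token vector into num_heads equal parts, one per head."""
    d_model = len(x[0])
    if d_model % num_heads:
        raise ValueError("d_model must be divisible by num_heads")
    d_head = d_model // num_heads
    return [[row[h * d_head:(h + 1) * d_head] for row in x] for h in range(num_heads)]


def multi_head_attention(x, num_heads):
    """Self-attention with independent heads (assumption: no learned projections)."""
    heads = split_heads(x, num_heads)
    # Each head reads only its own slice, so these calls could run
    # concurrently, for example on separate devices.
    head_outputs = [attention_head(h, h, h) for h in heads]
    # Concatenate the head outputs back into one vector per token.
    merged = []
    for t in range(len(x)):
        token_vector = []
        for head_output in head_outputs:
            token_vector.extend(head_output[t])
        merged.append(token_vector)
    return merged


# Three tokens with d_model = 4, processed by two heads of size 2.
tokens = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
output = multi_head_attention(tokens, num_heads=2)
```

Because no head reads another head's slice, changing the last two dimensions of the input leaves the first head's output untouched. That independence is what allows the work to be spread across devices.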
The increasing amount of data requires training AI models at scale. As such, transformers pave the way to a new era of parameter-driven AI. Learning to understand how hundreds of millions of words and images fit together requires a tremendous number of parameters. Transformer models such as Google Vertex AI PaLM 2 and OpenAI GPT-4V have taken emergence to another level. Transformers can perform hundreds of NLP tasks they were not trained for.
Transformers can also learn image classification and reconstruction by embedding images as sequences of patches, much as text is embedded as sequences of words. This book will introduce you to cutting-edge computer vision transformers such as Vision Transformers (ViTs), CLIP, GPT-4V, DALL-E 3, and Stable Diffusion.
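One way to picture an image being treated as a sequence is the patch splitting used by Vision Transformers: the image is cut into fixed-size patches, and each patch is flattened into a vector that plays the role of a token. The sketch below shows only that splitting step on a tiny grayscale grid. The function name `image_to_patch_sequence` is hypothetical, and the linear projection and positional embeddings that a real ViT applies afterward are left out.

```python
def image_to_patch_sequence(image, patch_size):
    """Split a 2D grid of pixel values into a row-major sequence of flattened patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    sequence = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one patch_size x patch_size block into a single vector.
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            sequence.append(patch)
    return sequence


# A 4x4 "image" whose pixel values are 0..15, split into 2x2 patches.
image = [[row * 4 + col for col in range(4)] for row in range(4)]
patches = image_to_patch_sequence(image, patch_size=2)
# patches is a sequence of 4 "tokens", each a vector of 4 pixel values.
```

The resulting list has the same shape as a sentence of token vectors, which is why the rest of the transformer stack can process it largely unchanged.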
Think of how many humans it would take to control the content of the billions of messages posted on social networks per day to decide if they are legal and ethical before extracting the information they contain.
Think of how many humans would be required to translate the millions of pages published each day on the web. Or imagine how many people it would take to manually control the millions of messages and images made per minute!
Imagine how many humans it would take to write the transcripts of all of the vast amount of hours of streaming published per day on the web. Finally, think about the human resources that would be required to replace AI image captioning for the billions of images that continuously appear online.
This book will take you from developing code to prompt engineering, a new “programming” skill that controls the behavior of a transformer model. Each chapter will take you through the key aspects of language understanding and computer vision from scratch in Python, PyTorch, and TensorFlow.
You will learn the architecture of the Original Transformer, Google BERT, GPT-4, PaLM 2, T5, ViT, Stable Diffusion, and several other models. You will fine-tune transformers, train models from scratch, and learn to use powerful APIs.
You will keep close to the market and its demand for language understanding in many fields, such as media, social media, and research papers. You will learn how to improve Generative AI models with Retrieval Augmented Generation (RAG), embedding-based searches, prompt engineering, and automated ideation with AI-generated prompts.
Throughout the book, you will work hands-on with Python, PyTorch, and TensorFlow. You will be introduced to the key AI language understanding neural network models. You will then learn how to explore and implement transformers.
You will learn the skills required not only to adapt to the present market but also to acquire the vision to face innovative projects and AI evolutions. This book aims to give readers both the knowledge and the vision to select the right models and environment for any given project.
This book is not an introduction to Python programming or machine learning concepts. Instead, it focuses on deep learning for machine translation, speech-to-text, text-to-speech, language modeling, question answering, and many more NLP domains, as well as computer vision multimodal tasks.
Readers who can benefit the most from this book are:
Deep learning, vision, and NLP practitioners familiar with Python programming.
Data analysts, data scientists, and machine learning/AI engineers who want to understand how to process and interrogate the increasing amounts of language-driven and image data.
Chapter 1, What Are Transformers?, explains, at a high level, what transformers and Foundation Models are. We will first unveil the incredible power of the deceptively simple O(1) time complexity of transformer models that changed everything. We will continue to discover how a hardly known transformer algorithm in 2017 rose to dominate so many domains and brought us Foundation Models.
Chapter 2, Getting Started with the Architecture of the Transformer Model, goes through the background of NLP to understand how RNN, LSTM, and CNN architectures were abandoned and how the transformer architecture opened a new era. We will examine the Original Transformer’s architecture through the unique Attention Is All You Need approach invented by the Google Research and Google Brain authors. We will describe the theory of transformers. We will get our hands dirty in Python to see how multi-head attention sublayers work.
Chapter 3, Emergent vs Downstream Tasks: The Unseen Depths of Transformers, bridges the gap between the functional and mathematical architecture of transformers by introducing emergence. We will then see how to measure the performance of transformers before exploring several downstream tasks, such as the Stanford Sentiment TreeBank (SST-2), linguistic acceptability, and Winograd schemas.
Chapter 4, Advancements in Translations with Google Trax, Google Translate, and Gemini, goes through machine translation in three steps. We will first define what machine translation is. We will then preprocess a Workshop on Machine Translation (WMT) dataset. Finally, we will see how to implement machine translations.
Chapter 5, Diving into Fine-Tuning through BERT, builds on the architecture of the Original Transformer. Bidirectional Encoder Representations from Transformers (BERT) takes transformers into a vast new way of perceiving the world of NLP. Instead of analyzing a past sequence to predict a future sequence, BERT attends to the whole sequence! We will first go through the key innovations of BERT’s architecture and then fine-tune a BERT model by going through each step in a Google Colaboratory notebook. Like humans, BERT can learn tasks and perform other new ones without having to learn the topic from scratch.
Chapter 6, Pretraining a Transformer from Scratch through RoBERTa, builds a RoBERTa transformer model from scratch using the Hugging Face PyTorch modules. The transformer will be both BERT-like and DistilBERT-like. First, we will train a tokenizer from scratch on a customized dataset. Finally, we will put the knowledge acquired in this chapter to work and pretrain a Generative AI customer support model on X (formerly Twitter) data.
Chapter 7, The Generative AI Revolution with ChatGPT, goes through the tremendous improvements and diffusion of ChatGPT models into the everyday lives of developers and end-users. We will first examine the architecture of OpenAI’s GPT models before working with the GPT-4 API and its hyperparameters to implement several NLP examples. Finally, we will learn how to obtain better results with Retrieval Augmented Generation (RAG). We will implement an example of automated RAG with GPT-4.
Chapter 8, Fine-Tuning OpenAI GPT Models, explores fine-tuning so that we can decide whether or not a project should go in this direction. We will introduce risk management perspectives. We will prepare a dataset and fine-tune a cost-effective babbage-02 model for a completion task.
Chapter 9, Shattering the Black Box with Interpretable Tools, lifts the lid on the black box that is transformer models by visualizing their activity. We will use BertViz to visualize attention heads, Language Interpretability Tool (LIT) to carry out a Principal Component Analysis (PCA), and LIME to visualize transformers via dictionary learning. OpenAI LLMs will take us deeper and visualize the activity of a neuron in a transformer with an interactive interface. This approach opens the door to GPT-4 explaining a transformer, for example.
Chapter 10, Investigating the Role of Tokenizers in Shaping Transformer Models, introduces some tokenizer-agnostic best practices to measure the quality of a tokenizer. We will describe basic guidelines for datasets and tokenizers from a tokenization perspective. We will explore word and subword tokenizers and show how a tokenizer can shape a transformer model’s training and performance. Finally, we will build a function to display and control token-ID mappings.
Chapter 11, Leveraging LLM Embeddings as an Alternative to Fine-Tuning, explains why searching with embeddings can sometimes be a very effective alternative to fine-tuning. We will go through the advantages and limits of this approach. We will go through the fundamentals of text embeddings. We will build a program that reads a file, tokenizes it, and embeds it with Gensim and Word2Vec. We will implement a question-answering program on sports events and use OpenAI Ada to embed Amazon fine food reviews. By the end of the chapter, we will have taken a system from prompt design to advanced prompt engineering using embeddings for RAG.
Chapter 12, Toward Syntax-Free Semantic Role Labeling with ChatGPT and GPT-4, goes through the revolutionary concepts of syntax-free, nonrepetitive stochastic models. We will use ChatGPT Plus with GPT-4 to run easy to complex Semantic Role Labeling (SRL) samples. We will see how a general-purpose, emergent model reacts to our SRL requests. We will progressively push the transformer model to the limits of SRL.
Chapter 13, Summarization with T5 and ChatGPT, goes through the concepts and architecture of the T5 transformer model. We will then apply T5 to summarize documents with Hugging Face models. The examples in this chapter will be legal and medical to explore domain-specific summarization beyond simple texts. We are not looking for an easy way to implement NLP but preparing ourselves for the reality of real-life projects. We will then compare T5 and ChatGPT approaches to summarization.
Chapter 14, Exploring Cutting-Edge LLMs with Vertex AI and PaLM 2, examines Pathways to understand PaLM. We will continue and look at the main features of PaLM (Pathways Language Model), a decoder-only, densely activated, and autoregressive transformer model with 540 billion parameters trained on Google’s Pathways system. We will see how Google PaLM 2 can perform a chat task, a discriminative task (such as classification), a completion task (also known as a generative task), and more. We will implement the Vertex AI PaLM 2 API for several NLP tasks, including question-answering and summarization. Finally, we will go through Google Cloud’s fine-tuning process.
Chapter 15, Guarding the Giants: Mitigating Risks in Large Language Models, examines the risks of LLMs, risk management, and risk mitigation tools. The chapter explains hallucinations, memorization, risky emergent behavior, disinformation, influence operations, harmful content, adversarial attacks (“jailbreaks”), privacy, cybersecurity, and overreliance. We will then go through some risk mitigation tools through advanced prompt engineering, such as implementing a moderation model, a knowledge base, keyword parsing, prompt pilots, post-processing moderation, and embeddings.
Chapter 16, Beyond Text: Vision Transformers in the Dawn of Revolutionary AI, explores the innovative transformer models that respect the basic structure of the Original Transformer but make some significant changes. We will discover powerful computer vision transformers like ViT, CLIP, DALL-E, and GPT-4V. We will implement vision transformers in code, including GPT-4V, and expand the text-image interactions of DALL-E 3 to divergent semantic association. We will take OpenAI models into the nascent world of highly divergent semantic association creativity.
Chapter 17, Transcending the Image-Text Boundary with Stable Diffusion, delves into diffusion models, introducing Stable Vision, which has created a disruptive generative image AI wave rippling through the market. We will then dive into the principles, math, and code of the remarkable Keras Stable Diffusion model. We will go through each of the main components of a Stable Diffusion model, peek into the source code provided by Keras, and run the model. We will run a text-to-video synthesis model with Hugging Face and a video-to-text task with Meta’s TimeSformer.
Chapter 18, Hugging Face AutoTrain: Training Vision Models without Coding, explores how to train a vision transformer using Hugging Face’s AutoTrain. We will go through the automated training process and discover the unpredictable problems that show why even automated ML requires human AI expertise. The goal of this chapter is also to show how to probe the limits of a computer vision model, no matter how sophisticated it is.
Chapter 19, On the Road to Functional AGI with HuggingGPT and its Peers, shows how we can use cross-platform chained models to solve difficult image classification problems. We will put HuggingGPT and Google Cloud Vision to work to identify easy, difficult, and very difficult images. We will go beyond classical pipelines and explore how to chain heterogeneous competing models.
Chapter 20, Beyond Human-Designed Prompts with Generative Ideation, explores generative ideation, an ecosystem that automates the production of an idea to text and image content. The development phase requires highly skilled human AI experts. For an end user, the ecosystem is a click-and-run experience. By the end of this chapter, we will be able to deliver ethical, exciting, generative ideation to companies with no marketing resources. We will be able to expand generative ideation to any field in an exciting, cutting-edge, yet ethical ecosystem.
Appendix A, Revolutionizing AI, The Power of Optimized Time Complexity in Transformer Models, gives you a detailed explanation of O(1) time complexity, explains what it is, how it works, and why it’s better than the O(n) alternative. This appendix also explores the token-to-token approach used by transformers.
Appendix B, Answers to the Questions, provides answers to all of the questions that you will find at the end of each chapter.
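The embedding-based search and Retrieval Augmented Generation (RAG) mentioned in the summaries of Chapters 7 and 11 can be pictured with a minimal sketch: embed the documents and the query, rank the documents by cosine similarity, and prepend the best match to the prompt. Everything here is an illustrative assumption rather than the book's code. In particular, `toy_embed` is a hypothetical bag-of-words stand-in for a learned embedding model, and the prompt template is made up for the example.

```python
import math
from collections import Counter


def toy_embed(text):
    """Stand-in embedding (assumption): word counts instead of a learned vector."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, documents, top_k=1):
    """Return the top_k documents most similar to the query."""
    query_vector = toy_embed(query)
    ranked = sorted(
        documents,
        key=lambda doc: cosine_similarity(query_vector, toy_embed(doc)),
        reverse=True,
    )
    return ranked[:top_k]


def build_augmented_prompt(query, documents):
    """Prepend the retrieved context to the question (hypothetical template)."""
    context = "\n".join(retrieve(query, documents))
    return f"Context:\n{context}\n\nQuestion: {query}"


documents = [
    "transformers use self-attention instead of recurrence",
    "stable diffusion generates images from text prompts",
    "bert is fine-tuned for downstream tasks",
]
query = "how do transformers use attention"
prompt = build_augmented_prompt(query, documents)
```

In a real system, the stand-in embedding would be replaced by vectors from an embedding model, and the augmented prompt would be sent to an LLM. The ranking and prompt-assembly logic stays the same in spirit.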
Most of the programs in the book are Jupyter notebooks. All you will need is a free Google Gmail account, and you will be able to run the notebooks on Google Colaboratory’s free VM.
Take the time to read Chapter 2, Getting Started with the Architecture of the Transformer Model. Chapter 2 contains the description of the Original Transformer. If you find it difficult, then pick up the general intuitive ideas from the chapter. You can then go back to that chapter when you feel more comfortable with transformers after a few chapters.
After reading each chapter, consider how you could implement transformers for your customers or use them to move up in your career with novel ideas.
The code bundle for the book is hosted on GitHub at https://github.com/Denis2054/Transformers-for-NLP-and-Computer-Vision-3rd-Edition. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
We also provide a PDF file that contains color images of the screenshots/diagrams used in this book. You can download it here: https://packt.link/gbp/9781805128724.
There are several text conventions used throughout this book.
CodeInText: Indicates sentences and words run through the models in the book, code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. For example, “However, if you wish to explore the code, you will find it in the Google Colaboratory positional_encoding.ipynb notebook and the text.txt file in this chapter’s GitHub repository.”
A block of code is set as follows:
import numpy as np
from scipy.special import softmax
When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:
The black cat sat on the couch and the brown dog slept on the rug.
Any command-line input or output is written as follows:
vector similarity
[[0.9627094]]
final positional encoding similarity
Bold: Indicates a new term, an important word, or words that you see on the screen.
For instance, words in menus or dialog boxes also appear in the text like this. For example:
“In our case, we are looking for t5-large, a t5-large model we can smoothly run in Google Colaboratory.”
Warnings or important notes appear like this.
Tips and tricks appear like this.
Feedback from our readers is always welcome.
General feedback: Email [email protected] and mention the book’s title in the subject of your message. If you have questions about any aspect of this book, please email us at [email protected].
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details.
Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit http://authors.packtpub.com.
Once you’ve read Transformers for Natural Language Processing and Computer Vision - Third Edition, we’d love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.
Your review is important to us and the tech community and will help us make sure we’re delivering excellent quality content.
Thanks for purchasing this book!
Do you like to read on the go but are unable to carry your print books everywhere?
Is your eBook purchase not compatible with the device of your choice?
Don’t worry, now with every Packt book you get a DRM-free PDF version of that book at no cost.
Read anywhere, any place, on any device. Search, copy, and paste code from your favorite technical books directly into your application.
The perks don’t stop there. You can get exclusive access to discounts, newsletters, and great free content in your inbox daily.
Follow these simple steps to get the benefits:
Scan the QR code or visit the link below:
https://packt.link/free-ebook/9781805128724
Submit your proof of purchase.
That’s it! We’ll send your free PDF and other benefits to your email directly.