ChatGPT and the GPT models by OpenAI have brought about a revolution not only in how we write and research but also in how we can process information. This book discusses the functioning, capabilities, and limitations of LLMs underlying chat systems, including ChatGPT and Gemini. It demonstrates, in a series of practical examples, how to use the LangChain framework to build production-ready and responsive LLM applications for tasks ranging from customer support to software development assistance and data analysis – illustrating the expansive utility of LLMs in real-world applications.
Unlock the full potential of LLMs within your projects as you navigate through guidance on fine-tuning, prompt engineering, and best practices for deployment and monitoring in production environments. Whether you're building creative writing tools, developing sophisticated chatbots, or crafting cutting-edge software development aids, this book will be your roadmap to mastering the transformative power of generative AI with confidence and creativity.
Page count: 461
Year of publication: 2023
Generative AI with LangChain
Build large language model (LLM) apps with Python, ChatGPT, and other LLMs
Ben Auffarth
BIRMINGHAM—MUMBAI
Generative AI with LangChain
Copyright © 2024 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Senior Publishing Product Manager: Tushar Gupta
Acquisition Editor – Peer Reviews: Tejas Mhasvekar
Project Editor: Namrata Katare
Content Development Editors: Tanya D’cruz and Elliot Dallow
Copy Editor: Safis Editing
Technical Editor: Kushal Sharma
Proofreader: Safis Editing
Indexer: Manju Arasan
Presentation Designer: Ajay Patule
Developer Relations Marketing Executive: Monika Sangwan
First published: December 2023
Revised publication: September 2024
Production reference: 3040924
Published by Packt Publishing Ltd.
Grosvenor House
11 St Paul’s Square
Birmingham
B3 1RB, UK.
ISBN 978-1-83508-346-8
www.packt.com
To Diane and Nico
– Ben Auffarth
Ben Auffarth is a seasoned data science leader with a background and Ph.D. in computational neuroscience. Ben has analyzed terabytes of data, simulated brain activity on supercomputers with up to 64k cores, designed and conducted wet lab experiments, built production systems processing underwriting applications, and trained neural networks on millions of documents. He’s the author of the books Machine Learning for Time Series and Artificial Intelligence with Python Cookbook. He now works in insurance at Hastings Direct.
Creating this book has been a long and sometimes arduous journey, but also an exciting one. It has been enriched immeasurably by the contributions of several key individuals to whom I owe great thanks. Foremost, I extend my heartfelt gratitude to Leo, whose insightful feedback significantly refined this book. I am equally delighted with my astute editors — Tanya, Elliot, and Kushal. Their efforts went above and beyond expectations. Tanya, in particular, was instrumental in guiding me through the writing process, continually challenging me to clarify my thoughts and significantly shaping the final product.
Leonid Ganeline is a machine learning engineer with extensive experience in natural language processing. He has worked in several start-ups, creating models and production systems. He is an active contributor to LangChain and several other open-source projects. His interest lies in model evaluation, especially in LLM evaluation.
I would like to express my gratitude to my parents, for teaching me how to think rationally, and to my wife, for supporting me in this endeavor.
Ruchi Bhatia is a computer engineer with a Master’s degree in information systems management from Carnegie Mellon University. Currently, she is leveraging her skills as a product marketing manager in the rapidly evolving field of data science and AI at HP. She takes pride in being the youngest triple Kaggle Grandmaster across the Notebooks, Datasets, and Discussion categories. Her previous role as the Leader of Data Science at OpenMined allowed her to steer a team of data scientists to create innovative and impactful solutions.
I want to take a moment to express my heartfelt thanks to my parents. Their unwavering support and encouragement throughout my journey have been invaluable. Without their belief in my abilities and their constant guidance, I wouldn’t have achieved the milestones I have today. Thank you, Mom and Dad, for always being there for me.
Join our community's Discord space for discussions with the authors and other readers:
https://packt.link/lang
Preface
Who this book is for
What this book covers
To get the most out of this book
Work with notebooks and projects
Get in touch
What Is Generative AI?
Introducing generative AI
What are generative models?
Why now?
Understanding LLMs
How do GPT models work?
Transformers
Pre-training
Tokenization
Conditioning
How have GPT models evolved?
Model size
The GPT model series
PaLM and Gemini
Llama and Llama 2
Claude 1–3
Mixture of Experts (MoE)
How to use LLMs
What are text-to-image models?
What can AI do in other domains?
Summary
Questions
LangChain for LLM Apps
Going beyond stochastic parrots
What are the limitations of LLMs?
How can we mitigate LLM limitations?
What is an LLM app?
What is LangChain?
Exploring key components of LangChain
What are chains?
What are agents?
What is memory?
What are tools?
How does LangChain work?
LangChain package structure
Comparing LangChain with other frameworks
Summary
Questions
Getting Started with LangChain
How to set up the dependencies for this book
Exploring cloud integrations
Environment setup and API keys
OpenAI
Hugging Face
Building blocks for LLM interaction
LLMs
Fake LLM
Chat models
Prompts
Chains
LangChain Expression Language
Text-to-Image
Dall-E
Replicate
Image understanding
Running local models
Hugging Face Transformers
llama.cpp
GPT4All
Prototyping an application for customer service
Sentiment analysis
Text classification
Document summarization
Applying map-reduce
Monitoring token usage
Summary
Questions
Building Capable Assistants
Answering questions with tools
Tool use
Defining custom tools
Tool decorator
Subclassing BaseTool
StructuredTool dataclass
Error handling
Implementing a research assistant with tools
Building a visual interface
Exploring agent architectures
Extracting structured information from documents
Mitigating hallucinations through fact-checking
Summary
Questions
Building a Chatbot Like ChatGPT
What is a chatbot?
From vectors to RAG
Vector embeddings
Embeddings in LangChain
Vector storage
Vector indexing
Vector libraries
Vector databases
Document loaders
Retrievers in LangChain
kNN retriever
PubMed retriever
Custom retrievers
Implementing a chatbot with a retriever
Document loaders
Vector storage
Conversation Memory: Preserving Context
ConversationBufferMemory
ConversationBufferWindowMemory
ConversationSummaryMemory
ConversationKGMemory
CombinedMemory
Long-term persistence
Moderating responses
Guardrails
Summary
Questions
Developing Software with Generative AI
Software development and AI
Code LLMs
Writing code with LLMs
Vertex AI
StarCoder
StarChat
Llama 2
Small local model
Automating software development
Implementing a feedback loop
Tool use
Error handling
Finishing touches to our developer
Summary
Questions
LLMs for Data Science
The impact of generative models on data science
Automated data science
Data collection
Visualization and EDA
Preprocessing and feature extraction
AutoML
Using agents to answer data science questions
Data exploration with LLMs
Summary
Questions
Customizing LLMs and Their Output
Conditioning LLMs
Methods for conditioning
Reinforcement learning with human feedback
Low-rank adaptation
Inference-time conditioning
Fine-tuning
Setup for fine-tuning
Open-source models
Commercial models
Prompt engineering
Prompt techniques
Zero-shot prompting
Few-shot learning
CoT prompting
Self-consistency
ToT
Summary
Questions
Generative AI in Production
How to get LLM apps ready for production
How to evaluate LLM apps
Comparing two outputs
Comparing against criteria
String and semantic comparisons
Running evaluations against datasets
How to deploy LLM apps
FastAPI web server
Ray
How to observe LLM apps
Tracking responses
Observability tools
LangSmith
PromptWatch
Summary
Questions
The Future of Generative Models
The current state of generative AI
Challenges
Trends in model development
Big Tech vs. small enterprises
Artificial General Intelligence
Economic consequences
Creative industries
Education
Law
Manufacturing
Medicine
Military
Societal implications
Misinformation and cybersecurity
Regulations and implementation challenges
The road ahead
Other Books You May Enjoy
Index
Cover
Index
Once you’ve read Generative AI with LangChain, we’d love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.
Your review is important to us and the tech community and will help us make sure we’re delivering excellent quality content.
Thanks for purchasing this book!
Do you like to read on the go but are unable to carry your print books everywhere? Is your eBook purchase not compatible with the device of your choice?
Don’t worry, now with every Packt book you get a DRM-free PDF version of that book at no cost.
Read anywhere, any place, on any device. Search, copy, and paste code from your favorite technical books directly into your application.
The perks don’t stop there – you can get exclusive access to discounts, newsletters, and great free content in your inbox daily.
Follow these simple steps to get the benefits:
Scan the QR code or visit the link below:
https://packt.link/free-ebook/9781835083468
Submit your proof of purchase
That’s it! We’ll send your free PDF and other benefits to your email directly.
Over the past decade, deep learning has evolved massively to process and generate unstructured data like text, images, and video. These advanced AI models, which include large language models (LLMs), have gained popularity in various industries. There is currently a significant amount of fanfare in both the media and the industry surrounding AI, and there’s a fair case to be made that Artificial Intelligence (AI), with these advancements, is about to have a wide-ranging and major impact on businesses, societies, and individuals alike. This is driven by numerous factors, including advancements in technology, high-profile applications, and the potential for transformative impacts across multiple sectors.
In this chapter, we’ll explore generative models and their basics. We’ll provide an overview of the technical concepts and training approaches that power these models’ ability to produce novel content. While we won’t be diving deep into generative models for sound or video, we aim to convey a high-level understanding of how techniques like neural networks, large datasets, and computational scale enable generative models to reach new capabilities in text and image generation. The goal is to demystify the underlying magic that allows these models to generate remarkably human-like content across various domains. With this foundation, readers will be better prepared to consider both the opportunities and challenges posed by this rapidly advancing technology.
We’ll follow this structure:
Introducing generative AI
Understanding LLMs
Model development
What are text-to-image models?
What can AI do in other domains?
Let’s start from the beginning – with the terminology!
In the media, there is substantial coverage of AI-related breakthroughs and their potential implications. These range from advancements in Natural Language Processing (NLP) and computer vision to the development of sophisticated language models like GPT-4. Particularly, generative models have received a lot of attention due to their ability to generate text, images, and other creative content that is often indistinguishable from human-generated content. These same models also provide wide functionality, including semantic search, content manipulation, and classification. This allows cost savings with automation and allows humans to leverage their creativity to an unprecedented level.
Generative AI refers to algorithms that can generate novel content, as opposed to analyzing or acting on existing data like more traditional, predictive machine learning or AI systems.
Benchmarks capturing task performance in different domains have been major drivers of the development of these models. The Massive Multitask Language Understanding (MMLU) benchmark is a comprehensive suite of 57 tasks spanning diverse domains like math, history, computer science, and law. It serves as a standardized way to evaluate the multitask performance and broad capabilities of LLMs in both zero-shot and few-shot settings. The MMLU benchmark’s importance lies in providing a challenging and multifaceted test of a model’s understanding and problem-solving abilities across a wide range of topics. It allows for systematic comparisons between different LLMs and tracks progress in developing models with robust language understanding and reasoning skills beyond narrow domains.
The following graph, inspired by a blog post titled GPT-4 Predictions by Stephen McAleese on LessWrong, shows the improvements of LLMs in the benchmark:
Figure 1.1: Average performance on the MMLU benchmark of LLMs
Please note that results should be taken with a pinch of salt since they are self-reported and are obtained either by 5-shot or 0-shot conditioning. Most benchmark results come from 5-shot (indicated by an “o”). A few, like the GPT-2, PaLM, and PaLM-2 results, refer to zero-shot (“x”).
From the preceding graph, we can see significant improvements in recent years on the MMLU benchmark. In particular, it highlights the progress of the models provided through a public user interface by OpenAI, especially the improvements between releases, from GPT-2 to GPT-3 and GPT-3.5 to GPT-4.
The graph shows the MMLU performance of models that were prompted either with a question directly (zero-shot) or together with examples – typically 5 (few-shot). The added examples result in a roughly 20% boost in the model’s performance, according to Measuring Massive Multitask Language Understanding (Hendrycks et al., revised in 2023).
It is difficult to definitively declare the strongest LLM among Claude 3, GPT-4, and Gemini, as their performances appear to be closely matched and vary across different tasks. Ultimately, the choice of the strongest LLM may depend on specific use cases and requirements, including their costs.
There are a few differences between these models and the way they are trained that can account for differences in performance, such as scale, instruction tuning, a tweak to the attention mechanisms, and the choice of training data. First and foremost, the massive scaling up of parameters from 1.5 billion (GPT-2) to 175 billion (GPT-3) to more than a trillion (GPT-4) enables models to learn more complex patterns; however, another major change in early 2022 was the post-training fine-tuning of models based on human instructions, which teaches the model how to perform a task by providing demonstrations and feedback.
Across benchmarks, a few models have recently started to perform better than an average human rater, but generally, they still haven’t reached the performance of a human expert. These achievements of human engineering are impressive; however, it should be noted that the performance of these models depends on the field; most models are still performing poorly on the GSM8K benchmark of grade school math word problems. As AI models like OpenAI’s GPT continue to improve, they could become indispensable assets to teams in need of diverse knowledge and skills.
You could consider strong LLMs like GPT-4 or Claude 3 polymaths that work tirelessly without demanding compensation (beyond subscription or API fees), providing competent assistance in subjects like mathematics and statistics, macroeconomics, biology, and law (GPT-4, for example, performs well on the Uniform Bar Exam). As these AI models become more proficient and easily accessible, they are likely to play a significant role in shaping the future of work and learning.
By making knowledge more accessible and adaptable, these models have the potential to level the playing field and create new opportunities for people from all walks of life. These models have shown potential in areas that require high levels of reasoning and understanding, although progress varies depending on the complexity of the tasks involved.
As for generative models that work with images, they have pushed the boundaries of what is possible in creating visual content and have improved performance in computer vision tasks such as object detection, segmentation, captioning, and much more.
Let’s clear up the terminology a bit and explain in more detail what is meant by generative models, artificial intelligence, deep learning, and machine learning.
In popular media, the term artificial intelligence is used a lot when referring to these new models. In theoretical and applied research circles, it is often joked that AI is just a fancy word for ML, or AI is ML in a suit, as illustrated in this image:
Figure 1.2: ML in a suit. Generated by a model on replicate.com, Diffusers Stable Diffusion v2.1
It’s worth distinguishing more clearly between the terms generative model, artificial intelligence, machine learning, deep learning, and language model:
Artificial Intelligence (AI) is a broad field of computer science focused on creating intelligent agents that can reason, learn, and act autonomously.
Machine Learning (ML) is a subset of AI focused on developing algorithms that can learn from data.
Deep Learning (DL) uses deep neural networks, which have many layers, as a mechanism for ML algorithms to learn complex patterns from data.
Generative Models are a type of ML model that can generate new data based on patterns learned from input data.
Language Models (LMs) are statistical models used to predict words in a sequence of natural language. Some language models utilize deep learning and are trained on massive datasets, becoming LLMs.
The following class diagram illustrates how LLMs combine deep learning techniques like neural networks with sequence modeling objectives from language modeling at a very large scale:
Figure 1.3: Class diagram of different models. LLMs represent the intersection of deep learning techniques with language modeling objectives
Generative models are a powerful type of AI that can generate new data that resembles the training data. Generative AI models have come a long way, enabling the generation of new examples from scratch using patterns in data. These models can handle different data modalities and are employed across various domains, including text, image, music, and video. Their key distinction is that generative models synthesize new data rather than just making predictions or decisions. This enables applications like generating text, images, music, and video.
Generative models can facilitate the creation of synthetic data to train AI models when real data is scarce or restricted. This type of data generation reduces labeling costs and improves training efficiency. Microsoft Research took this approach (Textbooks Are All You Need, June 2023) when training their phi-1 model; they used GPT-3.5 to create synthetic Python textbooks and exercises.
The rapid progress across diverse domains shows the potential of generative AI. Within the industry, there is a growing sense of excitement around AI’s capabilities and its potential impact on business operations. But there are key challenges such as data availability, compute requirements, bias in data, evaluation difficulties, potential misuse, and other societal impacts that need to be addressed going forward, which we’ll discuss in Chapter 10, The Future of Generative Models.
Generative AI is extensively used in generating 3D images, avatars, videos, graphs, and illustrations for virtual or augmented reality, video games, graphic design, logo creation, and image editing or enhancement. The most popular model category here is for text-conditioned image synthesis, specifically text-to-image generation. As mentioned, in this book, we’ll focus on LLMs, since they have the broadest practical application, but we’ll also have a look at image models, which sometimes can be quite useful.
Let’s delve a bit more into this progress and pose the question: why is it happening now, and what conditions have made this advancement possible?
The success of generative AI is due to several factors, including:
Improved algorithms
Considerable advances in computer power and hardware design
The availability of large, labeled datasets
An active and collaborative research community
Additionally, the development of more sophisticated mathematical and computational methods has played a vital role in advancing generative models. An example is the backpropagation algorithm, which was introduced in the 1980s and provides a way to effectively train multi-layer neural networks.
In the 2000s, neural networks began to regain popularity as researchers developed more complex architectures. However, it was the advent of deep learning, a type of neural network with numerous layers, that marked a significant turning point in the performance and capabilities of these models.
Although the concept of deep learning has existed for some time, the development and expansion of generative models correlate with significant advances in hardware, particularly Graphics Processing Units (GPUs), which have been instrumental in the development of deeper models. This is because deep learning models require a lot of computing power to train and run. This concerns all aspects of processing power, memory, and disk space.
The capabilities of LLMs changed dramatically once they became bigger. The more parameters a model has, the higher its capacity to capture relationships between words and phrases. As a simple example of these higher-order correlations, an LLM could learn that the word “cat” is more likely to be followed by the word “dog” if it is preceded by the word “chase,” even if there are other words in between. Generally, the lower a model’s perplexity, the better it will perform, for example, in terms of answering questions.
Particularly, it seems that in models with between 2 and 7 billion parameters, new capabilities emerge such as the ability to generate different creative text in formats like poems, code, scripts, musical pieces, emails, and letters, and to answer even open-ended and challenging questions in an informative way.
LLMs are deep neural networks that are adept at understanding and generating human language. These models have practical applications in fields like content creation and NLP, where the ultimate goal is to create algorithms capable of understanding and generating natural language text.
The current generation of LLMs such as GPT-4 and others are deep neural network architectures that utilize the transformer model and undergo pre-training using unsupervised learning on extensive text data, enabling the model to learn language patterns and structures. Models have evolved rapidly, enabling the creation of versatile foundational AI models that are suitable for a wide range of downstream tasks and modalities, ultimately driving innovation across various applications and industries.
The notable strength of the latest generation of LLMs as conversational interfaces (chatbots) lies in their ability to generate coherent and contextually appropriate responses, even in open-ended conversations. By generating the next word based on the preceding words repeatedly, the model produces fluent and coherent text that is often indistinguishable from text produced by humans.
At its core, language modeling, and more broadly NLP, relies heavily on the quality of representation learning. A generative language model encodes information about the text that it has been trained on and generates new text based on what it has learned, thereby taking on the task of text generation.
Representation learning is about a model learning its internal representations of raw data to perform a machine learning task, rather than relying only on engineered feature extraction. For example, an image classification model based on representation learning might learn to represent images according to visual features like edges, shapes, and textures. The model isn’t told explicitly what features to look for – it learns representations of the raw pixel data that help it make predictions.
Recently, LLMs have been used in tasks like copywriting, code development, translation, and understanding genetic sequences. More broadly, applications of language models involve multiple areas, such as:
Question answering: AI chatbots and virtual assistants can provide personalized and efficient assistance, reducing response times in customer support and thereby enhancing customer experience. These systems can be used in specific contexts like restaurant reservations and ticket booking.
Automatic summarization: Language models can create concise summaries of articles, research papers, and other content, enabling users to consume and understand information rapidly.
Sentiment analysis: By analyzing opinions and emotions in texts, language models can help businesses understand customer feedback and opinions more efficiently.
Topic modeling: LLMs can discover abstract topics and themes across a corpus of documents. They identify word clusters and latent semantic structures.
Semantic search: LLMs can focus on understanding meaning within individual documents. They use NLP to interpret words and concepts for improved search relevance.
Machine translation: Language models can translate texts from one language into another, supporting businesses in their global expansion efforts. New generative models can perform on par with commercial products (for example, Google Translate).
Despite their remarkable achievements, language models still face limitations when dealing with complex mathematical or logical reasoning tasks. It remains uncertain whether continually increasing the scale of language models will inevitably lead to new reasoning capabilities. Further, LLMs are known to return the most probable answers within the context, which can sometimes yield fabricated information, called hallucinations. This is a feature as well as a bug since it highlights their creative potential.
We’ll talk about hallucinations in Chapter 5, Building a Chatbot Like ChatGPT, but for now, let’s discuss the nitty-gritty details – how do these LLMs work under the hood?
A new deep learning architecture called the Transformer emerged in 2017, introduced by researchers at Google and the University of Toronto in an article called Attention Is All You Need (Vaswani et al.). It uses self-attention, allowing it to focus on the important parts of a sentence and understand how words relate to each other.
In 2018, researchers took transformers to the next level by creating Generative Pre-trained Transformers (GPTs) (in Improving Language Understanding by Generative Pre-Training; Radford et al.). These models are trained by predicting the next word in a sequence, like a massive guessing game that helps them grasp language patterns. After this pre-training process, GPTs can be further refined for specific tasks like translation or sentiment analysis. This combines unsupervised learning (pre-training) and supervised learning (fine-tuning) for better performance across various tasks. It also reduces the difficulty of training LLMs.
Models based on transformers outperformed previous approaches, such as recurrent neural networks, particularly Long Short-Term Memory (LSTM) networks. Recurrent networks such as LSTMs have limited memory, which can be problematic for long sentences or complex ideas where earlier information is still relevant.
Transformers work differently, which means they take advantage of the full context, and they can keep learning and refining their understanding as they process more words in a sentence. This ability to leverage the entire context throughout the sentence leads to better performance for tasks like translation, summarization, and question-answering. The model can capture the nuances of longer sentences and complex relationships between words. In essence, a key reason for the success of transformers has been their ability to maintain performance across long sequences better than other models, for example, recurrent neural networks.
The transformer model architecture has an encoder-decoder structure, where the encoder maps an input sequence to a sequence of hidden states, and the decoder maps the hidden states to an output sequence. The hidden state representations consider not only the inherent meaning of the words (their semantic value) but also their context in the sequence.
The encoder is made up of identical layers, each with two sub-layers. The first sub-layer applies a self-attention mechanism to the input embeddings, and the second is a fully connected feed-forward network. Each sub-layer is followed by a residual connection and layer normalization: the output of each sub-layer is the sum of the sub-layer’s input and its output, which is then normalized.
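The residual-plus-normalization pattern can be sketched in a few lines of code. The following is a minimal illustration in PyTorch (the paper itself does not prescribe a framework); the dimensions and layer sizes are illustrative only:

import torch
from torch import nn

class EncoderLayer(nn.Module):
    """One encoder layer: self-attention and feed-forward sub-layers,
    each followed by a residual connection and layer normalization."""
    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn_out, _ = self.self_attn(x, x, x)   # each position attends to all positions
        x = self.norm1(x + attn_out)            # residual connection + layer normalization
        x = self.norm2(x + self.ff(x))          # same pattern for the feed-forward sub-layer
        return x

# Illustrative usage: a batch of 2 sequences, 10 tokens each, embedding size 512
hidden = EncoderLayer()(torch.randn(2, 10, 512))
print(hidden.shape)  # torch.Size([2, 10, 512])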
The decoder uses this encoded information to generate the output sequence one item at a time, using the context of the previously generated items. It also has identical modules, with the same two sub-layers as the encoder. In addition, the decoder has a third sub-layer that performs Multi-Head Attention (MHA) over the output of the encoder stack. The decoder also uses residual connections and layer normalization. The self-attention sub-layer in the decoder is modified to prevent positions from attending to subsequent positions. This masking, combined with the fact that the output embeddings are offset by one position, ensures that the predictions for position i can only depend on the known outputs at positions less than i. These are indicated in the diagram here (source: Yuening Jia, Wikimedia Commons):
Figure 1.4: The Transformer architecture
The architectural features that have contributed to the success of transformers are:
Positional encoding: Since the transformer doesn’t process words sequentially but instead processes all words simultaneously, it lacks any notion of the order of words. To remedy this, information about the position of words in the sequence is injected into the model using positional encodings. These encodings are added to the input embeddings representing each word, thus allowing the model to consider the order of words in a sequence.
Layer normalization: To stabilize the network’s learning, the transformer uses a technique called layer normalization. This technique normalizes the model’s inputs across the features dimension (instead of the batch dimension as in batch normalization), thus improving the overall speed and stability of learning.
MHA: Instead of applying attention once, the transformer applies it multiple times in parallel, improving the model’s ability to focus on different types of information and thus capturing a richer combination of features.
The basic idea behind attention mechanisms is to compute a weighted sum of the value (or content) vectors associated with each position in the input sequence, based on the similarity between the current position and all other positions. This weighted sum, known as the context vector, is then used as an input to the subsequent layers of the model, enabling the model to selectively attend to relevant parts of the input during the decoding process.
To enhance the expressiveness of the attention mechanism, it is often extended to include multiple so-called heads, where each head has its own set of query, key, and value vectors, allowing the model to capture various aspects of the input representation. The individual context vectors from each head are then concatenated or combined in some way to form the final output.
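To make the weighted-sum idea concrete, here is a small NumPy sketch of scaled dot-product attention with several heads; the shapes and the number of heads are arbitrary, and a real model would compute the queries, keys, and values with learned projection matrices:

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Context vectors: a weighted sum of the values, where the weights
    come from the similarity between queries and keys."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)   # similarity between positions
    weights = softmax(scores, axis=-1)               # attention weights per position
    return weights @ V                               # weighted sum of the value vectors

rng = np.random.default_rng(0)
seq_len, d_head, n_heads = 6, 16, 4  # illustrative sizes
heads = []
for _ in range(n_heads):
    Q, K, V = (rng.normal(size=(seq_len, d_head)) for _ in range(3))
    heads.append(scaled_dot_product_attention(Q, K, V))

output = np.concatenate(heads, axis=-1)  # per-head context vectors are concatenated
print(output.shape)                      # (6, 64)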
Early attention mechanisms scaled quadratically with the length of the sequences (context size), rendering them inapplicable to settings with long sequences. Different mechanisms have been tried out to alleviate this. Many LLMs use some form of Multi-Query Attention (MQA), including OpenAI’s GPT-series models, Falcon, SantaCoder, and StarCoder.
MQA is a variant of MHA in which the key and value projections are shared across all attention heads, while each head retains its own query projection. MQA improves the performance and efficiency of language models for various language tasks. By removing the heads dimension from certain computations and optimizing memory usage, MQA allows for up to 11 times better throughput and 30% lower latency in inference tasks compared to baseline models without MQA.
Llama 2 and a few other models use Grouped-Query Attention (GQA). In autoregressive decoding, it is standard practice to cache the key (K) and value (V) pairs of the previous tokens in the sequence to speed up attention computation. However, as the context window or batch size increases, the memory cost associated with this KV cache in MHA models also increases significantly. GQA addresses this by dividing the query heads into groups that share key and value projections, shrinking the cache without much degradation of performance.
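To see why sharing key/value projections helps, here is a rough back-of-the-envelope calculation of KV-cache memory; all dimensions are hypothetical and chosen only for illustration, not taken from any specific model:

# Hypothetical model dimensions for illustration only
n_layers, n_heads, head_dim = 32, 32, 128
bytes_per_value = 2          # e.g., 16-bit precision
seq_len, batch = 4096, 8

def kv_cache_bytes(kv_heads):
    # 2x for keys and values, cached at every layer for every token in the batch
    return 2 * n_layers * kv_heads * head_dim * seq_len * batch * bytes_per_value

mha = kv_cache_bytes(kv_heads=n_heads)   # every head has its own K/V
gqa = kv_cache_bytes(kv_heads=8)         # e.g., 8 K/V groups shared by 32 query heads
print(f"MHA KV cache: {mha / 1e9:.1f} GB, GQA KV cache: {gqa / 1e9:.1f} GB")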
There have been many other proposed approaches to obtain efficiency gains, such as sparse, low-rank self-attention, and latent bottlenecks, to name just a few. Other work has tried to extend sequences beyond the fixed input size; architectures such as transformer-XL reintroduce recursion by storing hidden states of already encoded sentences to leverage them in the subsequent encoding of the next sentences.
The combination of these architectural features allows GPT models to successfully tackle tasks that involve understanding and generating text in human language and other domains. The overwhelming majority of LLMs are transformers, as are many other state-of-the-art models we will encounter in the different sections of this chapter, including models for image, sound, and 3D objects.
As the name suggests, a particularity of GPTs lies in pre-training. Let’s see how these LLMs are trained!
The transformer is trained in two phases using a combination of unsupervised pre-training and discriminative task-specific fine-tuning. The goal during pre-training is to learn a general-purpose representation that transfers to a wide range of tasks.
The unsupervised pre-training can follow different objectives. In Masked Language Modeling (MLM), introduced in BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding by Devlin and others (2019), part of the input is masked out, and the model attempts to predict the missing tokens based on the context provided by the non-masked portion. For example, if the input sentence is “The cat [MASK] over the wall,” the model would ideally learn to predict “jumped” for the mask.
In this case, the training objective minimizes the differences between predictions and the masked tokens according to a loss function. Parameters in the models are then iteratively updated according to these comparisons.
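As a quick hands-on illustration of masked-token prediction (assuming the Hugging Face transformers library and the bert-base-uncased checkpoint, neither of which is prescribed here), the fill-mask pipeline returns the most likely candidates for the masked position:

from transformers import pipeline

# BERT-style models predict the token hidden behind the [MASK] placeholder
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for prediction in fill_mask("The cat [MASK] over the wall."):
    print(f"{prediction['token_str']:>10}  (score: {prediction['score']:.3f})")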
Negative Log-Likelihood (NLL) and Perplexity (PPL) are important metrics used in training and evaluating language models. NLL is a loss function used in ML algorithms; minimizing it is equivalent to maximizing the probability of correct predictions. A low NLL indicates that the network has successfully learned patterns from the training set, so it will accurately predict the labels of the training samples. It’s important to mention that NLL is always non-negative; it reaches zero only when the model assigns probability 1 to every correct token.
PPL, on the other hand, is an exponentiation of NLL, providing a more intuitive way to understand the model’s performance. Small PPL values indicate a well-trained network that can predict accurately, while high values indicate poor learning performance. Intuitively, we could say that a low PPL means that the model is not surprised by the next word. Therefore, the goal in pre-training is to minimize PPL, which means the model’s predictions align more with the actual outcomes.
In comparing different language models, PPL is often used as a benchmark metric across various tasks. It gives us an idea of how well the language model is performing in that a lower PPL indicates the model is more certain of its predictions. Hence, a model with low PPL would be considered better performing than a model with high PPL.
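The relationship between the two metrics is simple: PPL is the exponential of the average NLL per token. A minimal worked example with made-up token probabilities:

import math

# Hypothetical probabilities the model assigned to each correct next token
token_probs = [0.4, 0.25, 0.9, 0.05, 0.6]

nll = -sum(math.log(p) for p in token_probs) / len(token_probs)  # average negative log-likelihood
ppl = math.exp(nll)                                              # perplexity = exp(NLL)

print(f"NLL: {nll:.3f}, PPL: {ppl:.2f}")
# A model that assigned probability 1.0 to every correct token would reach the minimum PPL of 1.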
The first step in training an LLM is tokenization. This process involves building a vocabulary, which maps tokens to unique numerical representations so that they can be processed by the model, given that LLMs are mathematical functions that require numerical inputs and outputs.
Tokenizing a text means splitting it into tokens (words or subwords), which are then converted to IDs through a look-up table mapping words in text to corresponding lists of integers.
Before training the LLM, the tokenizer – more precisely, its dictionary – is typically fitted to the entire training dataset and then frozen. It’s important to note that tokenizers do not produce arbitrary integers. Instead, they output integers within a specific range – from 0 to N, where N represents the vocabulary size of the tokenizer.
Definitions
A token is an instance of a sequence of characters, typically forming a word, punctuation mark, or number. Tokens serve as the base elements for constructing sequences of text.
Tokenization refers to the process of splitting text into tokens. A tokenizer splits on whitespace and punctuation to break text into individual tokens.
Examples
Consider the following text:
“The quick brown fox jumps over the lazy dog!”
This would get split into the following tokens:
[“The”, “quick”, “brown”, “fox”, “jumps”, “over”, “the”, “lazy”, “dog”, “!”]
Each word is an individual token, as is the punctuation mark.
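The split above can be reproduced with a naive whitespace-and-punctuation tokenizer, shown below purely as a toy illustration; production tokenizers, discussed next, operate on subwords rather than whole words:

import re

def simple_tokenize(text: str) -> list[str]:
    """Split into word tokens and individual punctuation marks, as in the example above."""
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("The quick brown fox jumps over the lazy dog!")
print(tokens)
# ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '!']

# Mapping tokens to integer IDs via a look-up table (the vocabulary)
vocab = {token: idx for idx, token in enumerate(sorted(set(tokens)))}
print([vocab[t] for t in tokens])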
There are a lot of tokenizers that work according to different principles, but common types of tokenizers employed in models are Byte-Pair Encoding (BPE), WordPiece, and SentencePiece. For example, Llama 2’s BPE tokenizer splits numbers into individual digits and uses bytes to decompose unknown UTF-8 characters. The total vocabulary size is 32,000 tokens.
It is necessary to point out that LLMs can only generate outputs based on a sequence of tokens that does not exceed its context window. This context window refers to the length of the longest sequence of tokens that an LLM can use. Typical context window sizes for LLMs can range from about 1,000 to 10,000 tokens.
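In practice, it is useful to count tokens before sending text to a model so that the prompt fits into the context window. The sketch below uses OpenAI’s tiktoken library and a 4,096-token limit purely as examples; neither is mandated here, and you would substitute the tokenizer and limit that match your model:

import tiktoken

MAX_CONTEXT_TOKENS = 4096  # example context window size; varies by model

encoding = tiktoken.get_encoding("cl100k_base")  # encoding used by several OpenAI models
prompt = "Summarize the following customer complaint: ..."

n_tokens = len(encoding.encode(prompt))
print(f"{n_tokens} tokens used, {MAX_CONTEXT_TOKENS - n_tokens} remaining in the context window")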
After pre-training, a major step is how models are prepared for specific tasks either by fine-tuning or prompting. Let’s see what this task conditioning is about!
Conditioning LLMs refers to adapting the model for specific tasks. It includes fine-tuning and prompting:
Fine-tuning involves modifying a pre-trained language model by training it on a specific task using supervised learning. For example, to make a model more amenable to chats with humans, the model is trained on examples of tasks formulated as natural language instructions (instruction tuning). For fine-tuning, pre-trained models are usually trained again using Reinforcement Learning from Human Feedback (RLHF) to be helpful and harmless.
Prompting techniques present problems in text form to generative models. There are a lot of different prompting techniques, from simple questions to detailed instructions. Prompts can include examples of similar problems and their solutions. Zero-shot prompting involves no examples, while few-shot prompting includes a small number of examples of relevant problem and solution pairs (a short sketch follows below).
These conditioning methods continue to evolve, becoming more effective and useful for a wide range of applications. Prompt engineering and fine-tuning methods will be explored further in Chapter 8, Customizing LLMs and Their Output.
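To make the distinction concrete, here is a plain-Python sketch of how a zero-shot and a few-shot prompt for the same task might be assembled; the task and wording are invented for illustration:

task = "Classify the sentiment of the review as positive or negative."
review = "The battery died after two days and support never answered."

# Zero-shot: the instruction alone, no examples
zero_shot_prompt = f"{task}\n\nReview: {review}\nSentiment:"

# Few-shot: the same instruction preceded by a couple of solved examples
examples = [
    ("Absolutely love it, works exactly as described.", "positive"),
    ("Broke within a week and the refund was refused.", "negative"),
]
example_block = "\n".join(f"Review: {r}\nSentiment: {s}" for r, s in examples)
few_shot_prompt = f"{task}\n\n{example_block}\n\nReview: {review}\nSentiment:"

print(few_shot_prompt)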
The development of GPT models has seen considerable progress, with OpenAI’s GPT-n series leading the way in creating foundational AI models. A major driver has been the size of models in terms of their parameters; however, other drivers play a role as well.
A foundation model (sometimes known as a base model) is a large model that was trained on an immense quantity of data at scale so that the model can be adapted to a wide range of downstream tasks. In GPT models, this pre-training is done via self-supervised learning.
There has been a recent shift in focus towards exploring alternative approaches to improve model performance on benchmarks like MMLU, beyond simply scaling up the model size. A critical area of focus has been the curation and quality of the training data. Carefully selecting and filtering the training data to ensure its relevance, diversity, and quality can significantly impact the model’s performance, especially on benchmarks that test for a broad range of knowledge and reasoning abilities.
Another key area of innovation has been in model architectures. For example, the Mixtral and Leeroo models employ a mixture-of-experts approach, where different subsets of the model’s parameters are specialized for different tasks, potentially improving performance and computational efficiency.
By exploring these alternative approaches in conjunction with continued scaling efforts, the field is striving to develop language models with even more robust language understanding and reasoning abilities across diverse domains.
The computational requirements and the cost of the model training have been enormous and will probably increase in the future. The computational cost of LLMs is enough to make your wallet weep. But fear not! Before we explore ways to lighten the load, let’s explore what makes these models so weighty in the first place: their size!
The size of the training corpus for LLMs has been increasing drastically. GPT-1, introduced by OpenAI in 2018, was trained on BookCorpus, which has 985 million words. BERT, released in the same year, was trained on a combined corpus of BookCorpus and English Wikipedia, totaling 3.3 billion words. Now, training corpora for LLMs have up to trillions of tokens.
OpenAI has been coy about the technical details of their models; however, information has been circulating that, with about 1.8 trillion parameters, GPT-4 is more than 10x the size of GPT-3. Further, OpenAI was able to keep costs reasonable by utilizing a Mixture of Experts (MoE) model consisting of 16 experts within their model, each having about 111 billion parameters.
Apparently, GPT-4 was trained on about 13 trillion tokens. However, these are not unique tokens since they count repeated presentation of the data in each epoch. Training was conducted for two epochs for text-based data and four for code-based data. For fine-tuning, the dataset consisted of millions of rows of instruction fine-tuning data. Another rumor, again to be taken with a pinch of salt, is that OpenAI might be applying speculative decoding on GPT-4’s inference, with the idea that a smaller model (oracle model) could be predicting the large model’s responses, and these predicted responses could help speed up decoding by feeding them into the larger model, thereby skipping tokens. This is a risky strategy because – depending on the threshold of the confidence of the oracle’s responses – the quality could deteriorate.
The increase in the scale of language models has been a major driving force behind their impressive performance gains, with models like Google’s Gemini continuing to push the boundaries of size and capability. This graph illustrates how LLMs have been growing:
Figure 1.5: LLMs from BERT to GPT-4 – size (number of parameters), and licenses. For proprietary models, parameter sizes are often estimates
In examining the historical progression depicted in the graph, it is evident that LLMs have consistently increased in size, as indicated by the growing number of parameters. This trend aligns with a broader pattern observed in machine learning, where enhancing model performance often involves expanding model size. A paper from 2020 from OpenAI by Kaplan et al. (Scaling laws for neural language models, 2020) discussed scaling laws and the choice of parameters.
They identified a power-law relationship indicating that performance improvements in LLMs scale predictably with increases in dataset size and model size: to reduce the loss by a given factor, the dataset or the model must grow by a much larger factor determined by the power-law exponent. For optimal results, both elements should be scaled in tandem, thus preventing potential bottlenecks in model training and performance.
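The sketch below illustrates the shape of such a power-law relationship; the exponent and constant are purely illustrative and are not the values fitted in the paper:

# Illustrative power law: loss falls as model size grows, with diminishing returns
ALPHA = 0.08   # illustrative exponent, not the paper's fitted value
C = 10.0       # illustrative constant

def loss(n_params: float) -> float:
    return C * n_params ** -ALPHA

for n in [1e8, 1e9, 1e10, 1e11, 1e12]:
    print(f"{n:.0e} parameters -> loss {loss(n):.3f}")
# Each 10x increase in parameters shrinks the loss by the same *factor* (10 ** -ALPHA),
# so ever-larger multiplicative increases are needed for the same absolute gain.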
In addition to dataset and model size, it is essential to consider the training budget, which significantly influences the training process’s efficiency and outcomes. The training budget encompasses factors such as computational power and time allocated for model training. This metric serves as an alternative to measuring training in terms of epochs, allowing more flexibility and precision in determining the optimal point to cease training. Given the complexity and extensive training requirements of LLMs, it can be challenging to pinpoint the precise convergence point. Thus, the training budget plays a crucial role in efficiently managing resources while striving for the highest model performance.
Researchers at DeepMind (An empirical analysis of compute-optimal large language model training; Hoffmann et al., 2022) analyzed the training compute and dataset size of LLMs and concluded that, according to scaling laws, existing LLMs were undertrained for their compute budgets and dataset sizes. They predicted that large models would perform better if they were substantially smaller and trained for much longer, and – in fact – validated their prediction by comparing their 70-billion-parameter Chinchilla model on a benchmark to their Gopher model, which consists of 280 billion parameters.
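A rule of thumb often attributed to this analysis is that a compute-optimal model should see roughly 20 training tokens per parameter; treating that ratio as an approximation, the arithmetic for a Chinchilla-sized model looks like this:

TOKENS_PER_PARAM = 20  # rule-of-thumb ratio commonly attributed to the Chinchilla analysis

def compute_optimal_tokens(n_params: float) -> float:
    return TOKENS_PER_PARAM * n_params

# A 70-billion-parameter model (Chinchilla's size) would call for roughly 1.4 trillion tokens
print(f"{compute_optimal_tokens(70e9) / 1e12:.1f} trillion tokens")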
However, more recently, a team at Microsoft Research has challenged these conclusions and surprised everyone (Textbooks Are All You Need; Gunasekar et al., June 2023), finding that small networks trained on high-quality datasets can give very competitive performance – their model phi-1-small only comprises 350 million parameters! We’ll discuss this model again in Chapter 6, Developing Software with Generative AI, and we’ll discuss the implications of scaling in Chapter 10, The Future of Generative Models.
We could see new scaling laws linking performance with data quality, and it will be instructive to observe whether model sizes for LLMs keep increasing at the same rate as they have. This is an important question since it determines if the development of LLMs will be firmly in the hands of large organizations. It could be that there’s a saturation of performance at a certain size, which only changes in the approach can overcome. We haven’t seen this leveling off yet, though.
Trained on 300 billion tokens, GPT-3 has 175 billion parameters, an unprecedented size for DL models at the time. GPT-4 is the most recent in the series, though its size and training details have not been published due to competitive and safety concerns. However, estimates of its size vary widely, from a few hundred billion parameters to the roughly 1.8 trillion (as a mixture of experts) mentioned earlier. Sam Altman, the CEO of OpenAI, has stated that the cost of training GPT-4 was more than $100 million.
ChatGPT, launched by OpenAI in November 2022, stands out as a conversational model developed on the foundation of earlier GPT models, notably GPT-3. It is specifically tailored for dialogue, employing a mix of role-playing scenarios by humans and examples to guide the model towards desired behaviors, significantly enhanced by the use of Reinforcement Learning from Human Feedback (RLHF). Instead
