Description

The integration of large language models (LLMs) into enterprise applications is transforming how businesses use AI to drive smarter decisions and efficient operations. LLMs in Enterprise is your practical guide to bringing these capabilities into real-world business contexts. It demystifies the complexities of LLM deployment and provides a structured approach for enhancing decision-making and operational efficiency with AI.
Starting with an introduction to the foundational concepts, the book swiftly moves on to hands-on applications focusing on real-world challenges and solutions. You’ll master data strategies and explore design patterns that streamline the optimization and deployment of LLMs in enterprise environments. From fine-tuning techniques to advanced inferencing patterns, the book equips you with a toolkit for solving complex challenges and driving AI-led innovation in business processes.
By the end of this book, you’ll have a solid grasp of key LLM design patterns and how to apply them to enhance the performance and scalability of your generative AI solutions.




LLMs in Enterprise

Design strategies, patterns, and best practices for large language model development

Ahmed Menshawy

Mahmoud Fahmy

LLMs in Enterprise

Copyright © 2025 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author nor Packt Publishing or its dealers and distributors will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Portfolio Director: Gebin George

Relationship Lead: Sonia Chauhan

Project Manager: Prajakta Naik

Content Engineer: Aditi Chatterjee

Technical Editor: Rahul Limbachiya

Copy Editor: Safis Editing

Indexer: Hemangini Bari

Proofreader: Aditi Chatterjee

Production Designer: Ajay Patule

Growth Lead: Nimisha Dua

First published: September 2025

Production reference: 1270825

Published by Packt Publishing Ltd.

Grosvenor House

11 St Paul’s Square

Birmingham

B3 1RB, UK.

ISBN 978-1-83620-307-0

www.packtpub.com

To my parents, my wife, Sara, and our kids, Soma, Dawud, Maryam, and Reem, thank you for your patience and support.

– Ahmed

To my wife, Fatma, for her love, patience, and support, and to my daughter, Amina, the light of my life. In loving memory of my father, who continues to inspire me.

– Mahmoud

Contributors

About the authors

Ahmed Menshawy is the Vice President of AI Engineering at Mastercard. He leads the AI Engineering team, driving the development and operationalization of AI products and addressing a broad range of challenges and technical debt in ML pipeline deployment. He also leads a team dedicated to creating several AI accelerators and capabilities, including serving engines and feature stores, aimed at enhancing various aspects of AI engineering.

Mahmoud Fahmy is a Lead Machine Learning Engineer at Mastercard, specializing in the development and operationalization of AI products. His primary focus is on optimizing machine learning pipelines and navigating the intricate challenges of deploying models effectively for end customers.

About the reviewer

Advitya Gemawat is an ML Engineer at Microsoft, specializing in scalable machine learning systems and Responsible AI (RAI). He has authored publications, holds patents, and received awards from leading venues such as VLDB, ACM SIGMOD, and CIDR. At Microsoft, Advitya has worked with Azure Edge & Platform, Gray Systems Lab, and Windows, building ML and LLM services to enhance developer productivity. He also developed Azure ML’s RAI tooling for computer vision models and Azure OpenAI Evaluations, all of which were released at Microsoft Build (2023–2025). Previously, at VMware, he expanded deep learning features in Apache MADlib. He was a technical reviewer of the Amazon bestseller Ace the Data Science Interview and was recognized as a “25 under 25: Top Data Science Contributor & Thought Leader.” He is also a keynote speaker at technology panels and podcasts.

Subscribe for a free eBook

New frameworks, evolving architectures, research drops, production breakdowns—AI_Distilled filters the noise into a weekly briefing for engineers and researchers working hands-on with LLMs and GenAI systems. Subscribe now and receive a free eBook, along with weekly insights that help you stay focused and informed.

Subscribe at https://packt.link/8Oz6Y or scan the QR code below.

Share Your Thoughts

Once you’ve read LLMs in Enterprise, we’d love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.

Your review is important to us and the tech community and will help us make sure we’re delivering excellent quality content.

Join our Discord and Reddit space

You’re not the only one navigating fragmented tools, constant updates, and unclear best practices. Join a growing community of professionals exchanging insights that don’t make it into documentation.

Stay informed with updates, discussions, and behind-the-scenes insights from our authors. Join our Discord space at https://packt.link/z8ivB or scan the QR code below:

Connect with peers, share ideas, and discuss real-world GenAI challenges. Follow us on Reddit at https://packt.link/0rExL or scan the QR code below:

Your Book Comes with Exclusive Perks – Here’s How to Unlock Them

Unlock this book’s exclusive benefits now

Scan this QR code or go to packtpub.com/unlock, then search this book by name. Ensure it’s the correct edition.

Note: Keep your purchase invoice ready before you start.

Enhanced reading experience with our Next-gen Reader:

Multi-device progress sync: Learn from any device with seamless progress sync.

Highlighting and notetaking: Turn your reading into lasting knowledge.

Bookmarking: Revisit your most important learnings anytime.

Dark mode: Focus with minimal eye strain by switching to dark or sepia mode.

Learn smarter using our AI assistant (Beta):

Summarize it: Summarize key sections or an entire chapter.

AI code explainers: In the next-gen Packt Reader, click the Explain button above each code block for AI-powered code explanations.

Note: The AI assistant is part of next-gen Packt Reader and is still in beta.

Learn anytime, anywhere:

Access your content offline with DRM-free PDF and ePub versions—compatible with your favorite e-readers.

Unlock Your Book’s Exclusive Benefits

Your copy of this book comes with the following exclusive benefits:

Next-gen Packt Reader

AI assistant (beta)

DRM-free PDF/ePub downloads

Use the following guide to unlock them if you haven’t already. The process takes just a few minutes and needs to be done only once.

How to unlock these benefits in three easy steps

Step 1

Keep your purchase invoice for this book ready, as you’ll need it in Step 3. If you received a physical invoice, scan it on your phone and have it ready as either a PDF, JPG, or PNG.

For more help on finding your invoice, visit https://www.packtpub.com/unlock-benefits/help.

Note: Did you buy this book directly from Packt? You don’t need an invoice. After completing Step 2, you can jump straight to your exclusive content.

Step 2

Scan this QR code or go to packtpub.com/unlock.

On the page that opens (which will look similar to Figure 0.1 if you’re on desktop), search for this book by name. Make sure you select the correct edition.

Figure 0.1: Packt unlock landing page on desktop

Step 3

Sign in to your Packt account or create a new one for free. Once you’re logged in, upload your invoice. It can be in PDF, PNG, or JPG format and must be no larger than 10 MB. Follow the rest of the instructions on the screen to complete the process.

Need help?

If you get stuck and need help, visit https://www.packtpub.com/unlock-benefits/help for a detailed FAQ on how to find your invoices and more. The following QR code will take you to the help page directly:

Note: If you are still facing issues, reach out to [email protected].

Part 1

Background and Foundational Concepts

In Part 1 of this book, we build a solid foundation by introducing the core concepts of large language models (LLMs) and their strategic role in the enterprise. We explore how these models are transforming business processes and identify the key challenges they present. This part makes a strong case for mastering LLM design patterns to ensure scalability, security, and success.

This part contains the following chapters:

Chapter 1, Introduction to Large Language Models

Chapter 2, LLMs in Enterprise: Applications, Challenges, and Design Patterns

Chapter 3, Advanced Fine-Tuning Techniques and Strategies for Large Language Models

Chapter 4, Retrieval-Augmented Generation Pattern

Chapter 5, Customizing Contextual LLMs

1

Introduction to Large Language Models

Artificial intelligence (AI) refers to computer systems designed to augment human intelligence, providing tools that enhance productivity by automating complex tasks, analyzing vast amounts of data, and assisting with decision-making processes. Large language models (LLMs) are advanced AI applications capable of understanding and generating human-like text. These models function based on the principles of machine learning, where they process and transform vast datasets to learn the nuances of human language. A key feature of LLMs is their ability to generate coherent, natural-sounding outputs, making them an essential tool for building applications ranging from automated customer support to content generation and beyond.

LLMs are a subset of models in the field of natural language processing (NLP), which is itself a critical area of AI. The field of NLP is all about bridging the gap between human interaction and computer understanding, allowing a seamless interaction between humans and machines. LLMs are at the forefront of this field due to their ability to handle a broad array of tasks that require a deep understanding of language, such as answering questions, summarizing documents, translating text, and even creating original content.

The architecture most associated with modern LLMs is the transformer architecture, as shown in Figure 1.1 from the “Attention is All You Need” paper published in 2017. This architecture utilizes mechanisms called attention layers to weigh the relevance of all parts of the input data differently, which is a significant departure from previous sequence-based models that processed inputs in order.

This allows LLMs to be more context-aware and responsive in conversation-like scenarios.

Figure 1.1: The transformer model architecture. Image credit: 1706.03762 (arxiv.org)

The main purpose of this chapter is to dive into the rapidly changing world of LLMs. We will explore the historical development of these models, tracing their origins from basic statistical methods to the sophisticated systems we see today. This journey will highlight key technological advancements that have significantly influenced their evolution. Starting with the early days of simple algorithms that could count word frequencies and recognize basic patterns in text, we will see how these methods laid the foundation for more complex approaches.

As we progress, we will discuss the introduction of machine learning techniques that allow computers to learn from data and improve their text predictions. Finally, we will delve into the breakthrough moments that led to the creation of modern LLMs, such as the use of neural networks and the development of transformer architectures. By understanding this history, we can better appreciate how far LLMs have come and the potential they hold for the future. It also lays the foundation for everything you will learn throughout the rest of this book.

By the end of this chapter, you should have a clear understanding of:

The historical context and technological progression of language models (LMs)

The common recipe for training an LLM assistant like ChatGPT and its different stages

The current generative capabilities and limitations of these models

Let’s begin this chapter by exploring the historical context and evolution of LMs, particularly addressing the common misconception that these models are a recent innovation invented exclusively by OpenAI.

Historical context and evolution of language models

There are several misconceptions surrounding LMs, notably the belief that they were invented by OpenAI. However, the idea of LMs is not just a few years old; it is several decades old. As illustrated in Figure 1.2, the concept behind some LMs is quite intuitive; given an input sequence, the task of the model is to predict the next token:

Figure 1.2: LMs and prediction of the next token given the previous words (context)

To truly appreciate the sophistication of modern LMs, it’s essential to explore the historical evolution and the diverse range of disciplines from which they draw inspiration, all the way up to the recent transformative developments we are currently witnessing.

Early developments

The origins of LMs can be traced back several decades, originating in the foundational work on statistical models for NLP. Early LMs primarily utilized basic statistical methods, such as n-gram models. These models were simple yet groundbreaking, providing the basis for more complex systems.

In the 1950s and 1960s, the focus was on developing algorithms that could perform tasks like automatic translation between languages and information retrieval, which are inherently based on processing and understanding language. These early efforts laid the groundwork for subsequent advancements in computational linguistics, leading to the first wave of rule-based systems in the 1970s and 1980s. These systems attempted to encode the grammar and syntax rules of languages into software, aiming for a more structured approach to language understanding.

Evolution over time

As datasets grew, fueled by the birth of the internet and the increased collection of data, the limitations of rule-based systems became apparent. These systems struggled with scalability, generalization, and flexibility, leading to a pivotal shift towards machine learning-based approaches in the 1990s and early 2000s. During this period, machine learning models such as decision trees and Hidden Markov Models (HMMs) started to dominate the field due to their ability to learn language patterns from data without explicit programming of grammar or syntax rules.

Although neural networks were recognized as a powerful tool, their practical application was initially limited by computational constraints. It wasn’t until the mid to late 2000s, when computational power significantly increased, that building larger and more complex neural networks became feasible. This computational advancement, combined with the growing availability of large datasets, enabled the development of neural networks with multiple layers, leading to the modern deep learning techniques that drive today’s sophisticated LLMs. These models offer greater adaptability and accuracy in language tasks, transforming the landscape of NLP.

The introduction of machine learning into language modeling culminated in the development of deep learning techniques in the 2010s, particularly with the advent of Recurrent Neural Networks (RNNs), Long Short-Term Memory networks (LSTMs), and Gated Recurrent Units (GRUs).

These architectures were better suited to handling sequences, such as sentences and paragraphs, because they could remember information for long periods, a critical requirement for understanding context in text. Figure 1.3 shows some of these sequence models and their architecture progression:

Figure 1.3: Evolution of different sequence models

Quick tip: Need to see a high-resolution version of this image? Open this book in the next-gen Packt Reader or view it in the PDF/ePub copy.

The next-gen Packt Reader and a free PDF/ePub copy of this book are included with your purchase. Scan the QR code OR visit packtpub.com/unlock, then use the search bar to find this book by name. Double-check the edition shown to make sure you get the right one.

As we mentioned in the previous sections, the real breakthrough came with the development of the transformer model in 2017, which revolutionized LMs with its use of self-attention mechanisms. Unlike earlier models, such as RNNs and LSTMs, which processed text sequentially and often struggled with long-range dependencies, transformers could process all words in a sentence simultaneously. This parallel processing capability enabled transformers to assess and prioritize the significance of various words within a sentence or document, regardless of their position. This innovation resulted in a more nuanced understanding and generation of text, allowing transformers to capture context and relationships between words more effectively. The self-attention mechanism also made it easier to train on large datasets and leverage parallel computing resources, leading to significant improvements in performance and scalability. This architecture underpins the current generation of LLMs, including OpenAI’s generative pre-trained transformers series, and represents a substantial advancement over previous models.

While generative pre-trained transformers (GPTs) are a type of LLM and a prominent framework for generative artificial intelligence, the two terms are not interchangeable: LLM is a broader term encompassing any large-scale neural network trained to understand and generate human language, whereas GPTs specifically refer to models based on the transformer architecture. GPTs are pre-trained on large datasets of unlabeled text and can generate novel, human-like content. Introduced by OpenAI in 2018, the GPT series has evolved through sequentially numbered models, each significantly more capable than the previous one due to increased size and training. These models serve as the foundation for task-specific GPT systems, including models fine-tuned to follow instructions, which power services like ChatGPT.

Computational advances and increasing data availability

As we explore the historical evolution of LMs, it’s crucial to acknowledge the significant role played by advancements in computational power and the expansion of available data. Over the past few decades, these two factors have been pivotal in enhancing the sophistication and capabilities of LMs. Let’s look at each in turn.

Advancements in computational power

The increase in computational power, particularly through the development of more powerful CPUs and GPUs, has allowed researchers and developers to train larger models with millions or even billions of parameters. These high-performance processors can perform the vast number of calculations needed for training deep learning models in a fraction of the time previously required. This has been essential for experimenting with complex architectures like deep neural networks and transformers, which require substantial computational resources to train effectively.

Availability of large datasets

Parallel to hardware improvements, the digital age has seen an exponential increase in the amount of data available. The internet has become a treasure trove of textual data, from books and articles to blogs and social media posts. This plethora of data provides the diverse and extensive datasets necessary for training LMs. By learning from a broad range of language use and contexts, models can better predict and generate human-like text, capturing nuances and variations in language that were previously difficult to achieve.

These computational and data resources have collectively enabled the development of more advanced LMs that are not only more accurate but also more contextually aware. This advancement supports a wide array of applications, from simple automated responses to complex dialogue systems capable of maintaining coherent and contextually appropriate conversations over extended interactions.

LLMs and transforming user interfaces into natural conversations

Before the era of LLMs, there was a significant issue with how users interacted with LMs, which was mainly that the user interface was not intuitive or user-friendly. Essentially, the way people could communicate with these models was limited.

What really changed the game with LLMs was the improvement of this user interface and the instruction dataset, as shown in Figure 1.4 (for clarity, the text boxes beneath the text Instructions fine-tuned on many tasks are examples of what might make up an instruction dataset). This transformation allowed everyday users to interact with AI-powered assistants in a way that feels natural, much like having a conversation with another human.

Figure 1.4: Using instruction data to fix the LLM interface

Here’s how this was achieved:

Intuitive prompts: The new approach involves prompting the model in specific, human-like ways. This means you can ask the model questions or give it commands in plain language, and it generates a text response to address the user query. This is like teaching the model to start a conversation based on a simple cue or question.

Instruction fine-tuning: This step involves adjusting the model based on specific instructions or corrections. Essentially, you help the model to understand tasks better by providing examples of what you expect. This doesn’t require technical knowledge; it’s like giving feedback to a person learning a new skill.

Simplified alignment: A method called reinforcement learning from human feedback (RLHF) is used to better align LLMs with human expectations. By using RLHF, input is gathered directly from human interactions. Labelers provide examples of desirable responses and rate the outputs generated by the LLMs based on these prompts. This feedback is then used to fine-tune the model, enhancing its ability to produce more helpful and appropriate responses in everyday interactions.

Given the improvements made in context understanding, fine-tuning, and alignment, AI assistants can now engage in conversations just like a human would. By using the context of the conversation and the fine-tuning processes, along with alignment techniques such as RLHF, the AI generates responses that are relevant and feel surprisingly human.

Having explored the evolution of LMs more widely and how recent developments have primarily focused on making them larger, more powerful, and improving user interaction, let’s explore the evolution of LLM architectures in the past few years.

Evolution of LLM architectures

The development of LM architectures has undergone a transformative journey, tracing its origins from simple word embeddings to sophisticated models capable of understanding and generating multimodal content. This progression is elegantly depicted in Figure 1.5 by the LLM Evolutionary Tree that starts from foundational models before 2018, such as FastText, GloVe, and Word2Vec, and extends to the latest advancements, like the LLaMA series and Google’s Bard.

Figure 1.5: A timeline of LLM development. Image credit: https://github.com/Mooler0410/LLMsPracticalGuide

Let’s look at this evolution in more detail. We’ll explore the various stages of this evolution, starting with embedding models and how they represent text as vector representations that preserve the semantic meaning of words and sentences. We will then discuss the rise of pre-trained models and their multimodal variants.

Early foundations – word embeddings

Initially, models like FastText, GloVe, and Word2Vec represented words as vectors in high-dimensional space, capturing semantic and syntactic similarities based on their co-occurrence in large text corpora. These embeddings provided a static representation of words, serving as the backbone for many early NLP applications.

Breakthrough with transformers

The introduction of the transformer architecture in 2017 marked a significant shift in LM design. Unlike their predecessors, transformers utilize a mechanism known as self-attention to weigh the influence of different words within a sentence, regardless of their position. This architecture allowed models to capture complex word relationships and dependencies, improving their ability to understand context and meaning significantly.
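
To make the mechanism concrete, the following minimal sketch (our own PyTorch illustration, not code from the original paper) shows single-head scaled dot-product self-attention, where each token's representation is updated as a weighted mix of every token in the sequence:

import torch
import torch.nn.functional as F

def self_attention(x: torch.Tensor) -> torch.Tensor:
    """Minimal single-head self-attention: every position attends to every other position."""
    d = x.size(-1)
    # In a real transformer, queries, keys, and values come from separate learned
    # projections; here we reuse the input directly to keep the sketch short.
    q, k, v = x, x, x
    scores = q @ k.transpose(-2, -1) / d ** 0.5   # pairwise relevance between tokens
    weights = F.softmax(scores, dim=-1)           # attention weights sum to 1 per token
    return weights @ v                            # weighted mix of all token representations

# Toy "sentence" of 4 tokens, each represented by an 8-dimensional vector
tokens = torch.randn(4, 8)
print(self_attention(tokens).shape)  # torch.Size([4, 8])

Real models run many such heads in parallel and stack them across layers, but this weighting step is what lets every word attend to every other word regardless of position.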

The rise of pre-trained models

Building on the transformer architecture, pre-trained models like OpenAI’s GPT series and BERT from Google revolutionized NLP by learning general language patterns from vast amounts of text. These models can then be fine-tuned for specific tasks, achieving state-of-the-art results in areas such as summarization, question answering, and language translation.

Multimodality and beyond

The latest evolution in LLM architectures involves the integration of multimodal capabilities, as shown in Figure 1.6. Models are no longer limited to processing text; they can now understand and generate information across various forms, such as images, audio, and video.

For instance, DALL-E, developed by OpenAI, extends the GPT-3 architecture to generate images from textual descriptions, showcasing the creative potential of LLMs.

Figure 1.6: Multimodality

Mixture of experts – revolutionizing language model architectures

The concept of the Mixture of Experts (MoE) has emerged as a significant breakthrough in the field of LM architectures, particularly highlighted by its application in high-profile models like Mistral AI’s Mixtral 8x7B. Let’s look into what exactly MoE is and how it works.

Core concepts of MoE

MoEs represent a paradigm shift in neural network architecture by introducing sparsity and specialized processing. This model architecture optimizes computational resources by activating only relevant parts of the network, known as “experts,” depending on the input data. Each expert specializes in different segments of the data, much like teachers who specialize in specific subjects.

The building blocks of MoEs

The fundamental elements of an MoE include:

Sparse MoE layers: These layers replace traditional dense feedforward networks and contain a set number of experts.

Gate network or router: This determines which input tokens are processed by which experts, optimizing the model’s performance by directing tasks to the most qualified neural network segments.

This structure enhances the efficiency of the model and significantly speeds up training and inference processes compared to denser models with similar parameters.
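
A simplified PyTorch sketch of these two building blocks is shown below (our own illustration of the idea, not the implementation of Mixtral or any other released model): a router scores the experts for each token, and only the top-k experts are actually evaluated.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    """Illustrative sparse MoE layer: a gate network routes each token to its top-k experts."""
    def __init__(self, d_model: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(num_experts)
        ])
        self.router = nn.Linear(d_model, num_experts)   # the gate network
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (num_tokens, d_model)
        gate_logits = self.router(x)                                   # score every expert per token
        weights, expert_ids = torch.topk(gate_logits, self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)                           # normalize the chosen experts' scores
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = expert_ids[:, slot] == e                        # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

layer = SparseMoELayer(d_model=16)
print(layer(torch.randn(5, 16)).shape)  # torch.Size([5, 16])

Because only top_k experts run for any given token, the layer holds far more parameters than it uses per token, which is the source of the efficiency gains described above.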

Historical context and development

The concept of MoEs isn’t new and dates back to the 1991 paper by Jacobs et al., “Adaptive Mixtures of Local Experts.” Over the years, developments in this field have evolved from simple ensemble techniques to complex, hierarchical structures capable of handling extensive and varied datasets effectively.

Practical applications and future directions

Today, MoEs are integral to the training of some of the most widely used LLMs, offering a scalable solution that can handle increasingly complex tasks. They are also being explored in fields beyond NLP, such as computer vision.

MoEs mark a significant step towards more dynamic, efficient, and powerful machine learning models. As we continue to push the boundaries of what AI can achieve, MoEs play a pivotal role in making AI more accessible and sustainable, paving the way for future innovations that could transform every sector of society.

Now that we’ve observed the rapid progression and evolution within the LLM space, along with the vast number of LLMs released in this short period (as illustrated in the LLM evolutionary tree above), let’s explore the common training recipe used to train most GPT assistants like ChatGPT. We’ll examine how they progressed through the various stages of this training recipe to become deployable assistants with an enhanced interface for interaction. This development allows natural, template-free interactions without requiring complex commands to perform specific tasks with the LLM.

GPT assistant training recipe

Before diving into the specifics of how GPT assistants like ChatGPT are developed, it’s essential to understand the foundational elements and methodologies involved in training these advanced LMs. This is because many of the steps involved here are mirrored in the later fine-tuning steps, so understanding these steps can help you gain clarity on how you might better prepare your business data for LLM integration. The process includes several stages, each contributing to the model’s ability to comprehend and generate human-like text.

Figure 1.7 outlines the standard training recipe used to develop a GPT assistant, such as ChatGPT. This process, divided into four different stages, evolves the transformer neural network into an advanced AI capable of generating profound human-like text. Understanding the process of training such models is crucial for effectively understanding the type of data used in each stage, as well as what it might take to fine-tune such models with your domain-specific data.

Initially, these models begin as basic foundational models capable of completing text. However, through a series of additional training stages, they evolve into highly capable assistants that can generate helpful and appropriate human-like text. This evolution involves several key stages, starting with the creation of a base model using internet-scale data, refining it through supervised fine-tuning, enhancing it further with reward modeling, and finally optimizing it via reinforcement learning. Each stage is designed to improve the model’s performance and adaptability to real-world tasks.

Figure 1.7: Training stages of GPT assistants

Let’s start with the first and most computationally intensive stage, which is for building the base model from internet scale data.

Building the base model

The first stage in the training of LLMs such as GPTs is the creation of a robust base model. This foundational phase is the most computationally intensive and resource-demanding part of the model’s development. Here, we’ll break down this stage into its critical components and discuss each in detail.

Data collection and assembly

The journey begins with gathering an immense corpus of text data. For LLMs like GPT-3 and its successors, as well as the Llama series, this typically involves compiling datasets from diverse sources such as CommonCrawl, Wikipedia, books, and more specialized collections like GitHub or Stack Exchange archives. This varied dataset ensures that the model has exposure to a wide range of language use cases and domains.

Figure 1.8 shows the strategic composition of datasets aimed at developing a model with a comprehensive linguistic understanding. By training on such a diverse set of texts, the LLM is well equipped to handle a variety of tasks, from answering questions to generating creative content and interpreting technical documents.

Figure 1.8: Data used to train the Llama model (source: LLaMA: Open and Efficient Foundation Language Models)

Data preprocessing – tokenization

Tokenization is the process where raw text is split into smaller units called tokens. This is typically achieved using an algorithm like Byte Pair Encoding (BPE), which iteratively combines the most common pairs of characters or sub-words until it achieves a certain vocabulary size. This method ensures that common words or phrases are kept intact while less common ones are broken down into smaller units, optimizing the model’s ability to process and understand a wide range of texts.

After tokenization, each token is assigned a unique integer. This step converts the textual data into a sequence of integers, making it suitable for processing by neural network models, which require numerical input. This mapping is direct: each distinct token corresponds to a unique number in a predefined list, forming the model’s vocabulary.

Figure 1.9 illustrates this two-step tokenization phase:

Figure 1.9: Tokenization using OpenAI’s tokenization tool
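
If you want to reproduce this two-step process yourself, the open source tiktoken library is one convenient option. The snippet below assumes the cl100k_base encoding; the exact token IDs you see will differ for other tokenizers and encodings.

# Requires: pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "Large language models learn from tokens."
token_ids = enc.encode(text)                  # step 1 + 2: split into tokens, map to integers
print(token_ids)                              # a list of integer IDs
print([enc.decode([t]) for t in token_ids])   # the individual token strings
print(enc.decode(token_ids) == text)          # decoding round-trips back to the original: True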

Model configuration

Setting the right hyperparameters is crucial for training a successful base model. Hyperparameters are configuration settings used to control the training process of the model and can significantly impact its performance. Hyperparameters include:

Vocabulary size: This refers to the number of unique tokens (words, subwords, or characters) that the model can recognize. Typically, the vocabulary size is in the range of tens of thousands of tokens. A larger vocabulary allows the model to understand and generate a wider variety of text but also increases the computational complexity.

Context length: This is the length of the text sequence the model considers when making predictions. Modern LLMs handle sequences ranging from 2,000 to even 1,000,000 tokens long. For example, Google’s Gemini 1.5 Pro is the first LLM released with a 1,000,000-token context window. Longer context lengths enable the model to capture more context and dependencies in the text, which can improve the quality of the generated output but also require more memory and processing power.

Model architecture details are another consideration. They have several key components:

Number of transformer layers: This determines the depth of the model. More layers generally allow the model to learn more complex patterns but also increase training time and computational requirements.

Number of attention heads: Attention heads are part of the self-attention mechanism that enables the model to focus on different parts of the input sequence. More attention heads can improve the model’s ability to understand complex relationships in the data.

Size of each layer: This refers to the number of neurons in each layer. Larger layers can capture more information but require more computational resources.

By carefully tuning these hyperparameters, businesses can optimize the model’s performance for specific tasks and datasets. Proper selection of hyperparameters can lead to significant improvements in the model’s ability to understand and generate human-like text.
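
As a rough illustration, the hypothetical configuration below (typical orders of magnitude, not the settings of any specific released model) collects these hyperparameters in one place and uses a common back-of-the-envelope formula to estimate the resulting parameter count:

from dataclasses import dataclass

@dataclass
class ModelConfig:
    """Illustrative GPT-style hyperparameters; values are examples only."""
    vocab_size: int = 50_257      # number of unique tokens the model recognizes
    context_length: int = 4_096   # maximum sequence length considered per prediction
    num_layers: int = 24          # depth: number of stacked transformer blocks
    num_heads: int = 16           # attention heads per self-attention layer
    d_model: int = 2_048          # size (width) of each layer

config = ModelConfig()
# Rough estimate for the transformer blocks alone (attention ~4*d^2, MLP ~8*d^2 per layer),
# ignoring embeddings and biases
approx_params = 12 * config.num_layers * config.d_model ** 2
print(f"~{approx_params / 1e9:.1f}B parameters")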

Computational requirements

Training an LLM like GPT-3 or LLaMA involves a significant allocation of computational resources, usually entailing thousands of GPUs running continuously for weeks. This stage consumes the bulk of the computational budget, often costing several million dollars.

Training process

The actual training process involves feeding batches of tokenized text into the model and adjusting the model’s parameters based on its prediction accuracy. The model learns to generate the next token in the sequence by understanding the context provided by the tokens appearing before it in the same row. This training is iterative, with the model’s predictions becoming progressively more accurate as it processes more data.
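
The stripped-down sketch below shows the heart of this objective, next-token prediction with a cross-entropy loss. A trivial stand-in module takes the place of the full transformer, and the sizes are hypothetical:

import torch
import torch.nn.functional as F

batch_size, context_length, vocab_size = 5, 10, 1000
token_ids = torch.randint(0, vocab_size, (batch_size, context_length))

# Stand-in for the transformer: any module that maps token IDs to next-token logits
model = torch.nn.Sequential(
    torch.nn.Embedding(vocab_size, 64),
    torch.nn.Linear(64, vocab_size),
)

# Inputs are all tokens except the last; targets are the same sequence shifted by one,
# so at every position the model must predict the token that comes next
inputs, targets = token_ids[:, :-1], token_ids[:, 1:]
logits = model(inputs)                                              # (batch, seq, vocab)
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()                                                     # gradients drive the parameter updates
print(loss.item())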

Building base model recap

Think of the pretraining process as teaching a new language to someone by showing them lots of example sentences. Now that we’ve seen this in some detail, here’s a simple way to remember what’s happening during this phase:

Breaking down text into pieces: First, we take large amounts of written text (like books, articles, etc.) and break them down into smaller pieces, which we call “tokens.” These tokens are like individual words or parts of words.

Vocabulary size: Imagine that each token is a word in a dictionary. In our model’s training, we might have a dictionary (vocabulary) of 50,257 different words or word pieces. This number represents all the possible tokens the model can use to understand and generate language.

Organizing these pieces: We then organize these tokens into batches, like sorting them into different trays where each tray contains a specific number of tokens arranged in a particular order. Instead of processing individual tokens one by one, we process groups of tokens together in batches. As shown in Table 1.1, we decided to process 5 rows at a time, with each row consisting of 10 tokens, which is our context length. This batching process allows more efficient computation and better utilization of resources during training.

Position:    1     2     3     4     5     6     7     8     9    10
Row 1:      20   305    45   100   856    34     2   901    99     1
Row 2:       5   421    32   900   401   310     2   702    98     1
Row 3:      80   209    76    11    31    64     2    52    55     1
Row 4:      90    55     7     2   801   305   201     2   450   901

Table 1.1: Training batch for building the base model

Feeding the model: These batches are then fed into a transformer neural network, which is designed to learn patterns in language. The system looks at each batch and tries to predict what word (or piece of a word) comes next based on the ones it’s currently looking at.

During the training process, the model learns in a supervised manner by predicting the next word in a sequence based solely on the preceding words. Each cell only sees cells in its row and only cells before it, which means it doesn’t have access to future words. To train the model, we mask certain words at the end of each row, making them the target outputs for the model to predict. The model’s predictions are then compared to these masked words, and the difference (or error) between the predicted and actual words is calculated. This difference, often referred to as loss, is minimized over multiple training iterations to improve the model’s accuracy. By continuously reducing this difference, the model learns to generate more accurate and coherent text.

Once the model is trained, it can be shown a sequence of words and then asked to produce the next word in the sequence, as shown in Figure 1.10. This predicted word is then injected back into the input sequence. The input is shifted by one word, so the word that was just predicted by the model now becomes part of the input used to predict the next word. This process is repeated, with the model continuing to predict the next word, shift the input, and use its own predictions as new inputs. The sequence generation continues until an end token is generated or the text limit is reached, such as 4,096 tokens.

This type of token generation, where each word is generated based on the preceding context (previous words), is characteristic of autoregressive generative models. In these models, each new token is produced by conditioning on the sequence of tokens generated so far, making them highly effective for tasks that require sequential prediction, such as text generation and language modeling.

Figure 1.10: Auto-regressive generative models
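
The minimal sketch below captures the loop shown in Figure 1.10: greedy autoregressive decoding with a hypothetical stand-in model, where each prediction is appended to the context before the next step. Production systems usually sample from the probability distribution rather than always taking the single most likely token.

import torch

def generate(model, prompt_ids, end_token, max_tokens=20):
    """Illustrative autoregressive decoding: each predicted token becomes part of the context."""
    ids = list(prompt_ids)
    for _ in range(max_tokens):
        context = torch.tensor(ids).unsqueeze(0)     # (1, current sequence length)
        logits = model(context)                      # (1, seq, vocab): scores for the next token
        next_id = int(torch.argmax(logits[0, -1]))   # greedy pick of the most likely token
        ids.append(next_id)                          # feed the prediction back into the input
        if next_id == end_token:                     # stop at the end token or the length limit
            break
    return ids

# Stand-in "model": any module mapping token IDs to logits over a toy vocabulary
vocab_size = 100
toy_model = torch.nn.Sequential(torch.nn.Embedding(vocab_size, 32), torch.nn.Linear(32, vocab_size))
print(generate(toy_model, prompt_ids=[5, 17, 42], end_token=0))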

Predicting the next token: During training, the model generates a probability distribution over its vocabulary for the next token based on the context it sees. For instance, if the model is looking at the token ‘dog’ and trying to predict what comes next, it calculates the likelihood of every possible token (from its dictionary of 50,257 tokens) being the next word.

In Table 1.2, the dark gray cell highlights a randomly selected position, and the light gray cells are the context the model uses to predict the target token that follows the selected cell in the sequence.

Position:    1     2     3     4     5     6     7     8     9    10
Row 1:      20   305    45   100   856    34     2   901    99     1
Row 2:       5   421    32   900   401   310     2   702    98     1
Row 3:      80   209    76    11    31    64     2    52    55     1
Row 4:      90    55     7     2   801   305   201     2   450   901

Table 1.2: Training batch with target and context highlighted

This batch will be fed to the transformer model, which will generate the next token, as shown in Figure 1.11:

Figure 1.11: Pre-training step of the base model

Learning from mistakes: As the transformer makes predictions, it checks if what it guessed is right or wrong. If it’s wrong, it adjusts itself to be more accurate next time. This adjustment is like tweaking its understanding bit by bit.

Repeating the process: This process repeats with many different batches of tokens, gradually helping the transformer get better at predicting. It’s like practicing a language over and over, starting from simple phrases to more complex sentences.

Getting smarter over time: Over time, and after seeing millions of examples, the transformer learns a robust way to use language. It becomes capable of understanding and generating text that makes sense, all by learning from the patterns it observed in the training phase.

Outcome – the pre-trained base model

After months of training, the result is a pre-trained base model capable of understanding and generating text based on the training it received. However, this model is generic and not yet specialized for particular tasks or styles of interaction. The next steps in the training recipe refine this base model into a more focused assistant. This is where the GPT base model becomes ChatGPT, passing through stages such as supervised fine-tuning and reinforcement learning, which we will explore in the following sections.

This initial stage lays the groundwork for all subsequent enhancements and is critical for ensuring the model’s broad understanding of language, which is essential for its effectiveness in more specialized tasks later on. By understanding this phase deeply, developers can better appreciate the complexities involved in creating LLMs that are both powerful and versatile.

Supervised fine-tuning stage

The second major stage in the training recipe for GPT assistants is supervised fine-tuning (SFT). After establishing a robust base model through extensive pre-training, the SFT stage refines this model to produce outputs that are specifically tailored to perform well on predefined tasks or respond appropriately in assistant-like interactions.

The primary goal of this stage is to transition from a general-purpose LM, capable of understanding and generating language on a broad scale, to a specialized model that can understand and respond to specific prompts or queries effectively. This transition involves training the model on a curated dataset that represents the kinds of interactions it will handle in deployment. Let’s look at the steps involved in building the SFT model.

Data collection – high-quality, task-specific

Unlike the data used for pre-training, which is vast and varied, the data for SFT is much more focused and of higher quality. It typically consists of pairs of prompts and ideal responses. These datasets are usually smaller but crafted with precision, often involving human contractors who curate and label the data meticulously to ensure relevance and accuracy. The quality and specificity of this data are crucial, as they directly influence the model’s performance on its intended tasks.
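
To make the data format concrete, here is a purely hypothetical illustration of what such prompt and ideal-response pairs might look like; the records are invented and shown only to convey the structure:

# Hypothetical SFT records; real datasets are curated and reviewed by human labelers
sft_examples = [
    {
        "prompt": "Summarize the key terms of the attached supplier contract in plain English.",
        "response": "The contract runs for 24 months, renews automatically unless either party "
                    "gives 60 days' notice, and caps annual price increases at 3%.",
    },
    {
        "prompt": "Draft a polite reply to a customer asking why their refund is delayed.",
        "response": "Thank you for your patience. Your refund was approved on Monday and usually "
                    "takes 5 to 7 business days to appear on your statement.",
    },
]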

Training process – refinement and specialization

During SFT, the model’s existing knowledge and capabilities are honed and expanded to include the ability to handle specific types of queries and generate appropriate responses. This process involves:

Adjusting to new inputs: The model learns to recognize and prioritize information that’s relevant to the tasks it will perform

Optimizing responses: Through iterative training, the model adjusts its parameters to produce responses that closely match the provided ideal answers

Model adjustments – fine-tuning hyperparameters

Fine-tuning involves adjusting several hyperparameters, such as learning rates or the number of training epochs, to optimize the training process without overfitting. The adjustments are crucial as they need to be carefully managed to maintain the general language understanding acquired during pre-training while adapting the model to perform well on specialized tasks.
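
As an illustration only, the hypothetical settings below show the kind of hyperparameters typically adjusted for SFT, expressed with Hugging Face’s TrainingArguments as one common way to configure a fine-tuning run; actual values depend on the model, dataset size, and compute budget.

# Requires: pip install transformers accelerate  (values are illustrative, not recommendations)
from transformers import TrainingArguments

sft_args = TrainingArguments(
    output_dir="sft-checkpoints",
    learning_rate=2e-5,                  # small learning rate to avoid erasing pre-trained knowledge
    num_train_epochs=3,                  # a few passes over the curated dataset
    per_device_train_batch_size=8,
    warmup_ratio=0.03,                   # gentle ramp-up of the learning rate
    weight_decay=0.01,                   # mild regularization against overfitting
)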

Output – the SFT model

The outcome of this stage is an SFT model, which is an LLM that not only understands a wide range of language inputs but can also engage in specific interactions with high accuracy and relevance. This model is better suited to tasks such as customer support, content creation, or even complex reasoning in a narrower domain compared to the base model.

Reward modeling stage

The third stage in the training of GPT assistants involves reward modeling. After the supervised fine-tuning has tailored the model’s initial responses, the reward modeling stage provides a framework for refining these responses based on their desirability or utility. This stage is crucial for aligning the model’s outputs with human values and preferences, essentially teaching the model what is considered a “good” or “bad” response in various contexts.

The primary objective of reward modeling is to develop a model that can evaluate the quality of its own outputs. This evaluation isn’t based just on linguistic correctness or fluency but also on how well the responses meet the criteria of being useful, accurate, and aligned with ethical guidelines. This process involves creating a reward model that assigns scores to responses based on their perceived value. Now, let’s break down the steps involved in building the reward model.

Data collection for comparison

Unlike earlier stages, which may use individual responses, reward modeling often involves comparisons between multiple possible responses to the same prompt.

Data for this stage is gathered by presenting the same prompt to the model multiple times, each time generating different responses, which are then evaluated by human reviewers.

Human judgment and scoring

Human reviewers play a crucial role at this stage. They are presented with sets of responses and asked to rank them based on criteria such as relevance, coherence, and appropriateness.

These rankings are used to teach the model which types of responses are preferred, effectively “training” the reward model.

Integration with the neural network

A special component, often a smaller neural network, is trained to predict the reward scores for each response generated by the GPT model. This reward predictor is trained using the rankings provided by human reviewers.

The training involves adjusting the reward predictor to forecast higher scores for responses deemed better by humans and lower scores for less desirable ones. The amount of training required is very subjective; usually, you will start with a pre-chosen amount of data, assess, and then decide if more training is required.
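
One widely used formulation, popularized by work such as InstructGPT, turns the human rankings into pairwise comparisons and trains the reward predictor to score the preferred response higher than the rejected one, as in this minimal sketch:

import torch
import torch.nn.functional as F

def reward_ranking_loss(score_preferred, score_rejected):
    """Pairwise ranking loss: pushes the preferred response's score above the rejected one's."""
    return -F.logsigmoid(score_preferred - score_rejected).mean()

# Toy scores a hypothetical reward model assigned to two responses per prompt
preferred = torch.tensor([1.3, 0.2])
rejected = torch.tensor([0.4, 0.9])
print(reward_ranking_loss(preferred, rejected))  # loss falls as preferred scores rise above rejected ones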

Outcome – a trained reward model

The reward model does not generate responses itself but evaluates the quality of responses generated by the main LM. It acts as a judge, guiding the main model’s learning process by providing feedback on what kinds of responses should be more likely in future interactions.

Reinforcement learning stage

The fourth stage in the training recipe for GPT assistants is reinforcement learning (RL), which utilizes the foundation built by the earlier stages: pretraining, supervised fine-tuning, and reward modeling. This stage is pivotal for refining the model to produce high-quality, contextually appropriate responses aligned with specific performance metrics.

The main goal of the reinforcement learning stage is to fine-tune the LM’s responses based on a reward system developed in the previous stage. This is done to maximize the probability of the model producing responses that are considered high-quality according to the established reward criteria.

Integration of the reward model

The reward model, trained in the previous stage, assesses the quality of responses generated by the LM. These assessments are used to guide the reinforcement learning process.

Essentially, the reward model provides a “score” or feedback for each response, indicating how well it aligns with the desired outcome.

Training process

During reinforcement learning, the LM generates multiple responses to the same prompt.

Each response is evaluated by the reward model, which assigns a score based on the pre-established criteria (e.g., relevance, coherence, safety).

The LM is then updated to increase the likelihood of generating responses that receive higher scores in the future.

Optimization techniques

Common techniques used in this stage, such as proximal policy optimization (PPO), adjust the model’s parameters based on the reward model’s feedback to improve response quality. By iteratively refining the model, these techniques ensure that the model becomes more effective at generating desired outputs.
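
The toy loop below is a deliberately simplified, REINFORCE-style illustration of this idea: it nudges the policy to make responses with higher reward scores more likely. It is our own sketch, not the PPO-style optimization used in production pipelines, which operates over full response sequences with additional safeguards.

import torch

# Stand-ins: a tiny "policy" and a dummy reward model (the real one comes from the previous stage)
vocab_size = 1000
policy = torch.nn.Sequential(torch.nn.Embedding(vocab_size, 64), torch.nn.Linear(64, vocab_size))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)

def reward_model(prompt_ids, response_id):
    return torch.rand(())  # placeholder score; the trained reward model would be used here

prompt_ids = torch.randint(0, vocab_size, (1, 8))
for _ in range(10):
    logits = policy(prompt_ids)[0, -1]                  # next-token distribution for this prompt
    dist = torch.distributions.Categorical(logits=logits)
    response_id = dist.sample()                         # the policy "responds"
    reward = reward_model(prompt_ids, response_id)      # the reward model scores the response
    loss = -reward * dist.log_prob(response_id)         # raise the probability of high-reward outputs
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()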

Outcome – a reinforced learning model

The outcome of this stage is a model that not only understands the general structure of language (from pretraining) and can generate contextually appropriate responses (from supervised fine-tuning) but also excels in delivering responses that meet specific qualitative criteria. This model is typically more refined and aligned with user expectations and real-world applications.

Now that we’ve discussed the common training recipe used to train GPT assistants, including the varied data and computational requirements for each stage until we obtain a deployable instruct model for user interaction, it’s crucial to highlight some of the realities and myths surrounding LLMs and assess whether this transformative technology truly represents an iPhone moment for the AI industry.

Decoding the realities and myths of LLMs

LLMs like OpenAI’s GPT series have sparked widespread intrigue and debate across the tech world and beyond. While they are often seen as groundbreaking advancements, there are numerous misconceptions and exaggerated claims surrounding their capabilities and origins. This section aims to clarify these misunderstandings by addressing common myths and examining their real-world applications and limitations.

From their early statistical underpinnings to the sophisticated neural networks we see today, as you saw earlier in this chapter, the evolution of LMs has been a collaborative and incremental process, contrary to the notion that they suddenly emerged from a single innovator or institution.

We’ll start by discussing the critical insights of Ada Lovelace, which remain profoundly relevant in understanding the fundamental nature of these models, as well as the limitations that come with their impressive capabilities.

Ada Lovelace’s insights

Ada Lovelace (Figure 1.12), celebrated as the first computer programmer, provided early and profound insights into the nature of computing machines that are still relevant in today’s discussions about artificial intelligence and, specifically, LLMs. In her notes from 1843 on Charles Babbage’s Analytical Engine, Lovelace posited that the machine “has no pretensions to originate anything,” but can only do “whatever we know how to order it to perform.” This observation highlights a fundamental limitation of computational systems: their reliance on human input for their operations and the boundaries of their creativity.

Lovelace’s assertion is particularly pertinent when examining the capabilities and limitations of current LLMs. Despite their ability to generate text that can seem original and insightful, these models are fundamentally limited to manipulating and recombining existing information within the data they have been trained on. They do not possess the ability to create genuinely novel ideas or concepts beyond their training data’s scope. This characteristic aligns closely with Lovelace’s views, underscoring a critical distinction between human cognitive abilities and machine operations.

Figure 1.12: Ada Lovelace

Moreover, this understanding of machine limitations is crucial when evaluating the output of LLMs. For instance, while these models can produce content that appears new at a superficial level, their output is often an echo of patterns and biases present in their training material. This has important implications for how we deploy and interact with LLMs, especially in fields requiring creativity and critical thinking. It also brings to the fore the ethical considerations of using such models, particularly concerning the transparency of their derivations and the potential propagation of existing biases.

Failures in simple tasks

While LLMs like GPT-4 impress with their ability to generate human-like text, their performance on seemingly simple tasks often reveals significant limitations, as shown in Figure 1.13. These failures support Ada Lovelace’s argument that machines cannot originate things by themselves and illustrate the inherent limitations of current AI systems.

For example, LLMs can struggle with tasks requiring basic common sense or real-world knowledge that humans typically find trivial. A common failure mode is the generation of plausible-sounding but factually incorrect or nonsensical answers to simple questions, such as misunderstanding the physical properties of objects (e.g., “Can a mouse eat a whole car?” might receive a response that doesn’t immediately dismiss the impossibility). These errors stem from the models’ reliance on patterns in data rather than a true understanding of the world.

Figure 1.13: LLM failures in simple tasks

These examples underscore the challenge of developing AI systems that truly understand and interact with the world as humans do, pointing to a gap that remains in achieving truly intelligent systems.

Limitations compared to human intelligence

LMs, especially the auto-regressive type used in many AI systems, are powerful tools that can predict the next word in a sequence of text. However, these models have several important limitations that affect how they can be used and the types of tasks they can perform effectively. Let’s look at some of them now:

Increasing errors over time: Imagine you’re trying to predict the next word in a sentence, and each time you try, there’s a small chance you’ll get it wrong. As you keep predicting more words, these small chances of error add up, and the likelihood of making a mistake somewhere increases. This means that the longer the piece of text you want to generate, the higher the chance of errors creeping in. This is like trying to walk in a straight line while blindfolded; the further you go, the more likely you are to veer off course.

Fixed thinking process: When these AI models create text, they do so one word at a time and use a fixed amount of computing power for each word. This is like having only a few seconds to think about what word to say next in a conversation, no matter how complex the topic. If we want the model to “think” harder or more deeply about the next word, we can’t simply tell it to; we can only make it generate more words, which is a roundabout way of trying to get deeper thoughts from it. This fixed process limits the model’s ability to plan or think ahead.