Artificial intelligence has undergone rapid advancements, and Large Language Models (LLMs) are at the forefront of this revolution. This LLM book offers insights into designing, training, and deploying LLMs in real-world scenarios by leveraging MLOps best practices. The guide walks you through building an LLM-powered twin that’s cost-effective, scalable, and modular. It moves beyond isolated Jupyter notebooks, focusing on how to build production-grade end-to-end LLM systems.
Throughout this book, you will learn data engineering, supervised fine-tuning, and deployment. The hands-on approach to building the LLM Twin use case will help you implement MLOps components in your own projects. You will also explore cutting-edge advancements in the field, including inference optimization, preference alignment, and real-time data processing, making this a vital resource for those looking to apply LLMs in their projects.
By the end of this book, you will be proficient in deploying LLMs that solve practical problems while maintaining low-latency and high-availability inference capabilities. Whether you are new to artificial intelligence or an experienced practitioner, this book delivers guidance and practical techniques that will deepen your understanding of LLMs and sharpen your ability to implement them effectively.
Page count: 656
Publication year: 2024
Forewords
Contributors
Join our book’s Discord space
Preface
Who this book is for
What this book covers
To get the most out of this book
Get in touch
Making the Most Out of This Book – Get to Know Your Free Benefits
Understanding the LLM Twin Concept and Architecture
Understanding the LLM Twin concept
What is an LLM Twin?
Why building an LLM Twin matters
Why not use ChatGPT (or another similar chatbot)?
Planning the MVP of the LLM Twin product
What is an MVP?
Defining the LLM Twin MVP
Building ML systems with feature/training/inference pipelines
The problem with building ML systems
The issue with previous solutions
The solution – ML pipelines for ML systems
The feature pipeline
The training pipeline
The inference pipeline
Benefits of the FTI architecture
Designing the system architecture of the LLM Twin
Listing the technical details of the LLM Twin architecture
How to design the LLM Twin architecture using the FTI pipeline design
Data collection pipeline
Feature pipeline
Training pipeline
Inference pipeline
Final thoughts on the FTI design and the LLM Twin architecture
Summary
References
Tooling and Installation
Python ecosystem and project installation
Poetry: dependency and virtual environment management
Poe the Poet: task execution tool
MLOps and LLMOps tooling
Hugging Face: model registry
ZenML: orchestrator, artifacts, and metadata
Orchestrator
Artifacts and metadata
How to run and configure a ZenML pipeline
Comet ML: experiment tracker
Opik: prompt monitoring
Databases for storing unstructured and vector data
MongoDB: NoSQL database
Qdrant: vector database
Preparing for AWS
Setting up an AWS account, an access key, and the CLI
SageMaker: training and inference compute
Why AWS SageMaker?
Summary
References
Join our book’s Discord space
Data Engineering
Designing the LLM Twin’s data collection pipeline
Implementing the LLM Twin’s data collection pipeline
ZenML pipeline and steps
The dispatcher: How do you instantiate the right crawler?
The crawlers
Base classes
GitHubCrawler class
CustomArticleCrawler class
MediumCrawler class
The NoSQL data warehouse documents
The ORM and ODM software patterns
Implementing the ODM class
Data categories and user document classes
Gathering raw data into the data warehouse
Troubleshooting
Selenium issues
Import our backed-up data
Summary
References
RAG Feature Pipeline
Understanding RAG
Why use RAG?
Hallucinations
Old information
The vanilla RAG framework
Ingestion pipeline
Retrieval pipeline
Generation pipeline
What are embeddings?
Why embeddings are so powerful
How are embeddings created?
Applications of embeddings
More on vector DBs
How does a vector DB work?
Algorithms for creating the vector index
DB operations
An overview of advanced RAG
Pre-retrieval
Retrieval
Post-retrieval
Exploring the LLM Twin’s RAG feature pipeline architecture
The problem we are solving
The feature store
Where does the raw data come from?
Designing the architecture of the RAG feature pipeline
Batch pipelines
Batch versus streaming pipelines
Core steps
Change data capture: syncing the data warehouse and feature store
Why is the data stored in two snapshots?
Orchestration
Implementing the LLM Twin’s RAG feature pipeline
Settings
ZenML pipeline and steps
Querying the data warehouse
Cleaning the documents
Chunk and embed the cleaned documents
Loading the documents to the vector DB
Pydantic domain entities
OVM
The dispatcher layer
The handlers
The cleaning handlers
The chunking handlers
The embedding handlers
Summary
References
Join our book’s Discord space
Supervised Fine-Tuning
Creating an instruction dataset
General framework
Data quantity
Data curation
Rule-based filtering
Data deduplication
Data decontamination
Data quality evaluation
Data exploration
Data generation
Data augmentation
Creating our own instruction dataset
Exploring SFT and its techniques
When to fine-tune
Instruction dataset formats
Chat templates
Parameter-efficient fine-tuning techniques
Full fine-tuning
LoRA
QLoRA
Training parameters
Learning rate and scheduler
Batch size
Maximum length and packing
Number of epochs
Optimizers
Weight decay
Gradient checkpointing
Fine-tuning in practice
Summary
References
Fine-Tuning with Preference Alignment
Understanding preference datasets
Preference data
Data quantity
Data generation and evaluation
Generating preferences
Tips for data generation
Evaluating preferences
Creating our own preference dataset
Preference alignment
Reinforcement Learning from Human Feedback
Direct Preference Optimization
Implementing DPO
Summary
References
Join our book’s Discord space
Evaluating LLMs
Model evaluation
Comparing ML and LLM evaluation
General-purpose LLM evaluations
Domain-specific LLM evaluations
Task-specific LLM evaluations
RAG evaluation
Ragas
ARES
Evaluating TwinLlama-3.1-8B
Generating answers
Evaluating answers
Analyzing results
Summary
References
Inference Optimization
Model optimization strategies
KV cache
Continuous batching
Speculative decoding
Optimized attention mechanisms
Model parallelism
Data parallelism
Pipeline parallelism
Tensor parallelism
Combining approaches
Model quantization
Introduction to quantization
Quantization with GGUF and llama.cpp
Quantization with GPTQ and EXL2
Other quantization techniques
Summary
References
Join our book’s Discord space
RAG Inference Pipeline
Understanding the LLM Twin’s RAG inference pipeline
Exploring the LLM Twin’s advanced RAG techniques
Advanced RAG pre-retrieval optimizations: query expansion and self-querying
Query expansion
Self-querying
Advanced RAG retrieval optimization: filtered vector search
Advanced RAG post-retrieval optimization: reranking
Implementing the LLM Twin’s RAG inference pipeline
Implementing the retrieval module
Bringing everything together into the RAG inference pipeline
Summary
References
Inference Pipeline Deployment
Criteria for choosing deployment types
Throughput and latency
Data
Understanding inference deployment types
Online real-time inference
Asynchronous inference
Offline batch transform
Monolithic versus microservices architecture in model serving
Monolithic architecture
Microservices architecture
Choosing between monolithic and microservices architectures
Exploring the LLM Twin’s inference pipeline deployment strategy
The training versus the inference pipeline
Deploying the LLM Twin service
Implementing the LLM microservice using AWS SageMaker
What are Hugging Face’s DLCs?
Configuring SageMaker roles
Deploying the LLM Twin model to AWS SageMaker
Calling the AWS SageMaker Inference endpoint
Building the business microservice using FastAPI
Autoscaling capabilities to handle spikes in usage
Registering a scalable target
Creating a scalable policy
Minimum and maximum scaling limits
Cooldown period
Summary
References
Join our book’s Discord space
MLOps and LLMOps
The path to LLMOps: Understanding its roots in DevOps and MLOps
DevOps
The DevOps lifecycle
The core DevOps concepts
MLOps
MLOps core components
MLOps principles
ML vs. MLOps engineering
LLMOps
Human feedback
Guardrails
Prompt monitoring
Deploying the LLM Twin’s pipelines to the cloud
Understanding the infrastructure
Setting up MongoDB
Setting up Qdrant
Setting up the ZenML cloud
Containerize the code using Docker
Run the pipelines on AWS
Troubleshooting the ResourceLimitExceeded error after running a ZenML pipeline on SageMaker
Adding LLMOps to the LLM Twin
LLM Twin’s CI/CD pipeline flow
More on formatting errors
More on linting errors
Quick overview of GitHub Actions
The CI pipeline
GitHub Actions CI YAML file
The CD pipeline
Test out the CI/CD pipeline
The CT pipeline
Initial triggers
Trigger downstream pipelines
Prompt monitoring
Alerting
Summary
References
MLOps Principles
1. Automation or operationalization
2. Versioning
3. Experiment tracking
4. Testing
Test types
What do we test?
Test examples
5. Monitoring
Logs
Metrics
System metrics
Model metrics
Drifts
Monitoring vs. observability
Alerts
6. Reproducibility
Other Books You May Enjoy
Index
LLM Engineer’s Handbook
Master the art of engineering large language models from concept to production
Paul Iusztin
Maxime Labonne
LLM Engineer’s Handbook
Copyright © 2024 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Senior Publishing Product Manager: Gebin George
Acquisition Editor – Peer Reviews: Swaroop Singh
Project Editor: Amisha Vathare
Content Development Editor: Tanya D’cruz
Copy Editor: Safis Editing
Technical Editor: Karan Sonawane
Proofreader: Safis Editing
Indexer: Manju Arasan
Presentation Designer: Rajesh Shirsath
Developer Relations Marketing Executive: Anamika Singh
First published: October 2024
Production reference: 4070725
Published by Packt Publishing Ltd.
Grosvenor House
11 St Paul’s Square
Birmingham
B3 1RB, UK.
ISBN 978-1-83620-007-9
www.packt.com
As my co-founder at Hugging Face, Clement Delangue, and I often say, AI is becoming the default way of building technology.
Over the past 3 years, LLMs have already had a profound impact on technology, and they are bound to have an even greater impact in the coming 5 years. They will be embedded in more and more products and, I believe, at the center of any human activity based on knowledge or creativity.
For instance, coders are already leveraging LLMs and changing the way they work, focusing on higher-order thinking and tasks while collaborating with machines. Studio musicians rely on AI-powered tools to explore the musical creativity space faster. Lawyers are increasing their impact through retrieval-augmented generation (RAG) and large databases of case law.
At Hugging Face, we’ve always advocated for a future where not just one company or a small number of scientists control the AI models used by the rest of the population, but instead for a future where as many people as possible—from as many different backgrounds as possible—are capable of diving into how cutting-edge machine learning models actually work.
Maxime Labonne and Paul Iusztin have been instrumental in this movement to democratize LLMs by writing this book and making sure that as many people as possible can not only use them but also adapt them, fine-tune them, quantize them, and make them efficient enough to actually deploy in the real world.
Their work is essential, and I’m glad they are making this resource available to the community. This expands the convex hull of human knowledge.
Julien Chaumond
Co-founder and CTO, Hugging Face
As someone deeply immersed in the world of machine learning operations, I’m thrilled to endorse The LLM Engineer’s Handbook. This comprehensive guide arrives at a crucial time when the demand for LLM expertise is skyrocketing across industries.
What sets this book apart is its practical, end-to-end approach. By walking readers through the creation of an LLM Twin, it bridges the often daunting gap between theory and real-world application. From data engineering and model fine-tuning to advanced topics like RAG pipelines and inference optimization, the authors leave no stone unturned.
I’m particularly impressed by the emphasis on MLOps and LLMOps principles. As organizations increasingly rely on LLMs, understanding how to build scalable, reproducible, and robust systems is paramount. The inclusion of orchestration strategies and cloud integration showcases the authors’ commitment to equipping readers with truly production-ready skills.
Whether you’re a seasoned ML practitioner looking to specialize in LLMs or a software engineer aiming to break into this exciting field, this handbook provides the perfect blend of foundational knowledge and cutting-edge techniques. The clear explanations, practical examples, and focus on best practices make it an invaluable resource for anyone serious about mastering LLM engineering.
In an era where AI is reshaping industries at breakneck speed, The LLM Engineer’s Handbook stands out as an essential guide for navigating the complexities of large language models. It’s not just a book; it’s a roadmap to becoming a proficient LLM engineer in today’s AI-driven landscape.
Hamza Tahir
Co-founder and CTO, ZenML
The LLM Engineer’s Handbook serves as an invaluable resource for anyone seeking a hands-on understanding of LLMs. Through practical examples and a comprehensive exploration of the LLM Twin project, the author effectively demystifies the complexities of building and deploying production-level LLM applications.
One of the book’s standout features is its use of the LLM Twin project as a running example. This AI character, designed to emulate the writing style of a specific individual, provides a tangible illustration of how LLMs can be applied in real-world scenarios.
The author skillfully guides readers through the essential tools and technologies required for LLM development, including Hugging Face, ZenML, Comet, Opik, MongoDB, and Qdrant. Each tool is explained in detail, making it easy for readers to understand their functions and how they can be integrated into an LLM pipeline.
LLM Engineer’s Handbook also covers a wide range of topics related to LLM development, such as data collection, fine-tuning, evaluation, inference optimization, and MLOps. Notably, the chapters on supervised fine-tuning, preference alignment, and Retrieval Augmented Generation (RAG) provide in-depth insights into these critical aspects of LLM development.
A particular strength of this book lies in its focus on practical implementation. The author excels at providing concrete examples and guidance on how to optimize inference pipelines and deploy LLMs effectively. This makes the book a valuable resource for both researchers and practitioners.
This book is highly recommended for anyone interested in learning about LLMs and their practical applications. By providing a comprehensive overview of the tools, techniques, and best practices involved in LLM development, the authors have created a valuable resource that will undoubtedly be a reference for many LLM engineers.
Antonio Gulli
Senior Director, Google
Paul Iusztin is a senior ML and MLOps engineer with over seven years of experience building GenAI, Computer Vision and MLOps solutions. His latest contribution was at Metaphysic, where he served as one of their core engineers in taking large neural networks to production. He previously worked at CoreAI, Everseen, and Continental. He is the Founder of Decoding ML, an educational channel on production-grade ML that provides posts, articles, and open-source courses to help others build real-world ML systems.
Maxime Labonne is the Head of Post-Training at Liquid AI. He holds a PhD in ML from the Polytechnic Institute of Paris and is recognized as a Google Developer Expert in AI/ML. As an active blogger, he has made significant contributions to the open-source community, including the LLM Course on GitHub, tools such as LLM AutoEval, and several state-of-the-art models like NeuralDaredevil. He is the author of the best-selling book Hands-On Graph Neural Networks Using Python, published by Packt.
I want to thank my family and partner. Your unwavering support and patience made this book possible.
Rany ElHousieny is an AI solutions architect and AI engineering manager with over two decades of experience in AI, NLP, and ML. Throughout his career, he has focused on the development and deployment of AI models, authoring multiple articles on AI systems architecture and ethical AI deployment. He has led groundbreaking projects at companies like Microsoft, where he spearheaded advancements in NLP and the Language Understanding Intelligent Service (LUIS). Currently, he plays a pivotal role at Clearwater Analytics, driving innovation in GenAI and AI-driven financial and investment management solutions.
I would like to thank Clearwater Analytics for providing a supportive and learning environment that fosters growth and innovation. The vision of our leaders, always staying ahead with the latest technologies, has been a constant source of inspiration. Their commitment to AI advancements made my experience of reviewing this book insightful and enriching. Special thanks to my family for their ongoing encouragement throughout this journey.
Join our community’s Discord space for discussions with the authors and other readers:
https://packt.link/llmeng
Once you’ve read LLM Engineer’s Handbook, First Edition, we’d love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.
Your review is important to us and the tech community and will help us make sure we’re delivering excellent quality content.
Unlock exclusive free benefits that come with your purchase, thoughtfully crafted to supercharge your learning journey and help you learn without limits.
https://www.packtpub.com/unlock/9781836200079
Note: Have your purchase invoice ready before you begin.
Figure 1.1: Next-Gen Reader, AI Assistant (Beta), and Free PDF access
Enhanced reading experience with our Next-gen Reader:
Multi-device progress sync: Learn from any device with seamless progress sync.
Highlighting and Notetaking: Turn your reading into lasting knowledge.
Bookmarking: Revisit your most important learnings anytime.
Dark mode: Focus with minimal eye strain by switching to dark or sepia modes.
Learn smarter using our AI assistant (Beta):
Summarize it: Summarize key sections or an entire chapter.
AI code explainers: In Packt Reader, click the “Explain” button above each code block for AI-powered code explanations.
Note: AI Assistant is part of next-gen Packt Reader and is still in beta.
Learn anytime, anywhere:
Access your content offline with DRM-free PDF and ePub versions—compatible with your favorite e-readers.
Your copy of this book comes with the following exclusive benefits:
Next-gen Packt Reader
AI assistant (beta)
DRM-free PDF/ePub downloads
Use the following guide to unlock them if you haven’t already. The process takes just a few minutes and needs to be done only once.
Have your purchase invoice for this book ready, as you’ll need it in Step 3. If you received a physical invoice, scan it on your phone and have it ready as either a PDF, JPG, or PNG.
For more help on finding your invoice, visit https://www.packtpub.com/unlock-benefits/help.
Note: Bought this book directly from Packt? You don’t need an invoice. After completing Step 2, you can jump straight to your exclusive content.
Scan the following QR code or visit https://www.packtpub.com/unlock/9781836200079:
Sign in to your Packt account or create a new one for free. Once you’re logged in, upload your invoice. It can be in PDF, PNG, or JPG format and must be no larger than 10 MB. Follow the rest of the instructions on the screen to complete the process.
If you get stuck and need help, visit https://www.packtpub.com/unlock-benefits/help for a detailed FAQ on how to find your invoices and more. The following QR code will take you to the help page directly:
Note: If you are still facing issues, reach out to [email protected].
By the end of this book, we will have walked you through the journey of building an end-to-end large language model (LLM) product. We firmly believe that the best way to learn about LLMs and production machine learning (ML) is to get your hands dirty and build systems. This book will show you how to build an LLM Twin, an AI character that learns to write like a particular person by incorporating its style, voice, and personality into an LLM. Using this example, we will walk you through the complete ML life cycle, from data gathering to deployment and monitoring. Most of the concepts learned while implementing your LLM Twin can be applied in other LLM-based or ML applications.
When starting to implement a new product, from an engineering point of view, there are three planning steps we must go through before we start building. First, it is critical to understand the problem we are trying to solve and what we want to build. In our case, what exactly is an LLM Twin, and why build it? This step is where we must dream and focus on the “Why.” Secondly, to reflect a real-world scenario, we will design the first iteration of a product with minimum functionality. Here, we must clearly define the core features required to create a working and valuable product. The choices are made based on the timeline, resources, and team’s knowledge. This is where we bridge the gap between dreaming and focusing on what is realistic and eventually answer the following question: “What are we going to build?”.
Finally, we will go through a system design step, laying out the core architecture and design choices used to build the LLM system. Note that the first two steps are primarily product-related, while the last one is technical and focuses on the “How.”
These three steps are natural in building a real-world product. Even if the first two do not require much ML knowledge, it is critical to go through them to understand “how” to build the product with a clear vision. In a nutshell, this chapter covers the following topics:
Understanding the LLM Twin concept
Planning the MVP of the LLM Twin product
Building ML systems with feature/training/inference pipelines
Designing the system architecture of the LLM Twin
By the end of this chapter, you will have a clear picture of what you will learn to build throughout the book.
The first step is to have a clear vision of what we want to create and why it’s valuable to build it. The concept of an LLM Twin is new. Thus, before diving into the technical details, it is essential to understand what it is, what we should expect from it, and how it should work. Having a solid intuition of your end goal makes it much easier to digest the theory, code, and infrastructure presented in this book.
In a few words, an LLM Twin is an AI character that incorporates your writing style, voice, and personality into an LLM, which is a complex AI model. It is a digital version of yourself projected into an LLM. Instead of a generic LLM trained on the whole internet, an LLM Twin is fine-tuned on yourself. Naturally, as an ML model reflects the data it is trained on, this LLM will incorporate your writing style, voice, and personality. We intentionally used the word “projected.” As with any other projection, you lose a lot of information along the way. Thus, this LLM will not be you; it will copy the side of you reflected in the data it was trained on.
It is essential to understand that an LLM reflects the data it was trained on. If you feed it Shakespeare, it will start writing like him. If you train it on Billie Eilish, it will start writing songs in her style. This is also known as style transfer. This concept is prevalent in generating images, too. For example, let’s say you want to create a cat image using Van Gogh’s style. We will leverage the style transfer strategy, but instead of choosing a personality, we will do it on our own persona.
To adjust the LLM to a given style and voice along with fine-tuning, we will also leverage various advanced retrieval-augmented generation (RAG) techniques to condition the autoregressive process with previous embeddings of ourselves.
We will explore the details in Chapter 5 on fine-tuning and Chapters 4 and 9 on RAG, but for now, let’s look at a few examples to intuitively understand what we stated previously.
Here are some scenarios of what you can fine-tune an LLM on to become your twin:
LinkedIn posts and X threads: Specialize the LLM in writing social media content.
Messages with your friends and family: Adapt the LLM to an unfiltered version of yourself.
Academic papers and articles: Calibrate the LLM in writing formal and educative content.
Code: Specialize the LLM in implementing code as you would.
All the preceding scenarios can be reduced to one core strategy: collecting your digital data (or some parts of it) and feeding it to an LLM using different algorithms. Ultimately, the LLM reflects the voice and style of the collected data. Easy, right?
Unfortunately, this raises many technical and moral issues. First, on the technical side, how can we access this data? Do we have enough digital data to project ourselves into an LLM? What kind of data would be valuable? Secondly, on the moral side, is it OK to do this in the first place? Do we want to create a copycat of ourselves? Will it write using our voice and personality, or just try to replicate it?
Remember that the role of this section is not to bother with the “What” and “How” but with the “Why.” Let’s understand why it makes sense to have your LLM Twin, why it can be valuable, and why it is morally correct if we frame the problem correctly.
As an engineer (or any other professional), building a personal brand is more valuable than a standard CV. The biggest issue with creating a personal brand is that writing content on platforms such as LinkedIn, X, or Medium takes a lot of time. Even if you enjoy writing and creating content, you will eventually run out of inspiration or time and feel like you need assistance. We don’t want to transform this section into a pitch, but we have to understand the scope of this product/project clearly.
We want to build an LLM Twin to write personalized content on LinkedIn, X, Instagram, Substack, and Medium (or other blogs) using our style and voice. It will not be used in any immoral scenarios, but it will act as your writing co-pilot. Based on what we will teach you in this book, you can get creative and adapt it to various use cases, but we will focus on the niche of generating social media content and articles. Thus, instead of writing the content from scratch, we can feed the skeleton of our main idea to the LLM Twin and let it do the grunt work.
Ultimately, we will have to check whether everything is correct and format it to our liking (more on the concrete features in the Planning the MVP of the LLM Twin product section). Hence, we project ourselves into a content-writing LLM Twin that will help us automate our writing process. It will likely fail if we try to use this particular LLM in a different scenario, as this is where we will specialize the LLM through fine-tuning, prompt engineering, and RAG.
So, why does building an LLM Twin matter? It helps you do the following:
Create your brand
Automate the writing process
Brainstorm new creative ideas
What’s the difference between a co-pilot and an LLM Twin?
A co-pilot and digital twin are two different concepts that work together and can be combined into a powerful solution:
The co-pilot is an AI assistant or tool that augments human users in various programming, writing, or content creation tasks.
The twin serves as a 1:1 digital representation of a real-world entity, often using AI to bridge the gap between the physical and digital worlds. For instance, an LLM Twin is an LLM that learns to mimic your voice, personality, and writing style.
With these definitions in mind, a writing and content creation AI assistant who writes like you is your LLM Twin co-pilot.
Also, it is critical to understand that building an LLM Twin is entirely moral. The LLM will be fine-tuned only on our personal digital data. We won’t collect and use other people’s data to try to impersonate anyone’s identity. We have a clear goal in mind: creating our personalized writing copycat. Everyone will have their own LLM Twin with restricted access.
Of course, many security concerns are involved, but we won’t go into that here as it could be a book in itself.
This subsection will refer to using ChatGPT (or another similar chatbot) just in the context of generating personalized content.
We have already provided the answer. ChatGPT is not personalized to your writing style and voice. Instead, it is very generic, unarticulated, and wordy. Maintaining an original voice is critical for long-term success when building your brand. Thus, directly using ChatGPT or Gemini will not yield optimal results. Even if you are OK with sharing non-personalized content, mindlessly using ChatGPT can result in the following:
Misinformation due to hallucination: Manually checking the results for hallucinations or using third-party tools to evaluate your results is a tedious and unproductive experience.
Tedious manual prompting: You must manually craft your prompts and inject external information, which is a tiresome experience. Also, the generated answers will be hard to replicate between multiple sessions as you don’t have complete control over your prompts and injected data. You can solve part of this problem using an API and a tool such as LangChain, but you need programming experience to do so.
From our experience, if you want high-quality content that provides real value, you will spend more time debugging the generated text than writing it yourself.
The key to the LLM Twin lies in the following:
What data we collect
How we preprocess the data
How we feed the data into the LLM
How we chain multiple prompts for the desired results
How we evaluate the generated content
The LLM itself is important, but we want to highlight that using ChatGPT’s web interface is exceptionally tedious for managing and injecting various data sources or evaluating the outputs. The solution is to build an LLM system that encapsulates and automates all the following steps (manually replicating them each time is not a feasible long-term solution):
Data collection
Data preprocessing
Data storage, versioning, and retrieval
LLM fine-tuning
RAG
Content generation evaluation
Note that we never said not to use OpenAI’s GPT API, just that the LLM framework we will present is LLM-agnostic. Thus, if it can be manipulated programmatically and exposes a fine-tuning interface, it can be integrated into the LLM Twin system we will learn to build. The key to most successful ML products is to be data-centric and make your architecture model-agnostic. Thus, you can quickly experiment with multiple models on your specific data.
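To make the model-agnostic idea concrete, here is a minimal Python sketch of such an interface. The class and function names (LLMProvider, EchoProvider, write_post) are illustrative assumptions, not the book’s actual code; the point is that the rest of the system depends only on a thin contract, so swapping a hosted API for a fine-tuned open-source model touches a single place.

```python
from abc import ABC, abstractmethod


class LLMProvider(ABC):
    """Thin contract the rest of the system depends on, regardless of the backing model."""

    @abstractmethod
    def generate(self, prompt: str, context: list[str] | None = None) -> str:
        ...


class EchoProvider(LLMProvider):
    """Stand-in backend so the sketch runs without any external API or GPU."""

    def generate(self, prompt: str, context: list[str] | None = None) -> str:
        grounding = "\n".join(context or [])
        return f"[context]\n{grounding}\n[prompt]\n{prompt}"


def write_post(llm: LLMProvider, idea: str, retrieved_docs: list[str]) -> str:
    # Only the LLMProvider interface is visible here, so the data collection,
    # preprocessing, and RAG steps never need to know which model is plugged in.
    return llm.generate(prompt=f"Write a LinkedIn post about: {idea}", context=retrieved_docs)


if __name__ == "__main__":
    print(write_post(EchoProvider(), "the FTI architecture", ["An old post about ML pipelines."]))
```

Any backend that can be called programmatically, whether a hosted API or a locally fine-tuned checkpoint, can implement the same contract.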
Now that we understand what an LLM Twin is and why we want to build it, we must clearly define the product’s features. In this book, we will focus on the first iteration, often labeled the minimum viable product (MVP), to follow the natural cycle of most products. Here, the main objective is to align our ideas with realistic and doable business objectives using the available resources to produce the product. Even as an engineer, as your responsibilities grow, you must go through these steps to bridge the gap between the business needs and what can be implemented.
An MVP is a version of a product that includes just enough features to draw in early users and test the viability of the product concept in the initial stages of development. Usually, the purpose of the MVP is to gather insights from the market with minimal effort.
An MVP is a powerful strategy because of the following reasons:
Accelerated time-to-market: Launch a product quickly to gain early traction
Idea validation: Test it with real users before investing in the full development of the product
Market research: Gain insights into what resonates with the target audience
Risk minimization: Reduces the time and resources needed for a product that might not achieve market success
Sticking to the V in MVP is essential, meaning the product must be viable. The product must provide an end-to-end user journey without half-implemented features, even if the product is minimal. It must be a working product with a good user experience that people will love and want to keep using to see how it evolves to its full potential.
As a thought experiment, let’s assume that instead of building this project for this book, we want to make a real product. In that case, what are our resources? Well, unfortunately, not many:
We are a team of three people: two ML engineers and one ML researcher
Our laptops
Personal funding for computing, such as training LLMs
Our enthusiasm
As you can see, we don’t have many resources. Even if this is just a thought experiment, it reflects the reality for most start-ups at the beginning of their journey. Thus, we must be very strategic in defining our LLM Twin MVP and what features we want to pick. Our goal is simple: we want to maximize the product’s value relative to the effort and resources poured into it.
To keep it simple, we will build the features that can do the following for the LLM Twin:
Collect data from your LinkedIn, Medium, Substack, and GitHub profiles
Fine-tune an open-source LLM using the collected data
Populate a vector database (DB) using our digital data for RAG
Create LinkedIn posts leveraging the following:
User prompts
RAG to reuse and reference old content
New posts, articles, or papers as additional knowledge to the LLM
Have a simple web interface to interact with the LLM Twin and be able to do the following:
Configure your social media links and trigger the collection step
Send prompts or links to external resources
That will be the LLM Twin MVP. Even if it doesn’t sound like much, remember that we must make this system cost-effective, scalable, and modular.
Even if we focus only on the core features of the LLM Twin defined in this section, we will build the product with the latest LLM research and best software engineering and MLOps practices in mind. We aim to show you how to engineer a cost-effective and scalable LLM application.
Until now, we have examined the LLM Twin from the users’ and businesses’ perspectives. The last step is to examine it from an engineering perspective and define a development plan to understand how to solve it technically. From now on, the book’s focus will be on the implementation of the LLM Twin.
Before diving into the specifics of the LLM Twin architecture, we must understand an ML system pattern at the core of the architecture, known as the feature/training/inference (FTI) architecture. This section will present a general overview of the FTI pipeline design and how it can structure an ML application.
Let’s see how we can apply the FTI pipelines to the LLM Twin architecture.
Building production-ready ML systems is much more than just training a model. From an engineering point of view, training the model is the most straightforward step in most use cases. However, training a model becomes complex when deciding on the correct architecture and hyperparameters. That’s not an engineering problem but a research problem.
At this point, we want to focus on how to design a production-ready architecture. Training a model with high accuracy is extremely valuable, but just by training it on a static dataset, you are far from deploying it robustly. We have to consider how to do the following:
Ingest, clean, and validate fresh data
Training versus inference setups
Compute and serve features in the right environment
Serve the model in a cost-effective way
Version, track, and share the datasets and models
Monitor your infrastructure and models
Deploy the model on a scalable infrastructure
Automate the deployments and training
These are the types of problems an ML or MLOps engineer must consider, while the research or data science team is often responsible for training the model.
Figure 1.1: Common elements from an ML system
The preceding figure shows all the components the Google Cloud team suggests that a mature ML and MLOps system requires. Along with the ML code, there are many moving pieces. The rest of the system comprises configuration, automation, data collection, data verification, testing and debugging, resource management, model analysis, process and metadata management, serving infrastructure, and monitoring. The point is that there are many components we must consider when productionizing an ML model.
Thus, the critical question is this: How do we connect all these components into a single homogenous system? We must create a boilerplate for clearly designing ML systems to answer that question.
Similar solutions exist for classic software. For example, if you zoom out, most software applications can be split between a DB, business logic, and UI layer. Every layer can be as complex as needed, but at a high-level overview, the architecture of standard software can be boiled down to the previous three components.
Do we have something similar for ML applications? The first step is to examine previous solutions and why they are unsuitable for building scalable ML systems.
In Figure 1.2, you can observe the typical architecture present in most ML applications. It is based on a monolithic batch architecture that couples the feature creation, model training, and inference into the same component. By taking this approach, you quickly solve one critical problem in the ML world: the training-serving skew. The training-serving skew happens when the features passed to the model are computed differently at training and inference time.
In this architecture, the features are created using the same code. Hence, the training-serving skew issue is solved by default. This pattern works fine when working with small data. The pipeline runs on a schedule in batch mode, and the predictions are consumed by a third-party application such as a dashboard.
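As a minimal sketch of why this works, assume a hypothetical movie-recommendation dataset (the column names are made up for illustration). Because both training and batch prediction call the same compute_features function, the features can never be computed differently at the two stages:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression


def compute_features(raw: pd.DataFrame) -> pd.DataFrame:
    """Single source of truth for the feature logic, reused at training and inference time."""
    features = pd.DataFrame(index=raw.index)
    features["age_scaled"] = raw["age"] / 100.0
    features["movies_watched_log"] = np.log1p(raw["movies_watched"])
    return features


def train(raw: pd.DataFrame) -> LogisticRegression:
    X = compute_features(raw)
    y = raw["liked_recommendation"]
    return LogisticRegression().fit(X, y)


def batch_predict(model: LogisticRegression, fresh_raw: pd.DataFrame) -> np.ndarray:
    # The same compute_features call is reused, so training and serving cannot drift apart.
    # The downside: feature creation, training, and prediction live in one monolithic component.
    return model.predict(compute_features(fresh_raw))
```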
Figure 1.2: Monolithic batch pipeline architecture
Unfortunately, building a monolithic batch system raises many other issues, such as the following:
Features are not reusable (by your system or others)
If the data increases, you have to refactor the whole code to support PySpark or Ray
It’s hard to rewrite the prediction module in a more efficient language such as C++, Java, or Rust
It’s hard to share the work between multiple teams across the feature, training, and prediction modules
It’s impossible to switch to streaming technology for real-time training
In Figure 1.3, we can see a similar scenario for a real-time system. This use case introduces another issue in addition to what we listed before. To make the predictions, we have to transfer the whole state through the client request so the features can be computed and passed to the model.
Consider the scenario of computing movie recommendations for a user. Instead of simply passing the user ID, we must transmit the entire user state, including their name, age, gender, movie history, and more. This approach is fraught with potential errors, as the client must understand how to access this state, and it’s tightly coupled with the model service.
Another example would be when implementing an LLM with RAG support. The documents we add as context along with the query represent our external state. If we didn’t store the records in a vector DB, we would have to pass them with the user query. To do so, the client must know how to query and retrieve the documents, which is not feasible. It is an antipattern for the client application to know how to access or compute the features. If you don’t understand how RAG works, we will explain it in detail in Chapters 4 and 9.
Figure 1.3: Stateless real-time architecture
In conclusion, our problem is accessing the features to make predictions without passing them at the client’s request. For example, based on our first user movie recommendation example, how can we predict the recommendations solely based on the user’s ID? Remember these questions, as we will answer them shortly.
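As a preview of the answer, here is a minimal sketch of the pattern we are aiming for. The FeatureStore class and feature names are illustrative stand-ins, not a real feature store client: the client sends only a user ID, and the serving side looks up features that a separate pipeline precomputed.

```python
from dataclasses import dataclass


@dataclass
class UserFeatures:
    age_scaled: float
    movies_watched_log: float


class FeatureStore:
    """Toy in-memory stand-in for the online store of a real feature store."""

    def __init__(self) -> None:
        self._online: dict[str, UserFeatures] = {}

    def put(self, user_id: str, features: UserFeatures) -> None:
        # Populated by a feature pipeline on its own schedule, not by the client.
        self._online[user_id] = features

    def get(self, user_id: str) -> UserFeatures:
        return self._online[user_id]


def recommend(user_id: str, store: FeatureStore, model) -> list:
    # The request carries only the ID; the user state lives server-side.
    f = store.get(user_id)
    return model.predict([[f.age_scaled, f.movies_watched_log]])
```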
Ultimately, at the other end of the spectrum, Google Cloud provides a production-ready architecture, as shown in Figure 1.4. Unfortunately, even if it’s a feasible solution, it’s very complex and not intuitive. You will have difficulty understanding it if you are not highly experienced in deploying and keeping ML models in production. Also, it is not straightforward to understand how to start small and grow the system over time.
The following image is reproduced from work created and shared by Google and used according to terms described in the Creative Commons 4.0 Attribution License:
Figure 1.4: ML pipeline automation for CT (source: https://cloud.google.com/architecture/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning)
But here is where the FTI pipeline architectures kick in. The following section will show you how to solve these fundamental issues using an intuitive ML design.
The solution is based on creating a clear and straightforward mind map that any team or person can follow to compute the features, train the model, and make predictions. Based on these three critical steps that any ML system requires, the pattern is known as the FTI pipeline. So, how does this differ from what we presented before?
The pattern suggests that any ML system can be boiled down to these three pipelines: feature, training, and inference (similar to the DB, business logic, and UI layers from classic software). This is powerful, as we can clearly define the scope and interface of each pipeline. Also, it’s easier to understand how the three components interact. Ultimately, we have just three instead of 20 moving pieces, as suggested in Figure 1.4, which is much easier to work with and define.
As shown in Figure 1.5, we have the feature, training, and inference pipelines. We will zoom in on each of them and understand their scope and interface.
Figure 1.5: FTI pipelines architecture
Before going into the details, it is essential to understand that each pipeline is a different component that can run on a different process or hardware. Thus, each pipeline can be written using a different technology, by a different team, or scaled differently. The key idea is that the design is very flexible to the needs of your team. It acts as a mind map for structuring your architecture.
The feature pipeline takes raw data as input, processes it, and outputs the features and labels required by the model for training or inference. Instead of directly passing them to the model, the features and labels are stored inside a feature store. Its responsibility is to store, version, track, and share the features. By saving the features in a feature store, we always have a state of our features. Thus, we can easily send the features to the training and inference pipelines.
As the data is versioned, we can always ensure that the training and inference time features match. Thus, we avoid the training-serving skew problem.
The training pipeline takes the features and labels from the feature store as input and outputs a trained model or models. The models are stored in a model registry. Its role is similar to that of feature stores, but this time, the model is the first-class citizen. Thus, the model registry will store, version, track, and share the model with the inference pipeline.
Also, most modern model registries support a metadata store that allows you to specify essential aspects of how the model was trained. The most important are the features, labels, and their version used to train the model. Thus, we will always know what data the model was trained on.
The inference pipeline takes as input the features and labels from the feature store and the trained model from the model registry. With these two, predictions can be easily made in either batch or real-time mode.
As this is a versatile pattern, it is up to you to decide what you do with your predictions. If it’s a batch system, they will probably be stored in a DB. If it’s a real-time system, the predictions will be served to the client who requested them. Additionally, the features, labels, and models are versioned. We can easily upgrade or roll back the deployment of the model. For example, we will always know that model v1 uses features F1, F2, and F3, and model v2 uses F2, F3, and F4. Thus, we can quickly change the connections between the model and features.
To conclude, the most important thing you must remember about the FTI pipelines is their interface:
The feature pipeline takes in data and outputs the features and labels saved to the feature store.
The training pipeline queries the feature store for features and labels and outputs a model to the model registry.
The inference pipeline uses the features from the feature store and the model from the model registry to make predictions.
It doesn’t matter how complex your ML system gets; these interfaces will remain the same.
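A minimal, self-contained sketch of those three interfaces is shown below. The FeatureStore and ModelRegistry classes and the trivial threshold “model” are illustrative stand-ins, not the book’s implementation; what matters is that each pipeline only reads from and writes to the two shared stores.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class FeatureStore:
    """Toy stand-in: holds one versioned snapshot of features and labels."""

    version: int = 0
    features: list = field(default_factory=list)
    labels: list = field(default_factory=list)

    def write(self, features: list, labels: list) -> None:
        self.features, self.labels = features, labels
        self.version += 1

    def read(self) -> tuple:
        return self.features, self.labels


@dataclass
class ModelRegistry:
    """Toy stand-in: stores models by version, along with training metadata."""

    models: dict = field(default_factory=dict)
    metadata: dict = field(default_factory=dict)

    def register(self, model: Any, metadata: dict) -> int:
        version = len(self.models) + 1
        self.models[version], self.metadata[version] = model, metadata
        return version

    def load_latest(self) -> Any:
        return self.models[max(self.models)]


def feature_pipeline(raw_data: list, store: FeatureStore) -> None:
    # In: raw data. Out: features and labels saved to the feature store.
    features = [[row["x"] / 10.0] for row in raw_data]
    labels = [row["y"] for row in raw_data]
    store.write(features, labels)


def training_pipeline(store: FeatureStore, registry: ModelRegistry) -> int:
    # In: the feature store. Out: a registered model version plus its metadata.
    features, labels = store.read()
    positives = [f[0] for f, y in zip(features, labels) if y == 1]
    threshold = sum(positives) / max(len(positives), 1)

    def model(feature_row: list) -> int:
        return int(feature_row[0] >= threshold)

    return registry.register(model, metadata={"features_version": store.version})


def inference_pipeline(store: FeatureStore, registry: ModelRegistry, feature_row: list) -> int:
    # In: the feature store and model registry. Out: a prediction.
    model = registry.load_latest()
    return model(feature_row)
```

In the LLM Twin, the “features” become chunked and embedded documents in a vector DB, the “model” is a fine-tuned LLM in a registry, and the inference pipeline sits behind a REST API, but the contract between the pipelines stays exactly the same.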
Now that we understand better how the pattern works, we want to highlight the main benefits of using this pattern:
As you have just three components, it is intuitive to use and easy to understand.
Each component can be written in its own tech stack, so we can quickly adapt them to specific needs, such as big or streaming data. Also, it allows us to pick the best tools for the job.
As there is a transparent interface between the three components, each one can be developed by a different team (if necessary), making the development more manageable and scalable.
Every component can be deployed, scaled, and monitored independently.
The final thing you must understand about the FTI pattern is that the system doesn’t have to contain only three pipelines. In most cases, it will include more. For example, the feature pipeline can be composed of a service that computes the features and one that validates the data. Also, the training pipeline can be composed of the training and evaluation components.
The FTI pipelines act as logical layers. Thus, it is perfectly fine for each to be complex and contain multiple services. However, what is essential is to stick to the same interface on how the FTI pipelines interact with each other through the feature store and model registries. By doing so, each FTI component can evolve independently, without knowing the others’ internal details and without breaking the system when changes are introduced.
To learn more about the FTI pipeline pattern, consider reading From MLOps to ML Systems with Feature/Training/Inference Pipelines by Jim Dowling, CEO and co-founder of Hopsworks: https://www.hopsworks.ai/post/mlops-to-ml-systems-with-fti-pipelines. His article inspired this section.
Now that we understand the FTI pipeline architecture, the final step of this chapter is to see how it can be applied to the LLM Twin use case.
In this section, we will list the concrete technical details of the LLM Twin application and understand how we can solve them by designing our LLM system using the FTI architecture. However, before diving into the pipelines, we want to highlight that we won’t focus on the tooling or the tech stack at this step. We only want to define a high-level architecture of the system, which is language-, framework-, platform-, and infrastructure-agnostic at this point. We will focus on each component’s scope, interface, and interconnectivity. In future chapters, we will cover the implementation details and tech stack.
Until now, we defined what the LLM Twin should support from the user’s point of view. Now, let’s clarify the requirements of the ML system from a purely technical perspective:
On the data side, we have to do the following:
Collect data from LinkedIn, Medium, Substack, and GitHub completely autonomously and on a schedule
Standardize the crawled data and store it in a data warehouse
Clean the raw data
Create instruct datasets for fine-tuning an LLM
Chunk and embed the cleaned data. Store the vectorized data into a vector DB for RAG.
For training, we have to do the following:
Fine-tune LLMs of various sizes (7B, 14B, 30B, or 70B parameters)
Fine-tune on instruction datasets of multiple sizes
Switch between LLM types (for example, between Mistral, Llama, and GPT)
Track and compare experiments
Test potential production LLM candidates before deploying them
Automatically start the training when new instruction datasets are available.
The inference code will have the following properties:
A REST API interface for clients to interact with the LLM Twin
Access to the vector DB in real time for RAG
Inference with LLMs of various sizes
Autoscaling based on user requests
Automatically deploy the LLMs that pass the evaluation step.
The system will support the following LLMOps features:
Instruction dataset versioning, lineage, and reusability
Model versioning, lineage, and reusability
Experiment tracking
Continuous training, continuous integration, and continuous delivery (CT/CI/CD)
Prompt and system monitoring
If any technical requirement doesn’t make sense now, bear with us. To avoid repetition, we will examine the details in their specific chapter.
The preceding list is quite comprehensive. We could have detailed it even more, but at this point, we want to focus on the core functionality. When implementing each component, we will look into all the little details. But for now, the fundamental question we must ask ourselves is this: How can we apply the FTI pipeline design to implement the preceding list of requirements?
We will split the system into four core