LLM Engineer’s Handbook

Paul Iusztin
Description

Artificial intelligence has undergone rapid advancements, and Large Language Models (LLMs) are at the forefront of this revolution. This LLM book offers insights into designing, training, and deploying LLMs in real-world scenarios by leveraging MLOps best practices. The guide walks you through building an LLM-powered twin that’s cost-effective, scalable, and modular. It moves beyond isolated Jupyter notebooks, focusing on how to build production-grade end-to-end LLM systems.
Throughout this book, you will learn data engineering, supervised fine-tuning, and deployment. The hands-on approach to building the LLM Twin use case will help you implement MLOps components in your own projects. You will also explore cutting-edge advancements in the field, including inference optimization, preference alignment, and real-time data processing, making this a vital resource for those looking to apply LLMs in their projects.
By the end of this book, you will be proficient in deploying LLMs that solve practical problems while maintaining low-latency and high-availability inference capabilities. Whether you are new to artificial intelligence or an experienced practitioner, this book delivers guidance and practical techniques that will deepen your understanding of LLMs and sharpen your ability to implement them effectively.

You can read this e-book in Legimi apps or any other app that supports the following formats:

EPUB
MOBI

Page count: 656

Year of publication: 2024




Contents

Forewords

Contributors

Join our book’s Discord space

Preface

Who this book is for

What this book covers

To get the most out of this book

Get in touch

Making the Most Out of This Book – Get to Know Your Free Benefits

Understanding the LLM Twin Concept and Architecture

Understanding the LLM Twin concept

What is an LLM Twin?

Why building an LLM Twin matters

Why not use ChatGPT (or another similar chatbot)?

Planning the MVP of the LLM Twin product

What is an MVP?

Defining the LLM Twin MVP

Building ML systems with feature/training/inference pipelines

The problem with building ML systems

The issue with previous solutions

The solution – ML pipelines for ML systems

The feature pipeline

The training pipeline

The inference pipeline

Benefits of the FTI architecture

Designing the system architecture of the LLM Twin

Listing the technical details of the LLM Twin architecture

How to design the LLM Twin architecture using the FTI pipeline design

Data collection pipeline

Feature pipeline

Training pipeline

Inference pipeline

Final thoughts on the FTI design and the LLM Twin architecture

Summary

References

Tooling and Installation

Python ecosystem and project installation

Poetry: dependency and virtual environment management

Poe the Poet: task execution tool

MLOps and LLMOps tooling

Hugging Face: model registry

ZenML: orchestrator, artifacts, and metadata

Orchestrator

Artifacts and metadata

How to run and configure a ZenML pipeline

Comet ML: experiment tracker

Opik: prompt monitoring

Databases for storing unstructured and vector data

MongoDB: NoSQL database

Qdrant: vector database

Preparing for AWS

Setting up an AWS account, an access key, and the CLI

SageMaker: training and inference compute

Why AWS SageMaker?

Summary

References

Join our book’s Discord space

Data Engineering

Designing the LLM Twin’s data collection pipeline

Implementing the LLM Twin’s data collection pipeline

ZenML pipeline and steps

The dispatcher: How do you instantiate the right crawler?

The crawlers

Base classes

GitHubCrawler class

CustomArticleCrawler class

MediumCrawler class

The NoSQL data warehouse documents

The ORM and ODM software patterns

Implementing the ODM class

Data categories and user document classes

Gathering raw data into the data warehouse

Troubleshooting

Selenium issues

Import our backed-up data

Summary

References

RAG Feature Pipeline

Understanding RAG

Why use RAG?

Hallucinations

Old information

The vanilla RAG framework

Ingestion pipeline

Retrieval pipeline

Generation pipeline

What are embeddings?

Why embeddings are so powerful

How are embeddings created?

Applications of embeddings

More on vector DBs

How does a vector DB work?

Algorithms for creating the vector index

DB operations

An overview of advanced RAG

Pre-retrieval

Retrieval

Post-retrieval

Exploring the LLM Twin’s RAG feature pipeline architecture

The problem we are solving

The feature store

Where does the raw data come from?

Designing the architecture of the RAG feature pipeline

Batch pipelines

Batch versus streaming pipelines

Core steps

Change data capture: syncing the data warehouse and feature store

Why is the data stored in two snapshots?

Orchestration

Implementing the LLM Twin’s RAG feature pipeline

Settings

ZenML pipeline and steps

Querying the data warehouse

Cleaning the documents

Chunk and embed the cleaned documents

Loading the documents to the vector DB

Pydantic domain entities

OVM

The dispatcher layer

The handlers

The cleaning handlers

The chunking handlers

The embedding handlers

Summary

References

Join our book’s Discord space

Supervised Fine-Tuning

Creating an instruction dataset

General framework

Data quantity

Data curation

Rule-based filtering

Data deduplication

Data decontamination

Data quality evaluation

Data exploration

Data generation

Data augmentation

Creating our own instruction dataset

Exploring SFT and its techniques

When to fine-tune

Instruction dataset formats

Chat templates

Parameter-efficient fine-tuning techniques

Full fine-tuning

LoRA

QLoRA

Training parameters

Learning rate and scheduler

Batch size

Maximum length and packing

Number of epochs

Optimizers

Weight decay

Gradient checkpointing

Fine-tuning in practice

Summary

References

Fine-Tuning with Preference Alignment

Understanding preference datasets

Preference data

Data quantity

Data generation and evaluation

Generating preferences

Tips for data generation

Evaluating preferences

Creating our own preference dataset

Preference alignment

Reinforcement Learning from Human Feedback

Direct Preference Optimization

Implementing DPO

Summary

References

Join our book’s Discord space

Evaluating LLMs

Model evaluation

Comparing ML and LLM evaluation

General-purpose LLM evaluations

Domain-specific LLM evaluations

Task-specific LLM evaluations

RAG evaluation

Ragas

ARES

Evaluating TwinLlama-3.1-8B

Generating answers

Evaluating answers

Analyzing results

Summary

References

Inference Optimization

Model optimization strategies

KV cache

Continuous batching

Speculative decoding

Optimized attention mechanisms

Model parallelism

Data parallelism

Pipeline parallelism

Tensor parallelism

Combining approaches

Model quantization

Introduction to quantization

Quantization with GGUF and llama.cpp

Quantization with GPTQ and EXL2

Other quantization techniques

Summary

References

Join our book’s Discord space

RAG Inference Pipeline

Understanding the LLM Twin’s RAG inference pipeline

Exploring the LLM Twin’s advanced RAG techniques

Advanced RAG pre-retrieval optimizations: query expansion and self-querying

Query expansion

Self-querying

Advanced RAG retrieval optimization: filtered vector search

Advanced RAG post-retrieval optimization: reranking

Implementing the LLM Twin’s RAG inference pipeline

Implementing the retrieval module

Bringing everything together into the RAG inference pipeline

Summary

References

Inference Pipeline Deployment

Criteria for choosing deployment types

Throughput and latency

Data

Understanding inference deployment types

Online real-time inference

Asynchronous inference

Offline batch transform

Monolithic versus microservices architecture in model serving

Monolithic architecture

Microservices architecture

Choosing between monolithic and microservices architectures

Exploring the LLM Twin’s inference pipeline deployment strategy

The training versus the inference pipeline

Deploying the LLM Twin service

Implementing the LLM microservice using AWS SageMaker

What are Hugging Face’s DLCs?

Configuring SageMaker roles

Deploying the LLM Twin model to AWS SageMaker

Calling the AWS SageMaker Inference endpoint

Building the business microservice using FastAPI

Autoscaling capabilities to handle spikes in usage

Registering a scalable target

Creating a scalable policy

Minimum and maximum scaling limits

Cooldown period

Summary

References

Join our book’s Discord space

MLOps and LLMOps

The path to LLMOps: Understanding its roots in DevOps and MLOps

DevOps

The DevOps lifecycle

The core DevOps concepts

MLOps

MLOps core components

MLOps principles

ML vs. MLOps engineering

LLMOps

Human feedback

Guardrails

Prompt monitoring

Deploying the LLM Twin’s pipelines to the cloud

Understanding the infrastructure

Setting up MongoDB

Setting up Qdrant

Setting up the ZenML cloud

Containerize the code using Docker

Run the pipelines on AWS

Troubleshooting the ResourceLimitExceeded error after running a ZenML pipeline on SageMaker

Adding LLMOps to the LLM Twin

LLM Twin’s CI/CD pipeline flow

More on formatting errors

More on linting errors

Quick overview of GitHub Actions

The CI pipeline

GitHub Actions CI YAML file

The CD pipeline

Test out the CI/CD pipeline

The CT pipeline

Initial triggers

Trigger downstream pipelines

Prompt monitoring

Alerting

Summary

References

MLOps Principles

1. Automation or operationalization

2. Versioning

3. Experiment tracking

4. Testing

Test types

What do we test?

Test examples

5. Monitoring

Logs

Metrics

System metrics

Model metrics

Drifts

Monitoring vs. observability

Alerts

6. Reproducibility

Other Books You May Enjoy

Index


LLM Engineer’s Handbook

Master the art of engineering large language models from concept to production

Paul Iusztin

Maxime Labonne

LLM Engineer’s Handbook

Copyright © 2024 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Senior Publishing Product Manager: Gebin George

Acquisition Editor – Peer Reviews: Swaroop Singh

Project Editor: Amisha Vathare

Content Development Editor: Tanya D’cruz

Copy Editor: Safis Editing

Technical Editor: Karan Sonawane

Proofreader: Safis Editing

Indexer: Manju Arasan

Presentation Designer: Rajesh Shirsath

Developer Relations Marketing Executive: Anamika Singh

First published: October 2024

Production reference: 4070725

Published by Packt Publishing Ltd.

Grosvenor House

11 St Paul’s Square

Birmingham

B3 1RB, UK.

ISBN 978-1-83620-007-9

www.packt.com

Forewords

As my co-founder at Hugging Face, Clement Delangue, and I often say, AI is becoming the default way of building technology.

Over the past 3 years, LLMs have already had a profound impact on technology, and they are bound to have an even greater impact in the coming 5 years. They will be embedded in more and more products and, I believe, at the center of any human activity based on knowledge or creativity.

For instance, coders are already leveraging LLMs and changing the way they work, focusing on higher-order thinking and tasks while collaborating with machines. Studio musicians rely on AI-powered tools to explore the musical creativity space faster. Lawyers are increasing their impact through retrieval-augmented generation (RAG) and large databases of case law.

At Hugging Face, we’ve always advocated for a future where not just one company or a small number of scientists control the AI models used by the rest of the population, but instead for a future where as many people as possible—from as many different backgrounds as possible—are capable of diving into how cutting-edge machine learning models actually work.

Maxime Labonne and Paul Iusztin have been instrumental in this movement to democratize LLMs by writing this book and making sure that as many people as possible can not only use them but also adapt them, fine-tune them, quantize them, and make them efficient enough to actually deploy in the real world.

Their work is essential, and I’m glad they are making this resource available to the community. This expands the convex hull of human knowledge.

Julien Chaumond

Co-founder and CTO, Hugging Face

As someone deeply immersed in the world of machine learning operations, I’m thrilled to endorse The LLM Engineer’s Handbook. This comprehensive guide arrives at a crucial time when the demand for LLM expertise is skyrocketing across industries.

What sets this book apart is its practical, end-to-end approach. By walking readers through the creation of an LLM Twin, it bridges the often daunting gap between theory and real-world application. From data engineering and model fine-tuning to advanced topics like RAG pipelines and inference optimization, the authors leave no stone unturned.

I’m particularly impressed by the emphasis on MLOps and LLMOps principles. As organizations increasingly rely on LLMs, understanding how to build scalable, reproducible, and robust systems is paramount. The inclusion of orchestration strategies and cloud integration showcases the authors’ commitment to equipping readers with truly production-ready skills.

Whether you’re a seasoned ML practitioner looking to specialize in LLMs or a software engineer aiming to break into this exciting field, this handbook provides the perfect blend of foundational knowledge and cutting-edge techniques. The clear explanations, practical examples, and focus on best practices make it an invaluable resource for anyone serious about mastering LLM engineering.

In an era where AI is reshaping industries at breakneck speed, The LLM Engineer’s Handbook stands out as an essential guide for navigating the complexities of large language models. It’s not just a book; it’s a roadmap to becoming a proficient LLM engineer in today’s AI-driven landscape.

Hamza Tahir

Co-founder and CTO, ZenML

The LLM Engineer’s Handbook serves as an invaluable resource for anyone seeking a hands-on understanding of LLMs. Through practical examples and a comprehensive exploration of the LLM Twin project, the author effectively demystifies the complexities of building and deploying production-level LLM applications.

One of the book’s standout features is its use of the LLM Twin project as a running example. This AI character, designed to emulate the writing style of a specific individual, provides a tangible illustration of how LLMs can be applied in real-world scenarios.

The author skillfully guides readers through the essential tools and technologies required for LLM development, including Hugging Face, ZenML, Comet, Opik, MongoDB, and Qdrant. Each tool is explained in detail, making it easy for readers to understand their functions and how they can be integrated into an LLM pipeline.

LLM Engineer’s Handbook also covers a wide range of topics related to LLM development, such as data collection, fine-tuning, evaluation, inference optimization, and MLOps. Notably, the chapters on supervised fine-tuning, preference alignment, and Retrieval Augmented Generation (RAG) provide in-depth insights into these critical aspects of LLM development.

A particular strength of this book lies in its focus on practical implementation. The author excels at providing concrete examples and guidance on how to optimize inference pipelines and deploy LLMs effectively. This makes the book a valuable resource for both researchers and practitioners.

This book is highly recommended for anyone interested in learning about LLMs and their practical applications. By providing a comprehensive overview of the tools, techniques, and best practices involved in LLM development, the authors have created a valuable resource that will undoubtedly be a reference for many LLM engineers.

Antonio Gulli

Senior Director, Google

Contributors

About the authors

Paul Iusztin is a senior ML and MLOps engineer with over seven years of experience building GenAI, computer vision, and MLOps solutions. His latest contribution was at Metaphysic, where he served as one of their core engineers taking large neural networks to production. He previously worked at CoreAI, Everseen, and Continental. He is the founder of Decoding ML, an educational channel on production-grade ML that provides posts, articles, and open-source courses to help others build real-world ML systems.

Maxime Labonne is the Head of Post-Training at Liquid AI. He holds a PhD in ML from the Polytechnic Institute of Paris and is recognized as a Google Developer Expert in AI/ML. As an active blogger, he has made significant contributions to the open-source community, including the LLM Course on GitHub, tools such as LLM AutoEval, and several state-of-the-art models like NeuralDaredevil. He is the author of the best-selling book Hands-On Graph Neural Networks Using Python, published by Packt.

I want to thank my family and partner. Your unwavering support and patience made this book possible.

About the reviewer

Rany ElHousieny is an AI solutions architect and AI engineering manager with over two decades of experience in AI, NLP, and ML. Throughout his career, he has focused on the development and deployment of AI models, authoring multiple articles on AI systems architecture and ethical AI deployment. He has led groundbreaking projects at companies like Microsoft, where he spearheaded advancements in NLP and the Language Understanding Intelligent Service (LUIS). Currently, he plays a pivotal role at Clearwater Analytics, driving innovation in GenAI and AI-driven financial and investment management solutions.

I would like to thank Clearwater Analytics for providing a supportive and learning environment that fosters growth and innovation. The vision of our leaders, always staying ahead with the latest technologies, has been a constant source of inspiration. Their commitment to AI advancements made my experience of reviewing this book insightful and enriching. Special thanks to my family for their ongoing encouragement throughout this journey.

Join our book’s Discord space

Join our community’s Discord space for discussions with the authors and other readers:

https://packt.link/llmeng

Share your thoughts

Once you’ve read LLM Engineer’s Handbook, First Edition, we’d love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.

Your review is important to us and the tech community and will help us make sure we’re delivering excellent quality content.

Making the Most Out of This Book – Get to Know Your Free Benefits

Unlock exclusive free benefits that come with your purchase, thoughtfully crafted to supercharge your learning journey and help you learn without limits.

https://www.packtpub.com/unlock/9781836200079

Note: Have your purchase invoice ready before you begin.

Figure 1.1: Next-Gen Reader, AI Assistant (Beta), and Free PDF access

Enhanced reading experience with our Next-gen Reader:

Multi-device progress sync: Learn from any device with seamless progress sync.

Highlighting and Notetaking: Turn your reading into lasting knowledge.

Bookmarking: Revisit your most important learnings anytime.

Dark mode: Focus with minimal eye strain by switching to dark or sepia modes.

Learn smarter using our AI assistant (Beta):

Summarize it: Summarize key sections or an entire chapter.

AI code explainers: In Packt Reader, click the “Explain” button above each code block for AI-powered code explanations.

Note: AI Assistant is part of next-gen Packt Reader and is still in beta.

Learn anytime, anywhere:

Access your content offline with DRM-free PDF and ePub versions—compatible with your favorite e-readers.

Unlock Your Book’s Exclusive Benefits

Your copy of this book comes with the following exclusive benefits:

Next-gen Packt Reader

AI assistant (beta)

DRM-free PDF/ePub downloads

Use the following guide to unlock them if you haven’t already. The process takes just a few minutes and needs to be done only once.

How to unlock these benefits in three easy steps

Step 1

Have your purchase invoice for this book ready, as you’ll need it in Step 3. If you received a physical invoice, scan it on your phone and have it ready as either a PDF, JPG, or PNG.

For more help on finding your invoice, visit https://www.packtpub.com/unlock-benefits/help.

Note: Bought this book directly from Packt? You don’t need an invoice. After completing Step 2, you can jump straight to your exclusive content.

Step 2

Scan the following QR code or visit https://www.packtpub.com/unlock/9781836200079:

Step 3

Sign in to your Packt account or create a new one for free. Once you’re logged in, upload your invoice. It can be in PDF, PNG, or JPG format and must be no larger than 10 MB. Follow the rest of the instructions on the screen to complete the process.

Need help?

If you get stuck and need help, visit https://www.packtpub.com/unlock-benefits/help for a detailed FAQ on how to find your invoices and more. The following QR code will take you to the help page directly:

Note: If you are still facing issues, reach out to [email protected].

1

Understanding the LLM Twin Concept and Architecture

By the end of this book, we will have walked you through the journey of building an end-to-end large language model (LLM) product. We firmly believe that the best way to learn about LLMs and production machine learning (ML) is to get your hands dirty and build systems. This book will show you how to build an LLM Twin, an AI character that learns to write like a particular person by incorporating their style, voice, and personality into an LLM. Using this example, we will walk you through the complete ML life cycle, from data gathering to deployment and monitoring. Most of the concepts learned while implementing your LLM Twin can be applied in other LLM-based or ML applications.

When starting to implement a new product, from an engineering point of view, there are three planning steps we must go through before we start building. First, it is critical to understand the problem we are trying to solve and what we want to build. In our case, what exactly is an LLM Twin, and why build it? This step is where we must dream and focus on the “Why.” Secondly, to reflect a real-world scenario, we will design the first iteration of a product with minimum functionality. Here, we must clearly define the core features required to create a working and valuable product. The choices are made based on the timeline, resources, and team’s knowledge. This is where we bridge the gap between dreaming and focusing on what is realistic and eventually answer the following question: “What are we going to build?”.

Finally, we will go through a system design step, laying out the core architecture and design choices used to build the LLM system. Note that the first two components are primarily product-related, while the last one is technical and focuses on the “How.”

These three steps are natural in building a real-world product. Even if the first two do not require much ML knowledge, it is critical to go through them to understand “how” to build the product with a clear vision. In a nutshell, this chapter covers the following topics:

Understanding the LLM Twin concept
Planning the MVP of the LLM Twin product
Building ML systems with feature/training/inference pipelines
Designing the system architecture of the LLM Twin

By the end of this chapter, you will have a clear picture of what you will learn to build throughout the book.

Understanding the LLM Twin concept

The first step is to have a clear vision of what we want to create and why it’s valuable to build it. The concept of an LLM Twin is new. Thus, before diving into the technical details, it is essential to understand what it is, what we should expect from it, and how it should work. Having a solid intuition of your end goal makes it much easier to digest the theory, code, and infrastructure presented in this book.

What is an LLM Twin?

In a few words, an LLM Twin is an AI character that incorporates your writing style, voice, and personality into an LLM, which is a complex AI model. It is a digital version of yourself projected into an LLM. Instead of a generic LLM trained on the whole internet, an LLM Twin is fine-tuned on yourself. Naturally, as an ML model reflects the data it is trained on, this LLM will incorporate your writing style, voice, and personality. We intentionally used the word “projected.” As with any other projection, you lose a lot of information along the way. Thus, this LLM will not be you; it will copy the side of you reflected in the data it was trained on.

It is essential to understand that an LLM reflects the data it was trained on. If you feed it Shakespeare, it will start writing like him. If you train it on Billie Eilish, it will start writing songs in her style. This is also known as style transfer. This concept is prevalent in generating images, too. For example, let’s say you want to create a cat image using Van Gogh’s style. We will leverage the style transfer strategy, but instead of choosing a personality, we will do it on our own persona.

To adjust the LLM to a given style and voice along with fine-tuning, we will also leverage various advanced retrieval-augmented generation (RAG) techniques to condition the autoregressive process with previous embeddings of ourselves.

We will explore the details in Chapter 5 on fine-tuning and Chapters 4 and 9 on RAG, but for now, let’s look at a few examples to intuitively understand what we stated previously.

Here are some scenarios of what you can fine-tune an LLM on to become your twin:

LinkedIn posts and X threads: Specialize the LLM in writing social media content.
Messages with your friends and family: Adapt the LLM to an unfiltered version of yourself.
Academic papers and articles: Calibrate the LLM in writing formal and educative content.
Code: Specialize the LLM in implementing code as you would.

All the preceding scenarios can be reduced to one core strategy: collecting your digital data (or some parts of it) and feeding it to an LLM using different algorithms. Ultimately, the LLM reflects the voice and style of the collected data. Easy, right?

Unfortunately, this raises many technical and moral issues. First, on the technical side, how can we access this data? Do we have enough digital data to project ourselves into an LLM? What kind of data would be valuable? Secondly, on the moral side, is it OK to do this in the first place? Do we want to create a copycat of ourselves? Will it write using our voice and personality, or just try to replicate it?

Remember that the role of this section is not to bother with the “What” and “How” but with the “Why.” Let’s understand why it makes sense to have your LLM Twin, why it can be valuable, and why it is morally correct if we frame the problem correctly.

Why building an LLM Twin matters

As an engineer (or any other professional), building a personal brand is more valuable than a standard CV. The biggest issue with creating a personal brand is that writing content on platforms such as LinkedIn, X, or Medium takes a lot of time. Even if you enjoy writing and creating content, you will eventually run out of inspiration or time and feel like you need assistance. We don’t want to transform this section into a pitch, but we have to understand the scope of this product/project clearly.

We want to build an LLM Twin to write personalized content on LinkedIn, X, Instagram, Substack, and Medium (or other blogs) using our style and voice. It will not be used in any immoral scenarios, but it will act as your writing co-pilot. Based on what we will teach you in this book, you can get creative and adapt it to various use cases, but we will focus on the niche of generating social media content and articles. Thus, instead of writing the content from scratch, we can feed the skeleton of our main idea to the LLM Twin and let it do the grunt work.

Ultimately, we will have to check whether everything is correct and format it to our liking (more on the concrete features in the Planning the MVP of the LLM Twin product section). Hence, we project ourselves into a content-writing LLM Twin that will help us automate our writing process. It will likely fail if we try to use this particular LLM in a different scenario, as this is where we will specialize the LLM through fine-tuning, prompt engineering, and RAG.

So, why does building an LLM Twin matter? It helps you do the following:

Create your brand
Automate the writing process
Brainstorm new creative ideas

What’s the difference between a co-pilot and an LLM Twin?

A co-pilot and digital twin are two different concepts that work together and can be combined into a powerful solution:

The co-pilot is an AI assistant or tool that augments human users in various programming, writing, or content creation tasks.
The twin serves as a 1:1 digital representation of a real-world entity, often using AI to bridge the gap between the physical and digital worlds. For instance, an LLM Twin is an LLM that learns to mimic your voice, personality, and writing style.

With these definitions in mind, a writing and content creation AI assistant who writes like you is your LLM Twin co-pilot.

Also, it is critical to understand that building an LLM Twin is entirely moral. The LLM will be fine-tuned only on our personal digital data. We won’t collect and use other people’s data to try to impersonate anyone’s identity. We have a clear goal in mind: creating our personalized writing copycat. Everyone will have their own LLM Twin with restricted access.

Of course, many security concerns are involved, but we won’t go into that here as it could be a book in itself.

Why not use ChatGPT (or another similar chatbot)?

This subsection will refer to using ChatGPT (or another similar chatbot) just in the context of generating personalized content.

We have already provided the answer. ChatGPT is not personalized to your writing style and voice. Instead, it is very generic, unarticulated, and wordy. Maintaining an original voice is critical for long-term success when building your brand. Thus, directly using ChatGPT or Gemini will not yield the most optimal results. Even if you are OK with sharing impersonalized content, mindlessly using ChatGPT can result in the following:

Misinformation due to hallucination: Manually checking the results for hallucinations or using third-party tools to evaluate your results is a tedious and unproductive experience.
Tedious manual prompting: You must manually craft your prompts and inject external information, which is a tiresome experience. Also, the generated answers will be hard to replicate between multiple sessions as you don’t have complete control over your prompts and injected data. You can solve part of this problem using an API and a tool such as LangChain, but you need programming experience to do so.

From our experience, if you want high-quality content that provides real value, you will spend more time debugging the generated text than writing it yourself.

The key to the LLM Twin lies in the following:

What data we collect
How we preprocess the data
How we feed the data into the LLM
How we chain multiple prompts for the desired results
How we evaluate the generated content

The LLM itself is important, but we want to highlight that using ChatGPT’s web interface is exceptionally tedious in managing and injecting various data sources or evaluating the outputs. The solution is to build an LLM system that encapsulates and automates all the following steps (manually replicating them each time is not a long-term and feasible solution):

Data collection
Data preprocessing
Data storage, versioning, and retrieval
LLM fine-tuning
RAG
Content generation evaluation

Note that we never said not to use OpenAI’s GPT API, just that the LLM framework we will present is LLM-agnostic. Thus, if it can be manipulated programmatically and exposes a fine-tuning interface, it can be integrated into the LLM Twin system we will learn to build. The key to most successful ML products is to be data-centric and make your architecture model-agnostic. Thus, you can quickly experiment with multiple models on your specific data.
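
To make the model-agnostic idea tangible, here is a minimal sketch; the TextGenerator interface and write_post helper are hypothetical illustrations, not part of the system built in this book:

```python
from typing import Protocol


class TextGenerator(Protocol):
    """Anything that can turn a prompt into text satisfies this interface."""

    def generate(self, prompt: str) -> str: ...


def write_post(llm: TextGenerator, idea: str) -> str:
    # The rest of the system depends only on the interface, so swapping the
    # underlying model (fine-tuned open-source LLM, hosted API, and so on)
    # does not touch the data collection, RAG, or evaluation code.
    return llm.generate(f"Write a LinkedIn post about: {idea}")
```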

Planning the MVP of the LLM Twin product

Now that we understand what an LLM Twin is and why we want to build it, we must clearly define the product’s features. In this book, we will focus on the first iteration, often labeled the minimum viable product (MVP), to follow the natural cycle of most products. Here, the main objective is to align our ideas with realistic and doable business objectives using the available resources to produce the product. Even as an engineer, as your responsibilities grow, you must go through these steps to bridge the gap between the business needs and what can be implemented.

What is an MVP?

An MVP is a version of a product that includes just enough features to draw in early users and test the viability of the product concept in the initial stages of development. Usually, the purpose of the MVP is to gather insights from the market with minimal effort.

An MVP is a powerful strategy because of the following reasons:

Accelerated time-to-market: Launch a product quickly to gain early traction
Idea validation: Test it with real users before investing in the full development of the product
Market research: Gain insights into what resonates with the target audience
Risk minimization: Reduces the time and resources needed for a product that might not achieve market success

Sticking to the V in MVP is essential, meaning the product must be viable. The product must provide an end-to-end user journey without half-implemented features, even if the product is minimal. It must be a working product with a good user experience that people will love and want to keep using to see how it evolves to its full potential.

Defining the LLM Twin MVP

As a thought experiment, let’s assume that instead of building this project for this book, we want to make a real product. In that case, what are our resources? Well, unfortunately, not many:

We are a team of three people with two ML engineers and one ML researcher
Our laptops
Personal funding for computing, such as training LLMs
Our enthusiasm

As you can see, we don’t have many resources. Even if this is just a thought experiment, it reflects the reality for most start-ups at the beginning of their journey. Thus, we must be very strategic in defining our LLM Twin MVP and what features we want to pick. Our goal is simple: we want to maximize the product’s value relative to the effort and resources poured into it.

To keep it simple, we will build the features that can do the following for the LLM Twin:

Collect data from your LinkedIn, Medium, Substack, and GitHub profiles
Fine-tune an open-source LLM using the collected data
Populate a vector database (DB) using our digital data for RAG
Create LinkedIn posts leveraging the following:
    User prompts
    RAG to reuse and reference old content
    New posts, articles, or papers as additional knowledge to the LLM
Have a simple web interface to interact with the LLM Twin and be able to do the following:
    Configure your social media links and trigger the collection step
    Send prompts or links to external resources

That will be the LLM Twin MVP. Even if it doesn’t sound like much, remember that we must make this system cost-effective, scalable, and modular.

Even if we focus only on the core features of the LLM Twin defined in this section, we will build the product with the latest LLM research and best software engineering and MLOps practices in mind. We aim to show you how to engineer a cost-effective and scalable LLM application.

Until now, we have examined the LLM Twin from the users’ and businesses’ perspectives. The last step is to examine it from an engineering perspective and define a development plan to understand how to solve it technically. From now on, the book’s focus will be on the implementation of the LLM Twin.

Building ML systems with feature/training/inference pipelines

Before diving into the specifics of the LLM Twin architecture, we must understand an ML system pattern at the core of the architecture, known as the feature/training/inference (FTI) architecture. This section will present a general overview of the FTI pipeline design and how it can structure an ML application.

Let’s see how we can apply the FTI pipelines to the LLM Twin architecture.

The problem with building ML systems

Building production-ready ML systems is much more than just training a model. From an engineering point of view, training the model is the most straightforward step in most use cases. However, training a model becomes complex when deciding on the correct architecture and hyperparameters. That’s not an engineering problem but a research problem.

At this point, we want to focus on how to design a production-ready architecture. Training a model with high accuracy is extremely valuable, but just by training it on a static dataset, you are far from deploying it robustly. We have to consider how to do the following:

Ingest, clean, and validate fresh data
Training versus inference setups
Compute and serve features in the right environment
Serve the model in a cost-effective way
Version, track, and share the datasets and models
Monitor your infrastructure and models
Deploy the model on a scalable infrastructure
Automate the deployments and training

These are the types of problems an ML or MLOps engineer must consider, while the research or data science team is often responsible for training the model.

Figure 1.1: Common elements from an ML system

The preceding figure shows all the components the Google Cloud team suggests that a mature ML and MLOps system requires. Along with the ML code, there are many moving pieces. The rest of the system comprises configuration, automation, data collection, data verification, testing and debugging, resource management, model analysis, process and metadata management, serving infrastructure, and monitoring. The point is that there are many components we must consider when productionizing an ML model.

Thus, the critical question is this: How do we connect all these components into a single homogenous system? We must create a boilerplate for clearly designing ML systems to answer that question.

Similar solutions exist for classic software. For example, if you zoom out, most software applications can be split between a DB, business logic, and UI layer. Every layer can be as complex as needed, but at a high-level overview, the architecture of standard software can be boiled down to the previous three components.

Do we have something similar for ML applications? The first step is to examine previous solutions and why they are unsuitable for building scalable ML systems.

The issue with previous solutions

In Figure 1.2, you can observe the typical architecture present in most ML applications. It is based on a monolithic batch architecture that couples the feature creation, model training, and inference into the same component. By taking this approach, you quickly solve one critical problem in the ML world: the training-serving skew. The training-serving skew happens when the features passed to the model are computed differently at training and inference time.

In this architecture, the features are created using the same code. Hence, the training-serving skew issue is solved by default. This pattern works fine when working with small data. The pipeline runs on a schedule in batch mode, and the predictions are consumed by a third-party application such as a dashboard.

Figure 1.2: Monolithic batch pipeline architecture
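
As a minimal illustration (the column names and model below are made up for this sketch, not taken from the book), this is what such a monolithic batch job looks like when a single function computes the features for both training and prediction, which is exactly why the training-serving skew disappears:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression


def compute_features(raw: pd.DataFrame) -> pd.DataFrame:
    # Single source of truth for the feature logic: training and inference
    # both call this function, so there is no training-serving skew.
    return pd.DataFrame(
        {
            "text_length": raw["text"].str.len(),
            "num_links": raw["text"].str.count("http"),
        }
    )


def run_batch_job(raw: pd.DataFrame) -> pd.Series:
    # Feature creation, training, and inference are coupled in one component
    # that runs on a schedule and writes its predictions somewhere downstream.
    features = compute_features(raw)
    model = LogisticRegression().fit(features, raw["label"])
    return pd.Series(model.predict(features), index=raw.index, name="prediction")
```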

Unfortunately, building a monolithic batch system raises many other issues, such as the following:

Features are not reusable (by your system or others)
If the data increases, you have to refactor the whole code to support PySpark or Ray
It’s hard to rewrite the prediction module in a more efficient language such as C++, Java, or Rust
It’s hard to share the work between multiple teams across the feature, training, and prediction modules
It’s impossible to switch to streaming technology for real-time training

In Figure 1.3, we can see a similar scenario for a real-time system. This use case introduces another issue in addition to what we listed before. To make the predictions, we have to transfer the whole state through the client request so the features can be computed and passed to the model.

Consider the scenario of computing movie recommendations for a user. Instead of simply passing the user ID, we must transmit the entire user state, including their name, age, gender, movie history, and more. This approach is fraught with potential errors, as the client must understand how to access this state, and it’s tightly coupled with the model service.

Another example would be when implementing an LLM with RAG support. The documents we add as context along with the query represent our external state. If we didn’t store the records in a vector DB, we would have to pass them with the user query. To do so, the client must know how to query and retrieve the documents, which is not feasible. It is an antipattern for the client application to know how to access or compute the features. If you don’t understand how RAG works, we will explain it in detail in Chapters 4 and 9.

Figure 1.3: Stateless real-time architecture

In conclusion, our problem is accessing the features to make predictions without passing them at the client’s request. For example, based on our first user movie recommendation example, how can we predict the recommendations solely based on the user’s ID? Remember these questions, as we will answer them shortly.
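
The contrast between the two request styles can be sketched as follows (the FeatureStore interface and the feature choices are hypothetical, used only to show where the feature logic lives):

```python
from typing import Any, Protocol


class FeatureStore(Protocol):
    # Hypothetical interface: precomputed features are looked up by entity ID.
    def get_features(self, user_id: str) -> list[float]: ...


def predict_with_client_state(model: Any, user_state: dict) -> list:
    # Antipattern: the client must know how to assemble the model's features
    # (age, watch history, and so on) and send the whole state with every request.
    features = [user_state["age"], len(user_state["movie_history"])]
    return list(model.predict([features]))


def predict_by_user_id(model: Any, store: FeatureStore, user_id: str) -> list:
    # What we want: the client sends only an ID, and the service fetches the
    # precomputed features itself.
    features = store.get_features(user_id)
    return list(model.predict([features]))
```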

Ultimately, at the other end of the spectrum, Google Cloud provides a production-ready architecture, as shown in Figure 1.4. Unfortunately, even if it’s a feasible solution, it’s very complex and not intuitive. You will have difficulty understanding this if you are not highly experienced in deploying and keeping ML models in production. Also, it is not straightforward to understand how to start small and grow the system over time.

The following image is reproduced from work created and shared by Google and used according to terms described in the Creative Commons 4.0 Attribution License:

Figure 1.4: ML pipeline automation for CT (source: https://cloud.google.com/architecture/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning)

But here is where the FTI pipeline architecture kicks in. The following section will show you how to solve these fundamental issues using an intuitive ML design.

The solution – ML pipelines for ML systems

The solution is based on creating a clear and straightforward mind map that any team or person can follow to compute the features, train the model, and make predictions. Based on these three critical steps that any ML system requires, the pattern is known as the FTI pipeline. So, how does this differ from what we presented before?

The pattern suggests that any ML system can be boiled down to these three pipelines: feature, training, and inference (similar to the DB, business logic, and UI layers from classic software). This is powerful, as we can clearly define the scope and interface of each pipeline. Also, it’s easier to understand how the three components interact. Ultimately, we have just three instead of 20 moving pieces, as suggested in Figure 1.4, which is much easier to work with and define.

As shown in Figure 1.5, we have the feature, training, and inference pipelines. We will zoom in on each of them and understand their scope and interface.

Figure 1.5: FTI pipelines architecture

Before going into the details, it is essential to understand that each pipeline is a different component that can run on a different process or hardware. Thus, each pipeline can be written using a different technology, by a different team, or scaled differently. The key idea is that the design is very flexible to the needs of your team. It acts as a mind map for structuring your architecture.

The feature pipeline

The feature pipeline takes raw data as input, processes it, and outputs the features and labels required by the model for training or inference. Instead of directly passing them to the model, the features and labels are stored inside a feature store. Its responsibility is to store, version, track, and share the features. By saving the features in a feature store, we always have a state of our features. Thus, we can easily send the features to the training and inference pipelines.

As the data is versioned, we can always ensure that the training and inference time features match. Thus, we avoid the training-serving skew problem.
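
A minimal sketch of this interface, assuming a hypothetical feature_store object with write/read methods (the concrete tooling used for the LLM Twin comes later in the book):

```python
def feature_pipeline(raw_records: list[dict], feature_store) -> None:
    # Turn raw data into the features and labels the model needs.
    features = [[float(len(record["text"]))] for record in raw_records]
    labels = [record["label"] for record in raw_records]
    # Instead of handing them to the model directly, persist them to the
    # feature store, which versions, tracks, and shares them with the
    # training and inference pipelines.
    feature_store.write(features=features, labels=labels, version="v1")
```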

The training pipeline

The training pipeline takes the features and labels from the feature store as input and outputs a trained model or models. The models are stored in a model registry. Its role is similar to that of feature stores, but this time, the model is the first-class citizen. Thus, the model registry will store, version, track, and share the model with the inference pipeline.

Also, most modern model registries support a metadata store that allows you to specify essential aspects of how the model was trained. The most important are the features, labels, and their version used to train the model. Thus, we will always know what data the model was trained on.
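
Continuing the same sketch, with the same hypothetical feature_store plus a hypothetical model_registry (any training framework could replace the scikit-learn model here):

```python
from sklearn.linear_model import LogisticRegression


def training_pipeline(feature_store, model_registry, features_version: str = "v1") -> None:
    # Read the exact feature/label snapshot produced by the feature pipeline.
    features, labels = feature_store.read(version=features_version)
    model = LogisticRegression().fit(features, labels)
    # Register the trained model together with the data version it was trained
    # on, so the lineage between models and features is never lost.
    model_registry.register(model, metadata={"features_version": features_version})
```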

The inference pipeline

The inference pipeline takes as input the features and labels from the feature store and the trained model from the model registry. With these two, predictions can be easily made in either batch or real-time mode.

As this is a versatile pattern, it is up to you to decide what you do with your predictions. If it’s a batch system, they will probably be stored in a DB. If it’s a real-time system, the predictions will be served to the client who requested them. Additionally, the features, labels, and models are versioned. We can easily upgrade or roll back the deployment of the model. For example, we will always know that model v1 uses features F1, F2, and F3, and model v2 uses F2, F3, and F4. Thus, we can quickly change the connections between the model and features.
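
And the last piece of the sketch, again with the hypothetical feature_store and model_registry interfaces:

```python
def inference_pipeline(feature_store, model_registry, entity_id: str, model_version: str = "v1"):
    # Both inputs are versioned artifacts, so upgrading or rolling back a
    # deployment is just a matter of changing the version strings.
    model = model_registry.load(version=model_version)
    features = feature_store.get_features(entity_id)
    prediction = model.predict([features])
    # What happens next depends on the system: store it in a DB for a batch
    # consumer, or return it to the client in a real-time setting.
    return prediction
```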

Benefits of the FTI architecture

To conclude, the most important thing you must remember about the FTI pipelines is their interface:

The feature pipeline takes in data and outputs the features and labels saved to the feature store.
The training pipeline queries the feature store for features and labels and outputs a model to the model registry.
The inference pipeline uses the features from the feature store and the model from the model registry to make predictions.

No matter how complex your ML system gets, these interfaces will remain the same.

Now that we understand better how the pattern works, we want to highlight the main benefits of using this pattern:

As you have just three components, it is intuitive to use and easy to understand.
Each component can be written in its own tech stack, so we can quickly adapt them to specific needs, such as big or streaming data. Also, it allows us to pick the best tools for the job.
As there is a transparent interface between the three components, each one can be developed by a different team (if necessary), making the development more manageable and scalable.
Every component can be deployed, scaled, and monitored independently.

The final thing you must understand about the FTI pattern is that the system doesn’t have to contain only three pipelines. In most cases, it will include more. For example, the feature pipeline can be composed of a service that computes the features and one that validates the data. Also, the training pipeline can be composed of the training and evaluation components.

The FTI pipelines act as logical layers. Thus, it is perfectly fine for each to be complex and contain multiple services. However, what is essential is to stick to the same interface on how the FTI pipelines interact with each other through the feature store and model registries. By doing so, each FTI component can evolve differently, without knowing the details of each other and without breaking the system on new changes.
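
For instance, a feature pipeline that internally runs a validation service and a feature-computation service still exposes the same external contract; a minimal sketch of this idea (all names hypothetical, a variation of the earlier feature pipeline sketch):

```python
def validate(raw_records: list[dict]) -> list[dict]:
    # Internal service 1: keep only the records that have the fields we need.
    return [r for r in raw_records if "text" in r and "label" in r]


def compute(records: list[dict]) -> tuple[list[list[float]], list]:
    # Internal service 2: turn the validated records into features and labels.
    features = [[float(len(r["text"]))] for r in records]
    labels = [r["label"] for r in records]
    return features, labels


def feature_pipeline(raw_records: list[dict], feature_store) -> None:
    # The logical layer may contain several services, but its contract with the
    # rest of the system is unchanged: it still only writes to the feature store.
    features, labels = compute(validate(raw_records))
    feature_store.write(features=features, labels=labels, version="v2")
```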

To learn more about the FTI pipeline pattern, consider reading From MLOps to ML Systems with Feature/Training/Inference Pipelines by Jim Dowling, CEO and co-founder of Hopsworks: https://www.hopsworks.ai/post/mlops-to-ml-systems-with-fti-pipelines. His article inspired this section.

Now that we understand the FTI pipeline architecture, the final step of this chapter is to see how it can be applied to the LLM Twin use case.

Designing the system architecture of the LLM Twin

In this section, we will list the concrete technical details of the LLM Twin application and understand how we can solve them by designing our LLM system using the FTI architecture. However, before diving into the pipelines, we want to highlight that we won’t focus on the tooling or the tech stack at this step. We only want to define a high-level architecture of the system, which is language-, framework-, platform-, and infrastructure-agnostic at this point. We will focus on each component’s scope, interface, and interconnectivity. In future chapters, we will cover the implementation details and tech stack.

Listing the technical details of the LLM Twin architecture

Until now, we defined what the LLM Twin should support from the user’s point of view. Now, let’s clarify the requirements of the ML system from a purely technical perspective:

On the data side, we have to do the following:
    Collect data from LinkedIn, Medium, Substack, and GitHub completely autonomously and on a schedule
    Standardize the crawled data and store it in a data warehouse
    Clean the raw data
    Create instruct datasets for fine-tuning an LLM
    Chunk and embed the cleaned data. Store the vectorized data into a vector DB for RAG.
For training, we have to do the following:
    Fine-tune LLMs of various sizes (7B, 14B, 30B, or 70B parameters)
    Fine-tune on instruction datasets of multiple sizes
    Switch between LLM types (for example, between Mistral, Llama, and GPT)
    Track and compare experiments
    Test potential production LLM candidates before deploying them
    Automatically start the training when new instruction datasets are available.
The inference code will have the following properties:
    A REST API interface for clients to interact with the LLM Twin
    Access to the vector DB in real time for RAG
    Inference with LLMs of various sizes
    Autoscaling based on user requests
    Automatically deploy the LLMs that pass the evaluation step.
The system will support the following LLMOps features:
    Instruction dataset versioning, lineage, and reusability
    Model versioning, lineage, and reusability
    Experiment tracking
    Continuous training, continuous integration, and continuous delivery (CT/CI/CD)
    Prompt and system monitoring

If any technical requirement doesn’t make sense now, bear with us. To avoid repetition, we will examine the details in their specific chapter.

The preceding list is quite comprehensive. We could have detailed it even more, but at this point, we want to focus on the core functionality. When implementing each component, we will look into all the little details. But for now, the fundamental question we must ask ourselves is this: How can we apply the FTI pipeline design to implement the preceding list of requirements?

How to design the LLM Twin architecture using the FTI pipeline design

We will split the system into four core