Large Language Model-Based Solutions

Shreyas Subramanian

Description

Learn to build cost-effective apps using Large Language Models

In Large Language Model-Based Solutions: How to Deliver Value with Cost-Effective Generative AI Applications, Principal Data Scientist at Amazon Web Services Shreyas Subramanian delivers a practical guide for developers and data scientists who wish to build and deploy cost-effective large language model (LLM)-based solutions. In the book, you'll find coverage of a wide range of key topics, including how to select a model, pre- and post-processing of data, prompt engineering, and instruction fine-tuning. The author sheds light on techniques for optimizing inference, like model quantization and pruning, as well as affordable architectures for typical generative AI (GenAI) applications, including search systems, agent assists, and autonomous agents. You'll also find:

* Effective strategies to address the challenge of the high computational cost associated with LLMs
* Assistance with the complexities of building and deploying affordable generative AI apps, including tuning and inference techniques
* Selection criteria for choosing a model, with particular consideration given to compact, nimble, and domain-specific models

Perfect for developers and data scientists interested in deploying foundational models, or business leaders planning to scale out their use of GenAI, Large Language Model-Based Solutions will also benefit project leaders and managers, technical support staff, and administrators with an interest or stake in the subject.

Publication year: 2024



Table of Contents

Cover

Table of Contents

Title Page

Introduction

GenAI APPLICATIONS AND LARGE LANGUAGE MODELS

IMPORTANCE OF COST OPTIMIZATION

MICRO CASE STUDIES

WHO IS THIS BOOK FOR?

SUMMARY

1 Introduction

OVERVIEW OF GenAI APPLICATIONS AND LARGE LANGUAGE MODELS

PATHS TO PRODUCTIONIZING GenAI APPLICATIONS

THE IMPORTANCE OF COST OPTIMIZATION

SUMMARY

2 Tuning Techniques for Cost Optimization

FINE‐TUNING AND CUSTOMIZABILITY

PARAMETER‐EFFICIENT FINE‐TUNING METHODS

COST AND PERFORMANCE IMPLICATIONS OF PEFT METHODS

SUMMARY

3 Inference Techniques for Cost Optimization

INTRODUCTION TO INFERENCE TECHNIQUES

PROMPT ENGINEERING

CACHING WITH VECTOR STORES

CHAINS FOR LONG DOCUMENTS

SUMMARIZATION

BATCH PROMPTING FOR EFFICIENT INFERENCE

MODEL OPTIMIZATION METHODS

PARAMETER‐EFFICIENT FINE‐TUNING METHODS

COST AND PERFORMANCE IMPLICATIONS

SUMMARY

REFERENCES

4 Model Selection and Alternatives

INTRODUCTION TO MODEL SELECTION

MOTIVATING EXAMPLE: THE TALE OF TWO MODELS

THE ROLE OF COMPACT AND NIMBLE MODELS

EXAMPLES OF SUCCESSFUL SMALLER MODELS

DOMAIN‐SPECIFIC MODELS

THE POWER OF PROMPTING WITH GENERAL‐PURPOSE MODELS

SUMMARY

5 Infrastructure and Deployment Tuning Strategies

INTRODUCTION TO TUNING STRATEGIES

HARDWARE UTILIZATION AND BATCH TUNING

INFERENCE ACCELERATION TOOLS

MONITORING AND OBSERVABILITY

SUMMARY

CONCLUSION

BALANCING PERFORMANCE AND COST

FUTURE TRENDS IN GenAI APPLICATIONS

SUMMARY

INDEX

Copyright

Dedication

ABOUT THE AUTHOR

ABOUT THE TECHNICAL EDITOR

End User License Agreement

List of Tables

Chapter 1

TABLE 1.1: GPT 3.5 TPS benchmark results

TABLE 1.2: Percentage of requests served within a certain time period in se...

TABLE 1.3: Benchmark results for vector DB

Chapter 2

TABLE 2.1: Comparing different GPUs for full and LoRA‐based training on ful...

Chapter 3

TABLE 3.1: Estimated model size in GBs

Chapter 4

TABLE 4.1: Compute effort for the Microsoft Phi series of models

TABLE 4.2: Performance of Phi 2 model compared to larger models such as Mis...

Chapter 5

TABLE 5.1: Experiment recording memory required for increasing sizes of mod...

Conclusion

TABLE 1 GenAI starter team

List of Illustrations

Introduction

FIGURE 1: Google Trends chart of interest over time for the term Generative...

Chapter 1

FIGURE 1.1: Evolutionary tree of language models (see Rice University / https...

FIGURE 1.2: Evolutionary tree of human brain structure (see André M.M. Sousa...

FIGURE 1.3: Sequence diagram of the request flow through a GenAI chatbot app...

FIGURE 1.4: Sequence diagram of the request flow through a GenAI chatbot app...

FIGURE 1.5: Tokens generated for the text prompt. Notice how the misspelled ...

FIGURE 1.6: Token IDs of the tokens generated from the original text prompt ...

FIGURE 1.7: Tokens generated for the long example sentence

FIGURE 1.8: Vector DB benchmarking architecture

FIGURE 1.9: Snapshot 1 of the cost calculator showing costs per hour for var...

FIGURE 1.10: Snapshot 2 of the cost calculator showing costs per hour for va...

Chapter 2

FIGURE 2.1: IsoLoss contours from the Chinchilla paper, which shows models s...

FIGURE 2.2: Number of tokens to train with to reach ideal loss for a given c...

FIGURE 2.3: The basic decoder‐only block of popular models such as GPT

FIGURE 2.4: Modifying the original decoder block for prompt tuning

FIGURE 2.5: Modifying the original decoder block for ensemble prompt tuning...

FIGURE 2.6: Modifying the original decoder block for prefix tuning

FIGURE 2.7: Prompt template being converted to pseudo‐tokens for P‐tuning

FIGURE 2.8: Modifying the original decoder block for P‐tuning

FIGURE 2.9: Modifying the original decoder block for IA3

FIGURE 2.10: Modifying the original decoder block for MTP

FIGURE 2.11: Modifying the original decoder block for LoRA

Chapter 3

FIGURE 3.1: Sequence diagram for typical RAG systems

FIGURE 3.2: Snapshot of the first page of the 2022 Amazon 10‐K filing

FIGURE 3.3: Claude response for the improved prompt

FIGURE 3.4: Basic pattern of caching requests with a vector store for infere...

FIGURE 3.5: Caching requests when there are multiple LLMs behind the scene

FIGURE 3.6: Sequential processing workflow diagram for long documents

FIGURE 3.7: Parallel processing workflow diagram for long documents

FIGURE 3.8: How to dynamically use multiple adapters in inference

FIGURE 3.9: An adapter predictor component is introduced to automatically se...

Chapter 4

FIGURE 4.1: Performance of Zephyr 7B model compared to other models on langu...

FIGURE 4.2: Performance of CogVLM model compared to several other models on ...

FIGURE 4.3: Pearson correlation between human annotator scores and those fro...

FIGURE 4.4: Visualizing the contours of the Chinchilla scaling laws with A =...

FIGURE 4.5: Relative performance of the Gemini Nano models compared to the l...

FIGURE 4.6: Output from a generic tokenizer on financial data

FIGURE 4.7: Comparing the outputs from the generic tokenizer, and a trained ...

FIGURE 4.8: Compute capacity required to train different models. Adapted fro...

FIGURE 4.9: Setting up Autotrain on huggingface

FIGURE 4.10: Creating a training job in Autotrain on huggingface

FIGURE 4.11: Creating a custom model in Amazon Bedrock

FIGURE 4.12: Creating a custom model in Google Vertex AI

FIGURE 4.13: Creating a custom model in Microsoft Azure

FIGURE 4.14: Evaluating domain‐specific models: BloombergGPT results on finan...

FIGURE 4.15: Evaluating Medprompt with GPT‐4 against Med Palm2

FIGURE 4.16: Increasing accuracy using multiple strategies of Medprompt with...

Chapter 5

FIGURE 5.1: Self‐attention calculation visualized (Alammar, J (2018) / Jay A...

FIGURE 5.2: Self‐attention calculation happening within the head of the tran...

FIGURE 5.3: Self‐attention using bertviz neuron view within a head of the tr...

FIGURE 5.4: 0 MB GPU memory utilization at the beginning

FIGURE 5.5: 119 MB GPU memory utilization after loading a size 1 tensor with...

FIGURE 5.6: 651 MB GPU after initializing a GPTJ model with 137M parameters...

FIGURE 5.7: 5689 MB GPU after initializing an OPT model with 1.3 billion par...

FIGURE 5.8: GPU memory space occupied by the model weights compared to the K...

FIGURE 5.9: Schematic showing how PagedAttention works in vLLM. This can be ...

FIGURE 5.10: Collocation of multiple models can lead to better performance (...

FIGURE 5.11: Results comparing the performance of S3 to another system ORCA,...

FIGURE 5.12: KV cache in a streaming context with StreamingLLM

FIGURE 5.13: High‐level guidance for selecting parameters in the serving pro...

FIGURE 5.14: High‐level guidance for selecting parameters in the serving pro...

FIGURE 5.15: Parallel coordinates plot visualizing the inference HPO results...

FIGURE 5.16: Steps in LLMOps

Guide

Cover

Table of Contents

Title Page

Copyright

Dedication

ABOUT THE AUTHOR

ABOUT THE TECHNICAL EDITOR

Introduction

Begin Reading

CONCLUSION

INDEX

End User License Agreement


Large Language Model–Based Solutions

HOW TO DELIVER VALUE WITH COST-EFFECTIVE GENERATIVE AI APPLICATIONS


Shreyas Subramanian


Introduction

WHAT'S IN THIS CHAPTER?

GenAI Applications and Large Language Models

Importance of Cost Optimization

Micro Case Studies

Who Is This Book For?

GenAI APPLICATIONS AND LARGE LANGUAGE MODELS

Large language models (LLMs) have evolved to become a cornerstone in the domain of text‐based content generation. They can produce coherent and contextually relevant text for a variety of applications, making them invaluable assets in today's digital landscape. One notable example is OpenAI's GPT‐4, which reportedly ranked in the 90th percentile of human test takers on the Uniform Bar Examination, showcasing its advanced language understanding and generation capabilities. Generative AI tools such as ChatGPT may use LLMs, but also other kinds of large models (e.g., foundational vision models). These models serve as the backbone for many modern applications, facilitating a multitude of tasks that would otherwise require substantial human effort to build bespoke, application‐specific models. The capabilities of these models to understand, interpret, and generate human‐like text are not only pushing the boundaries of what's achievable with AI but also unlocking new avenues for innovation across different sectors. To reemphasize what's already obvious, Figure 1 shows Google Trends interest over time for the term Generative AI worldwide.

FIGURE 1:  Google Trends chart of interest over time for the term Generative AI worldwide

Generative AI (GenAI) and LLMs represent two interlinked domains within artificial intelligence, both focusing on content generation but from slightly different angles. GenAI encompasses a broader category of AI technologies aimed at creating original content. While LLMs excel at text processing and production, GenAI places a broader emphasis on creativity and content generation across different mediums. Understanding the distinctions and potential synergies between these two areas is crucial to fully harness the benefits of AI in various applications, ranging from automated customer service and content creation to more complex tasks such as code generation and debugging. GenAI has seen rapid advancements, enabling enterprises to automate intelligence across multiple domains and significantly accelerate innovation in AI development. LLMs, being a subset of GenAI, are specialized in processing and generating text; they have demonstrated remarkable capabilities, notably in natural language processing tasks and beyond, with a substantial influx of research contributions propelling their success.

The proliferation of LLMs and GenAI applications has been fueled by both competitive advancements and collaborative efforts within the AI community, with various stakeholders including tech giants, academic institutions, and individual researchers contributing to the rapid progress witnessed in recent years. In the following sections, we will talk about the importance of cost optimization in this era of LLMs, explore a few case studies of successful companies in this area, and describe the scope of the rest of the book.

IMPORTANCE OF COST OPTIMIZATION

The importance of cost optimization in the development and operation of GenAI applications and LLMs cannot be overstated. Cost can ultimately make or break a company's adoption of GenAI. This necessity stems from various aspects of these technologically advanced models. GenAI and LLMs are resource‐intensive by nature, necessitating substantial computational resources to perform complex tasks. Training state‐of‐the‐art LLMs such as OpenAI's GPT‐3 can involve weeks or even months of high‐performance computing. This extensive computational demand translates into increased costs for organizations leveraging cloud infrastructure and operating models.

The financial burden of developing GenAI models is considerable. For instance, McKinsey estimates that developing a single generative AI model costs up to $200 million, with up to $10 million required to customize an existing model with internal data and up to $2 million needed for deployment. Moreover, the cost per token generated during inference for newer models like GPT‐4 is estimated to be 30 times more than that of GPT‐3.5, showing a trend of rising costs with advancements in model capabilities. The daily operational cost for running large models like ChatGPT is significant as well, with OpenAI reported to spend $700,000 daily to maintain the model's operations.
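To see how per‐token pricing translates into an operating budget, consider a back‐of‐the‐envelope estimate like the following Python sketch. The prices and traffic figures are illustrative assumptions, not any vendor's actual rates.

# Back-of-the-envelope inference cost estimate. The per-token prices
# below are illustrative assumptions, not current vendor list prices.
PRICE_PER_1K_INPUT_TOKENS = 0.03   # assumed $ per 1K prompt tokens
PRICE_PER_1K_OUTPUT_TOKENS = 0.06  # assumed $ per 1K completion tokens

def monthly_cost(requests_per_day, input_tokens, output_tokens):
    """Estimate monthly spend for a fixed per-request token budget."""
    per_request = (
        (input_tokens / 1000) * PRICE_PER_1K_INPUT_TOKENS
        + (output_tokens / 1000) * PRICE_PER_1K_OUTPUT_TOKENS
    )
    return per_request * requests_per_day * 30

# Example: 50,000 requests/day, 500 prompt and 250 completion tokens each.
print(f"${monthly_cost(50_000, 500, 250):,.0f} per month")  # $45,000 per month

Even modest traffic adds up quickly at these rates, which is one reason the optimization techniques covered in later chapters matter.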

GenAI models require high utilization of specialized hardware like graphics processing units (GPUs) and tensor processing units (TPUs) to accelerate model training and inference. These specialized hardware units come at a premium cost in cloud infrastructure, further driving up the expenses. Companies trying to do this on‐premises, without the help of cloud providers, may need a significant, up‐front capital investment.

Beyond compute requirements, large‐scale, high‐performance data storage is imperative for training and fine‐tuning GenAI models, with the storage and management of extensive datasets incurring additional cloud storage costs. As AI models evolve and adapt to ever‐increasing stores of data (like the Internet), ongoing storage requirements further contribute to overall expenses. This is why scalability poses a significant challenge in cost optimization. Rapid scaling to accommodate the resource demands of GenAI applications can lead to cost inefficiencies if not managed effectively. Overscaling can result in underutilized resources and unnecessary expenditure, whereas underscaling may hinder model performance and productivity.

Strategies to optimize costs while scaling GenAI in large organizations include prioritizing education across all teams, creating spaces for innovation, and reviewing internal processes to adapt for faster innovation where possible.

Pre‐training a large language model to perform fundamental tasks serves as a foundation for an AI system, which can then be fine‐tuned at a lower cost to perform a wide range of specific tasks. This approach aids in cost optimization while retaining model effectiveness for specific tasks.

Conducting a thorough cost‐value assessment to rank and prioritize GenAI implementations based on potential impact, cost, and complexity can lead to better financial management and realization of ROI in GenAI initiatives. Lastly, the most common pattern seen today is for “model providers” to bear the up‐front training spend and recoup it by offering an API, and for “model consumers” to keep their own costs low by using those GenAI model APIs without any up‐front investment or even data of their own.

Challenges and Opportunities

The pathway to cost optimization in GenAI applications with large language models is laden with both challenges and opportunities. These arise from the inherent complexities of the models and the evolving landscape of AI technologies. The following are the principal challenges and the accompanying opportunities in this domain:

Computational demands:

LLMs like GPT‐3 or BERT require substantial computational resources for training and inference. The high computational demands translate to increased operational costs and energy consumption, which may create barriers, especially for small to medium‐sized enterprises (SMEs) with limited resources.

Opportunity: The challenge of computational demands opens the door for innovation in developing more efficient algorithms, hardware accelerators, and cloud‐based solutions that can reduce the cost and energy footprint of operating LLMs.

Model complexity:

The complexity of LLMs, both in terms of architecture and the amount of training data required, presents challenges in achieving cost optimization. The model's size often correlates with its performance, with larger models generally delivering better results at the expense of increased costs.

Opportunity: This challenge catalyzes the exploration and adoption of techniques such as model pruning, quantization, and knowledge distillation that aim to reduce model size while retaining or even enhancing performance. (A minimal quantization sketch appears after this list.)

Data privacy and security:

Handling sensitive data securely is a paramount concern, especially in sectors such as healthcare and finance. The cost of ensuring data privacy and security while training and deploying LLMs can be significant.

Opportunity: The necessity for robust data privacy and security solutions fosters innovation in privacy‐preserving techniques, such as federated learning, differential privacy, and encrypted computation.

Scalability:

Scaling GenAI applications to accommodate growing data and user demands without a proportional increase in costs is a formidable challenge.

Opportunity: This challenge drives the advancement of scalable architectures and technologies that allow for efficient scaling, such as microservices, container orchestration, and serverless computing.

Model generalizability and domain adaptation:

Achieving high performance on domain‐specific tasks often requires fine‐tuning LLMs with additional data, which can be cost‐intensive.

Opportunity: This creates a niche for developing techniques and frameworks that facilitate efficient domain adaptation and transfer learning, enabling cost‐effective customization of LLMs for various domain‐specific applications.

Evolving regulatory landscape:

The regulatory landscape surrounding AI and data usage is continually evolving, potentially incurring compliance costs.

Opportunity: The dynamic regulatory environment stimulates the development of adaptable AI systems and compliance monitoring tools that can mitigate the risks and costs associated with regulatory compliance.
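As referenced under model complexity above, here is a minimal sketch of one such technique: post‐training dynamic quantization in PyTorch. The toy model stands in for a real transformer, and the rough 4x size reduction comes from storing weights as int8 instead of fp32.

import os
import torch
import torch.nn as nn

# A toy stand-in for a trained model; real use targets a full transformer.
model = nn.Sequential(nn.Linear(768, 3072), nn.ReLU(), nn.Linear(3072, 768))

# Post-training dynamic quantization: weights of the listed module types
# are stored as int8 and dequantized on the fly at inference time.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

def size_mb(m, path="checkpoint.pt"):
    """Serialize a model and report its on-disk size in megabytes."""
    torch.save(m.state_dict(), path)
    size = os.path.getsize(path) / 1e6
    os.remove(path)
    return size

print(f"fp32: {size_mb(model):.1f} MB, int8: {size_mb(quantized):.1f} MB")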

Each of these challenges, while posing hurdles, concurrently lays the groundwork for innovation and advancements that can significantly contribute to cost optimization in GenAI applications with large foundational models. The confluence of these challenges is an important factor in propelling the field of GenAI forward, fostering the development of cost‐effective, efficient, and robust GenAI packages, software, and solutions. The myriad of factors contributing to the high costs in the development, deployment, and operation of GenAI and LLMs necessitates a structured approach toward cost optimization to ensure the sustainable adoption and scalability of these transformative technologies. This book dives into the details of what makes GenAI applications powerful but costly and highlights several aspects of balancing performance with cost to ensure the success of organizations that make use of large foundational models. Next, we will look at a few case studies as motivation for the rest of the book.

MICRO CASE STUDIES

This section focuses on three different companies that have “walked the walk” in terms of putting large models in production. What “in production” means is different for different companies, as you will see in the following studies. The case studies should provide a glimpse into the kind of effort and investment required to be involved in the deployment and production usage of foundational models like LLMs in the form of GenAI applications.

OpenAI: Leading the Way

Founded in 2015, OpenAI embarked on a mission to ensure that artificial general intelligence (AGI) benefits all of humanity. Initially operating as a nonprofit, it pledged to collaborate freely with other institutions and researchers, making its patents and research public. The early years saw the launch of OpenAI Gym and Universe, platforms dedicated to reinforcement learning research and to measuring AI's general intelligence across a spectrum of tasks.

As AI technology advanced, OpenAI rolled out GPT‐1 in 2018, marking its venture into robust language models. GPT‐1, with 117 million parameters, showcased the potential of generating coherent language from prompts, although it had its limitations such as generating repetitive text. Addressing these challenges, OpenAI unveiled GPT‐2 in 2019 with 1.5 billion parameters, offering improved text generation capabilities. In 2020, the release of GPT‐3, a behemoth with 175 billion parameters, set a new standard in the NLP realm. GPT‐3's ability to generate sophisticated responses across a variety of tasks and create novel content such as computer code and art showcased a significant leap in AI capabilities.

By late 2022, OpenAI transitioned ChatGPT to GPT‐3.5 and eventually introduced GPT‐4 in March 2023, further enhancing the system's multimodal capabilities and user engagement with a subscription model, ChatGPT Plus. OpenAI's trajectory has been significantly bolstered by robust financial backing, amassing a total of $11.3 billion in funding over 10 rounds until August 2023. Noteworthy is the $13 billion investment from Microsoft, which has provided not only a substantial financial runway but also strategic partnerships in various ventures.

OpenAI operates on a pricing model hinging on cost per request and monthly quotas, providing a straightforward and flexible pricing structure for its users. The pricing varies with the type of model, with distinct models like OpenAI Ada and OpenAI Babbage priced differently for different use cases. The revenue landscape of OpenAI is on an upswing, with projections indicating a surge from $10 million in 2022 to $200 million in 2023, and a staggering $1 billion by 2024.

OpenAI's CEO, Sam Altman, revealed a revenue pace crossing a $1.3 billion annualized rate, demonstrating significant revenue potential with the growing user base and subscription services. The launch of ChatGPT saw a rapid user base expansion, reaching 100 million monthly active users within just two months post‐launch. Moreover, the introduction of a paid subscription service, ChatGPT Plus, didn't deter the growth, indicating a strong user willingness to pay for enhanced services. The substantial user engagement, especially from high‐revenue companies, correlates directly with the rising revenue trajectory.

OpenAI's journey elucidates a nuanced navigation through technological advancements, financial fortification, and a user‐centric operational model. The continual investment in cutting‐edge AI models, coupled with a growing user base and strategic financial backing, underscores OpenAI's substantial impact in the AI domain and its potential for further revenue generation and technological innovation.

Hugging Face: Open‐Source Community Building

Founded in 2016, Hugging Face pioneered an open ecosystem for natural language processing (NLP) based on sharing pre‐trained models. By 2022, its website hosted more than 100,000 daily active users accessing a broad range of AI capabilities. However, the emergence of LLMs—AI systems with billions of parameters—threatened Hugging Face's ability to support user growth economically. This case examines how Hugging Face adapted its platform architecture and operations to scale out and serve massive user demand while keeping costs contained even as model sizes exploded.

In recent years, AI models have grown exponentially larger. For example, OpenAI's GPT‐3 contained 175 billion parameters in 2020. The trend accelerated in 2021 and 2022, with models reaching trillions of parameters. Practically, this vertical scaling to larger and larger models may not be sustainable, so several companies are considering hosting a collection of large models (as opposed to one very large model). These LLMs demonstrated new NLP capabilities but required massive compute resources for training and inference.

For Hugging Face, LLMs presented a dilemma. Users expected access to cutting‐edge models like GPT‐3, but running them required costly cloud computing resources. As a small startup, Hugging Face had limited ability to absorb these costs, especially as user counts approached six figures. Providing LLMs through their existing infrastructure would force Hugging Face to either restrict access, pass costs to users, or operate at a loss. A new approach was needed to economically scale out AI.

Hugging Face's first initiative focused on optimizing its model hosting architecture. In the original setup, models were stored together with code in a monolithic GitHub repository. This might have worked initially but did not allow computational separation of storage and inference. Engineers redesigned the architecture as microservices, splitting storage and compute. Models were moved to scalable cloud object storage like S3, while compute happened in isolated containers on demand. This allowed independently scaling storage and compute to match user demand. Large models could be affordably stored while compute scaled elastically with usage.

Next, Hugging Face optimized inference itself. Out‐of‐the‐box PyTorch and TensorFlow were flexible but slow. So, engineers created optimized model servers that reduced overhead. For example, request batching allowed amortizing costs over multiple inferences. Execution was also streamlined by eliminating excess framework code. Together, these optimizations reduced compute requirements by up to 3x. Additional savings came from aggressively right‐sizing instances. Usage patterns and models were analyzed to select ideal CPU/GPU configurations. The result was inference costs cut by nearly 80% compared to off‐the‐shelf solutions.
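The following is an illustrative sketch of request batching using the Hugging Face transformers library, not Hugging Face's actual serving code: several queued prompts share one forward pass, amortizing fixed per‐call overhead across the batch. GPT‐2 is used here only because it is small and freely available.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token   # GPT-2 defines no pad token
tokenizer.padding_side = "left"             # left-pad for batched generation
model = AutoModelForCausalLM.from_pretrained("gpt2")

# One forward pass serves several queued requests at once, amortizing
# fixed per-call overhead across the whole batch.
prompts = ["The capital of France is", "Water boils at", "2 + 2 ="]
batch = tokenizer(prompts, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model.generate(**batch, max_new_tokens=8,
                         pad_token_id=tokenizer.pad_token_id)
for text in tokenizer.batch_decode(out, skip_special_tokens=True):
    print(text)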

Democratizing Access with Caching

Despite optimizations, LLMs still carried high compute costs. To further reduce expenses, Hugging Face deployed aggressive caching: once a model produced an output for a given input, the result was cached. Subsequent identical requests reused the cached output rather than rerunning inference. Popular models saw cache hit rates above 90%, greatly reducing compute needs. This worked thanks to Hugging Face's scale; similar inputs recurred frequently across the large user base. Caching allowed democratizing access to expensive LLMs that would otherwise be available to only a few users. The cache layer also added monitoring capabilities for usage insights.
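The internals of Hugging Face's cache are not reproduced here, but the basic exact‐match pattern can be sketched in a few lines. The in‐memory dict stands in for a shared store such as Redis, and generate_fn is a placeholder for whatever call actually runs inference.

import hashlib

cache = {}  # stand-in for a shared store such as Redis

def cached_generate(model_id, prompt, generate_fn):
    """Serve repeated identical requests from cache instead of the model."""
    key = hashlib.sha256(f"{model_id}::{prompt}".encode()).hexdigest()
    if key not in cache:
        cache[key] = generate_fn(prompt)  # only cache misses pay for compute
    return cache[key]

# The second identical call returns instantly without rerunning inference.
expensive_model = lambda p: p.upper()  # placeholder for a real LLM call
out1 = cached_generate("my-llm", "Summarize this filing.", expensive_model)
out2 = cached_generate("my-llm", "Summarize this filing.", expensive_model)
assert out1 == out2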

As usage grew, Hugging Face needed further scalability. Its final strategy was pooling community resources via a federated compute network. Users could volunteer spare computing power in return for platform credit. Requests were dynamically routed to volunteer resources based on load, geographic proximity, and costs. This federated architecture achieved almost unlimited scale at low costs by tapping underutilized capacity. Volunteers benefited by earning credits for their own platform usage. The network was unified through a blockchain‐based coordination layer for secure decentralized orchestration. Hugging Face's architectural optimizations and federated model enabled scaling to serve more than 100,000 daily users at just $0.001 inference cost per request. Despite exponential LLM growth, costs remained contained through efficiency gains. Platform contributions also increased as volunteers shared resources in exchange for credits.

This scalable, open‐source oriented approach unlocked AI for the entire community. By innovatively pooling collective capacity, Hugging Face democratized access to capabilities once available only to tech giants. This story provides lessons for sustainably scaling out AI alongside the relentless growth in model size and complexity.

BloombergGPT: LLMs in Large Commercial Institutions

Bloomberg, known worldwide for its financial data and analytics, took a big step by developing its own large language model, BloombergGPT. This was driven by the growing need for better NLP capabilities in finance to help with decision‐making and customer interactions.

Bloomberg's venture into the realm of LLMs represents a forward‐thinking endeavor to harness the potential of AI in financial analytics and services. With an ambitious goal, Bloomberg aimed to develop a model capable of understanding and generating human‐like text, tailored to the financial sector's nuanced needs. The project was not only a technological endeavor but also a strategic move to stay ahead in the highly competitive financial information services arena.

The model, boasting 50 billion parameters, is a testament to Bloomberg's commitment to cutting‐edge innovation. This extensive model size necessitated a significant investment in computational resources. The training phase consumed a staggering 1.3 million hours of GPU time, showcasing the intensive computational demand that large language models entail. Yet, it was a necessary venture to develop a model with a deep understanding of financial lexicon and concepts.

Bloomberg's approach was unique. The company engaged in reinforcement learning from human feedback (RLHF), a method that utilized human feedback to fine‐tune the model iteratively. This approach enabled the model to better understand and generate financial text, improving its performance significantly over several iterations. The in‐house development allowed for a tailored approach, ensuring the model's alignment with Bloomberg's specific requirements in financial analytics and reporting.

The financial commitment to this project was substantial, reflecting Bloomberg's strategic investment in AI as a long‐term asset. While the exact figures remain undisclosed, industry estimates place the development of such models in the range of tens to hundreds of millions of dollars. The investment extends beyond the model itself to a robust infrastructure capable of supporting the model's computational demands and the talent required to develop and maintain such a sophisticated AI system.

The ability to provide insightful financial analytics and generate human‐like text proved to be a valuable asset, offering a competitive advantage in the fast‐paced financial services sector. Several months after the publication of the model, no other organization of the same scale has publicly announced a competitive foundational model for finance. The model's success also demonstrates the significant potential and value that large language models hold in specialized domains.

As of this writing, Bloomberg plans to commercialize this technology by integrating it into its existing suite of financial analytics tools. The model will power new features, providing more in‐depth insights and analytics to Bloomberg's clientele. Additionally, the model serves as a foundation for future internal and external‐facing AI projects, showcasing the company's capability and commitment to leveraging AI for better financial analysis and decision‐making.

The BloombergGPT project underscores the substantial financial and computational investments required to develop specialized large language models. It also illustrates the strategic importance of AI in the financial sector, not only as a tool for better analytics but as a competitive differentiator in a market where timely and accurate information is crucial.

WHO IS THIS BOOK FOR?

This book was crafted with a broad spectrum of readers in mind, encompassing a range of individuals who are either enthralled by the promise of GenAI or actively engaged in its exploration and application. Whether you are a budding enthusiast, a citizen data scientist, a seasoned researcher, a rockstar engineer, or a visionary decision‐maker, this book has insights that can help you along the pathway to cost‐effective GenAI applications.

AI practitioners:

For those immersed in the day‐to‐day endeavor of building, tuning, and deploying AI models, this book offers a collection of strategies and techniques for cost optimization, helping to maximize the value and impact of your work while minimizing expenditure.

Researchers:

Academics and researchers delving into the frontiers of GenAI and large language models will find a structured discourse on the economic aspects that underpin the practical deployment of research findings. This book aims to bridge the chasm between academic exploration and real‐world application, shedding light on cost‐effectiveness as a critical vector.

Engineers:

Engineers standing at the confluence of software, hardware, and AI will discover a wealth of knowledge on how to architect, implement, and optimize systems for cost efficiency while harnessing the potential of large language models.

Educators and students:

Educators aiming to equip students with a holistic understanding of GenAI will find this book a valuable resource. Similarly, students aspiring to delve into this exciting domain will garner a pragmatic understanding of the cost dynamics involved.

Tech enthusiasts:

If you are captivated by the unfolding narrative of AI and its potential to shape the future, this book offers a lens through which you can appreciate the economic dimensions that are integral to making this promise a reality.

Policy makers:

Those engaged in shaping the policy framework around AI and data utilization will find insightful discussions on the cost considerations that are imperative for fostering a sustainable and inclusive AI ecosystem.

Decision‐makers:

For decision‐makers steering the strategic direction of organizations, this book provides a lucid understanding of the economic landscape of GenAI applications. It elucidates the cost implications, risks, and opportunities that accompany the journey toward leveraging GenAI for business advantage.

In essence, this book caters to a large and diverse readership, aiming to engender a nuanced understanding of cost optimization in the realm of GenAI and large language models. Through a blend of technical exposition, real‐world case studies, and strategic insights, it seeks to foster an informed dialogue and pragmatic action toward cost‐effective and responsible AI deployment.

SUMMARY

This chapter introduced the world of GenAI and LLMs and highlighted the importance of cost optimization. It presented three micro case studies to help you further understand what it takes for even large, well‐funded organizations to achieve scale while controlling costs.

1 Introduction

WHAT'S IN THIS CHAPTER?

Overview of GenAI Applications and Large Language Models

Paths to Productionizing GenAI Applications

The Importance of Cost Optimization

OVERVIEW OF GenAI APPLICATIONS AND LARGE LANGUAGE MODELS

In this section, we introduce GenAI applications and large language models.

The Rise of Large Language Models

Large language models (LLMs) have become a cornerstone of artificial intelligence (AI) research and applications, transforming the way we interact with technology and enabling breakthroughs in natural language processing (NLP). These models have evolved rapidly, with their origins dating back to the 1950s and 1960s, when researchers at IBM and Georgetown University developed a system to automatically translate a collection of phrases from Russian to English. The early pioneers were optimistic that human‐level intelligence would soon be within reach. However, building thinking machines akin to the human mind proved more challenging than anticipated. In the initial decades, research in AI was focused on symbolic reasoning and logic‐based systems. But these early AI systems were quite brittle and limited in their capabilities. They struggled with commonsense knowledge and making inferences in the real world.

By the 1980s, AI researchers realized that rule‐based programming alone could not replicate the versatility and robustness of human intelligence. This led to the emergence of machine learning techniques, where algorithms are trained on large amounts of data to pick up statistical patterns. Instead of hard‐coding complex rules, the key idea was to have systems automatically learn from experience and improve their performance. Machine learning enabled progress in specialized domains such as computer vision and speech recognition. But the overarching goal of achieving artificial general intelligence remained distant.

The limitations of earlier approaches led scientists to look at AI through a new lens. Rather than explicit programming, perhaps deep learning neural networks could be the answer. Neural networks are computing systems inspired by the biological neural networks in the human brain. They consist of layers of interconnected nodes that transmit signals between input and output. By training on huge amounts of data, these multilayered networks could potentially learn representations and patterns too complex for humans to hard‐code using rules.

NOTE

Language is a complex and intricate system of human expressions governed by grammatical rules. It therefore poses a significant challenge to develop capable AI algorithms for comprehending and grasping a language. Language modeling is one of the major approaches to advancing machine language intelligence. In general, language modeling aims to model the generative likelihood of word sequences, so as to predict the probabilities of future (or missing) tokens. Language modeling research has received extensive attention in the literature and can be divided into four major development stages: statistical language models (SLMs), neural language models (NLMs), pre‐trained language models (PLMs), and large language models (LLMs).

In the 2010s, deep learning finally enabled a breakthrough in AI capabilities. With sufficient data and computing power, deep neural networks achieved remarkable accuracy in perception tasks such as image classification and speech recognition. However, these systems were narrow in scope, focused on pattern recognition in specific domains. Another challenge was that they required massive labeled datasets for supervised training. Obtaining such rich annotation at scale for complex cognitive tasks proved infeasible.

This is where self‐supervised generative modeling opened new possibilities. By training massive neural network models to generate representations from unlabeled data itself, systems could learn powerful feature representations. Self‐supervised learning could scale more easily by utilizing the abundant digital data available on the Internet and elsewhere. Language modeling emerged as a promising approach, where neural networks are trained to predict the next word in a sequence of text.
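As a minimal sketch of this objective, the following uses the Hugging Face transformers library with GPT‐2 (chosen here only because it is small and freely available) to inspect the probabilities a causal language model assigns to the next token:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# A causal language model assigns a probability to every candidate
# next token given the text so far.
inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits            # (batch, seq_len, vocab)
next_token_probs = logits[0, -1].softmax(dim=-1)

top = torch.topk(next_token_probs, k=5)
for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(token_id)):>10s}  p={float(prob):.3f}")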

Neural Networks, Transformers, and Beyond

Language modeling has been studied for decades using statistical methods like n‐gram models. But neural network architectures were found to be much more effective, leading to the field of neural language modeling. Word vectors trained with language modeling formed useful representations that could be leveraged for various natural language processing tasks.

Around 2013, an unsupervised learning approach called word2vec became popular. It allowed efficiently training shallow neural networks to generate word embeddings from unlabeled text data. The word2vec embeddings were useful for downstream NLP tasks when used as input features. This demonstrated the power of pre‐training word representations on large textual data.
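A minimal word2vec sketch with the gensim library looks like the following; the three‐sentence corpus is a toy stand‐in for the large unlabeled text collections used in practice.

from gensim.models import Word2Vec

# Toy corpus: each sentence is a list of tokens. Real training uses
# billions of tokens of unlabeled text.
sentences = [
    ["the", "bank", "approved", "the", "loan"],
    ["the", "bank", "raised", "interest", "rates"],
    ["she", "sat", "by", "the", "river", "bank"],
]

# Skip-gram word2vec (sg=1): a shallow network trained to predict
# context words, producing dense embeddings as a side effect.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

vec = model.wv["bank"]                         # 50-dimensional embedding
print(vec.shape, model.wv.most_similar("bank", topn=2))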

The next major development was the proposal of ELMo by Allen Institute researchers in 2018. ELMo introduced deep contextualized word representations using pre‐trained bidirectional long short‐term memory (LSTM). The internal states of the bidirectional LSTM (BiLSTM) over a sentence were used as powerful context‐based word embeddings. ELMo embeddings led to big performance gains in question answering and other language understanding tasks.

Later in 2018, Google AI proposed the revolutionary Bidirectional Encoder Representations from Transformers (BERT) model. BERT is a novel self‐attention neural architecture. BERT introduced a new pre‐training approach called masked language modeling on unlabeled text. The pre‐trained BERT model achieved huge performance gains across diverse NLP tasks by merely fine‐tuning on task datasets.
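Masked language modeling is easy to see in action with the transformers fill‐mask pipeline; bert‐base‐uncased is used here as a representative checkpoint, not a recommendation.

from transformers import pipeline

# Masked language modeling: hide a token and ask the model to recover
# it from bidirectional context, the same task used in BERT pre-training.
fill = pipeline("fill-mask", model="bert-base-uncased")

for candidate in fill("The goal of pre-training is to learn good [MASK]."):
    print(f"{candidate['token_str']:>12s}  score={candidate['score']:.3f}")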

The immense success of BERT established the “pre‐train and fine‐tune” paradigm in NLP. Many more transformer‐based pre‐trained language models were proposed after BERT, such as XLNet, RoBERTa, T5, etc. Scaling model size as well as unsupervised pre‐training strategies yielded better transfer learning performance on downstream tasks.

However, model sizes were still limited to hundreds of millions of parameters in most cases. In 2020, OpenAI proposed GPT‐3, which scaled up model parameters to an unprecedented 175 billion! GPT‐3 demonstrated zero‐shot, few‐shot learning capabilities never observed before, stunning the AI community. Without any gradient updates or fine‐tuning, GPT‐3 could perform NLP tasks from just task descriptions and a few examples. As such, GPT‐3 highlighted the power of scale in language models. Its surprising effectiveness motivated intense research interest in training even larger models. This led to the exploration of LLMs with model parameters in the trillion+ range. Startups such as Anthropic and public efforts such as PaLM, Gopher, and LLaMA pushed model scale drastically with significant investments in the space. Several tech companies and startups are now using (and training their own) LLMs with hundreds of billions or even a trillion plus parameters. Models like PaLM, Flan, LaMDA, and LLaMA have demonstrated the scalability of language modeling objectives using the transformer architecture. At the time of this writing, Anthropic has developed Claude, the first LLM to be openly released with conversational abilities rivaling GPT‐3.

You can see that all the models mentioned are related, much like the Tree of Life. In other words, anatomical similarities and differences in a phylogenetic tree are similar to the architectural similarities found in language models. For example, Figure 1.1 shows the evolutionary tree of LLMs and highlights some of the most popular models used in production so far. The models that belong to the same branch are more closely related, and the vertical position of each model on the timeline indicates when it was released. The transformer models are represented by colors other than gray: decoder‐only models like GPT, OPT and their derivatives, encoder‐only models like BERT, and the encoder‐decoder models T5 and Switch are shown in separate main branches. As mentioned earlier, models have successively “grown” larger. Interestingly, this is visually and objectively similar to the evolution of intelligent species, as shown in Figure 1.2. A deeper comparison is out of the scope of this book, but for more information on either of these evolutionary trees, refer to the links in the captions.

FIGURE 1.1: Evolutionary tree of language models (see Rice University / https://arxiv.org/pdf/2304.13712.pdf / last accessed December 12, 2023.)

FIGURE 1.2: Evolutionary tree of human brain structure (see André M.M. Sousa et al., 2017/ with permission of Elsevier.)

Increasing the model size, compute, and data seems to unlock new abilities in LLMs, which exhibit impressive performance on question answering, reasoning, and text generation with simple prompting techniques. By training LLMs to generate code, models such as AlphaCode and Codex display proficient coding skills. LLMs can chat, translate, summarize, and even write mathematical proofs aided by suitable prompting strategies.

The key shift from PLMs to LLMs is that scale seems to bring about qualitative transitions beyond just incremental improvements. LLMs display certain emergent capabilities such as few‐shot learning, chain of reasoning, and instruction following not observed in smaller models. These abilities emerge suddenly once model scale crosses a sufficient threshold, defying smooth scaling trends.

LLMs entail a paradigm shift in AI from narrowly specialized systems to versatile, general‐purpose models. Leading experts feel recent LLMs display signs of approaching human‐level artificial general intelligence. From statistical to neural networks, the steady progress in language modeling scaled up by orders of magnitude has been the missing link enabling this rapid advancement toward more human‐like flexible intelligence. The astounding capabilities of GPT‐3 highlighted the power of scale in language models. This has led to intense research interest in developing even larger LLMs with model parameters in the trillion range. The assumption is that bigger is better when it comes to language AI. Scaling model size along with compute and data seems to unlock new abilities and performance improvements.

The largest LLMs have shown the ability to perform human‐level question answering and reasoning in many domains without any fine‐tuning. With proper prompting techniques like chain of thought, they can solve complex arithmetic, logical, and symbolic reasoning problems. LLMs can intelligently manipulate symbols, numbers, concepts, and perform multistep inferences when presented with the right examples.
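A chain‐of‐thought prompt can be as simple as showing one worked example before the real question. The prompt below is purely illustrative, and the commented call is a hypothetical client interface, not a specific vendor API.

# An illustrative chain-of-thought prompt: the worked example shows the
# model the intermediate reasoning steps we want it to imitate.
prompt = """Q: A cafe sells coffee for $4 and tea for $3. Ana buys 2 coffees
and 3 teas. How much does she spend?
A: 2 coffees cost 2 * $4 = $8. 3 teas cost 3 * $3 = $9. Total is
$8 + $9 = $17. The answer is 17.

Q: A shop sells pens for $2 and pads for $5. Raj buys 4 pens and 2 pads.
How much does he spend?
A:"""

# response = llm.generate(prompt)  # hypothetical client call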

But of course, language generation is the main area where LLMs' capabilities have taken a huge leap. LLMs can generate fluent, coherent, and human‐like text spanning news articles, poetry, dialogue, code, mathematical proofs, and more. The creativity and versatility displayed in conditional and unconditional text generation are remarkable. Few‐shot prompting allows controlling attributes such as length, style, and content. Text‐to‐image generation has also made rapid progress leveraging LLMs. The exponential growth in model parameters has been matched by growth in computing power and dataset availability. Modern GPU clusters, the emergence of model parallelism techniques, and optimized software libraries have enabled training LLMs with trillions of parameters. Massive text corpora for pre‐training are sourced from the Internet and digitization initiatives.

All this has fueled tremendous excitement and optimism about the future of AI. LLMs display a form of algorithmic and statistical intelligence to solve many problems automatically given the right data. Leading AI experts believe rapid recent progress is bringing us closer to artificial general intelligence than before. Large language models may be the missing piece that enables machines to learn concepts, infer chains of reasoning, and solve problems by formulating algorithms like humans.

LLMs still have major limitations. They are expensive and difficult to put into production, prone to hallucination, lack common sense, and struggle with complex symbolic reasoning. Model capabilities are also severely constrained by the training data distribution. LLMs can propagate harmful biases, generate toxic outputs, and be manipulated in dangerous ways. There are rising concerns around AI ethics, governance, and risks that merit careful consideration. Responsible development of AI aligned with human values is necessary. However, we already see several generative AI (GenAI) applications with these LLMs at their core! GenAI heralds a paradigm shift from narrow analytical intelligence toward creative and versatile systems. GenAI applications powered by models such as GPT‐3, PaLM, and Claude are displaying remarkable abilities previously thought impossible for machines.

GenAI vs. LLMs: What's the Difference?

While both GenAI and LLMs deal with generating content, their scopes and applications differ. GenAI is a broader term that encompasses AI systems capable of creating various types of content, such as text, images, videos, and other media. LLMs, on the other hand, are a specific class of deep learning models designed to process and understand natural language data, and they are used as a core component in GenAI applications to generate human‐like text. While LLMs are responsible for understanding and generating human‐like text, GenAI applications utilize these capabilities to create more comprehensive and interactive experiences for users.

GenAI applications are full end‐to‐end applications that could involve LLMs as their core. For example, ChatGPT is a GenAI application with GPT‐3.5 and GPT‐4 at its core. Putting LLMs into production as GenAI applications requires overcoming several challenges, including aligning LLMs with human values and preferences, training LLMs despite their huge model size, adapting LLMs for specific downstream tasks, and evaluating the abilities of LLMs. Despite these challenges, LLMs have the potential to revolutionize the way we develop and use AI algorithms, and they are poised to have a significant impact on the AI community and society in general.

LLMs have enabled remarkable advances in GenAI applications in recent years. By learning from vast amounts of text data, LLMs like GPT‐3 and PaLM can generate highly fluent and coherent language. This capability has been harnessed to power a diverse range of GenAI applications that were previously infeasible. Let's discuss some popular GenAI applications here:

Conversational agents and chatbots:

One of the most popular applications of LLMs is conversational agents and chatbots. Systems like Anthropic's Claude and Google's LaMDA leverage the language generation skills of LLMs to conduct natural conversations. They can answer questions, offer advice, and discuss open‐ended topics through multiturn dialogue. The conversational abilities of these agents derive from the pre‐training of LLMs on massive dialogue corpora. Fine‐tuning further adapts them for smooth and consistent conversations.

Code completion and programming assistants:

LLMs have proven adept at code generation and completion. Tools such as GitHub Copilot and TabNine auto‐complete code based on natural language comments and existing context. This assists programmers by reducing boilerplate and speeding up development. LLMs can also generate entire code snippets or functions given high‐level descriptions. Their training on large code corpora enables robust translation of natural language to programming languages. Beyond autocompletion, LLMs could serve as AI pair programmers that suggest improvements to code.

Language translation:

Machine translation has benefited enormously from LLMs. Models such as Google's Translation LM achieve state‐of‐the‐art results by learning representations from massive text corpora. They can translate between languages with higher accuracy and more contextual fidelity than previous phrase‐based translation systems. LLMs retain knowledge about linguistics, grammar, and semantics that improves translation quality. Their training methodology also facilitates zero‐shot translation between multiple languages.

Text summarization and generation:

LLMs are unmatched at summarizing lengthy text into concise overviews. They distill key points while preserving semantic consistency. Applications built using LLMs can summarize emails, articles, legal documents, and other sources as a text companion. Conditional generation allows summarization to be tailored for different lengths, styles, or perspectives. Text generation applications powered by LLMs can produce original long‐form content such as stories, poems, and articles.
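As a brief illustration of summarization, the transformers summarization pipeline can be used as follows; the checkpoint name is one common public choice, not an endorsement.

from transformers import pipeline

# Abstractive summarization with an off-the-shelf model.
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

document = (
    "Large language models are trained on vast text corpora and can "
    "generate fluent text. Serving them at scale is expensive, so teams "
    "apply caching, batching, and model compression to control costs."
)
print(summarizer(document, max_length=40, min_length=10)[0]["summary_text"])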

Even with this broad and deep set of capabilities and active research, companies around the world today may have concerns about the following aspects of productionizing GenAI‐based applications:

Access to high‐performing LLMs is limited, costly, and not straightforward; commercial models are black boxes behind a paywall, and open‐source models are mostly licensed for academic and research use, not for commercial applications.

Building advanced applications that seamlessly integrate with existing databases and analytical applications is not trivial.

There is no transparent architecture that ensures privacy and confidentiality of internal customer data while fine‐tuning or using foundational models through an API. Questions around copyrights, trust, and safety are important and far from fully solved today.

In the next section, we begin to dive deeper into these concepts, starting with a framework to think of how a typical GenAI application is built today.

The Three‐Layer GenAI Application Stack

The rapid advancements in large language models like GPT‐3, Claude, and PaLM have enabled new generative AI applications that can produce high‐quality text, code, images, and more. However, developing and deploying these GenAI applications requires bringing together diverse components into an integrated technology stack. We will discuss the three main layers of a GenAI application stack—the infrastructure layer, the model layer, and the application layer—and delve into the details of each.

The Infrastructure Layer

The infrastructure layer provides the foundational data, compute, and tooling resources needed to develop, train, and serve LLMs.

Data storage and management:

LLMs require massive datasets, often petabytes in size, to train the models. Many public datasets like Common Crawl and Wikipedia provide broad coverage for pre‐training foundation models. Custom datasets tailored to specific domains or applications are also curated for fine‐tuning. This training data needs to be stored, managed, and accessed efficiently. Distributed object storage systems like Amazon S3, Azure Blob Storage, and Google Cloud Storage allow storing the vast datasets affordably. Data lakes built on cloud storage act as centralized repositories to gather, clean, and process heterogeneous data from diverse sources.

Metadata catalogs like AWS Glue, Azure Purview, and Google Data Catalog maintain schemas, trace data lineage, and enable discovery. Versioning capabilities track data changes over time. Data quality tools monitor and profile datasets. Overall, a robust data management platform is essential for LLMs to benefit from high‐quality, well‐organized training data.

Vector databases:

LLMs rely on representing words and documents as dense numeric vectors that encode semantic meaning. Vector databases like Pinecone, Weaviate, and Milvus specialize in storing and indexing billions of vectors to enable efficient similarity search and retrieval.

They allow embedding large text corpora into vector spaces where semantic relationships are captured through distance metrics such as cosine similarity. This powers capabilities such as semantic search that go beyond keyword matching. Using vector databases decouples storing embeddings from model training and serving, enabling shared knowledge representation across applications.
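The distance metric at the heart of this retrieval is simple to state. Here is a minimal cosine‐similarity sketch with toy four‐dimensional vectors; real embeddings have hundreds of dimensions and come from an embedding model.

import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "embeddings" for a query and two documents.
query = np.array([0.9, 0.1, 0.0, 0.2])
doc_a = np.array([0.8, 0.2, 0.1, 0.1])   # semantically close to the query
doc_b = np.array([0.0, 0.1, 0.9, 0.7])   # semantically distant

print(cosine_similarity(query, doc_a))   # high, near 1.0
print(cosine_similarity(query, doc_b))   # low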

Compute infrastructure:

Training and running LLMs place intense computational demands, requiring access to extensive GPU/TPU resources. For example, training GPT‐3 took 3,640 petaflop/s‐days on more than 10,000 GPUs.

Cloud infrastructure providers such as AWS, Azure, and GCP offer GPU/TPU‐optimized virtual machine instances. For example, AWS provides P4d instances with up to eight GPUs for machine learning workloads. GCP's Cloud TPU VMs give direct access to TPU chips tailored for ML.

Autoscaling groups dynamically match resources like GPUs to fluctuating training and inference demands. Orchestrators like Kubernetes facilitate deploying distributed LLMs at scale. The compute fabric should provide high‐throughput, low‐latency networking to support parallel model training across GPU clusters.

The Model Layer

The model layer is the heart of a GenAI application stack, where the choice and adaptation of a large language model are crucial. This layer plays a pivotal role in determining the capabilities, efficiency, and effectiveness of the generative AI application.

Choosing the right LLM:

When selecting an LLM as the foundation for your application, several key factors must be considered.

Model capability:

The first consideration is the model's inherent capability. Different LLMs may excel in various tasks. For instance, models like Google's BERT are renowned for their understanding of contextual information, while autoregressive models like OpenAI's GPT series are exceptional at generating coherent text.

Computational efficiency:

The computational resources required to train and deploy the chosen LLM are essential. Some models may be more resource‐intensive than others, which can impact the scalability and cost of your application.

Commercial availability:

The availability of the model can be crucial. Many tech companies have open‐sourced LLMs, allowing developers to use them freely. Alternatively, cloud providers offer LLMs via APIs, making them accessible but often with associated costs.

Problem domain suitability:

Consider the specific problem domain your application targets. Certain LLMs may be better suited for tasks such as text summarization (e.g., T5), while others shine in creative text generation (e.g., GPT).

Fine‐tuning the LLM:

Once you've selected an LLM as your foundation, fine‐tuning becomes essential to adapt it to your application's unique requirements. Fine‐tuning techniques, often based on transfer learning, are employed to achieve this adaptation. Several considerations apply:

Targeted datasets:

Fine‐tuning typically involves training the LLM on smaller, domain‐specific datasets. This helps the model become more proficient in specialized tasks while preserving its general intelligence.

Enhancing capabilities:

By fine‐tuning, you can improve the LLM's performance in specific areas, such as dialogue systems, reasoning, or knowledge retrieval. Anthropic's Claude, for instance, undergoes fine‐tuning to enhance its dialogue capabilities without compromising its overall intelligence.

Avoiding catastrophic forgetting:

Care must be taken during fine‐tuning to avoid “catastrophic forgetting,” where the model loses previously learned knowledge. Techniques such as gradient clipping and selective fine‐tuning of specific layers are used to mitigate this risk. (A minimal layer‐freezing sketch follows.)
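As referenced above, a minimal layer‐freezing sketch with transformers might look like the following; the checkpoint and the choice to unfreeze only the top two encoder layers are illustrative assumptions.

from transformers import AutoModelForSequenceClassification

# The checkpoint and layer choices here are illustrative assumptions.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# Freeze everything first, then unfreeze only the top encoder layers and
# the task head, leaving earlier learned knowledge untouched.
for param in model.parameters():
    param.requires_grad = False
for layer in model.bert.encoder.layer[-2:]:   # last 2 of 12 encoder layers
    for param in layer.parameters():
        param.requires_grad = True
for param in model.classifier.parameters():
    param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable:,}")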

Integrating the LLM:

The integration of the selected and fine‐tuned LLM into the application is a critical step.

Data transformation:

Application code needs to convert input data, such as text or other types of data, into formats suitable for the LLM, typically sequences of token IDs that the model then maps to embeddings. This transformation is essential to ensure that the model can process the data effectively.
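A minimal sketch of this step, using the GPT‐2 tokenizer from transformers as an assumed example:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "Cost-effective GenAI"
token_ids = tokenizer.encode(text)             # integers the model consumes
tokens = tokenizer.convert_ids_to_tokens(token_ids)

print(tokens)      # subword pieces, e.g. ['Cost', '-', 'effective', ...]
print(token_ids)   # the numeric IDs fed into the embedding layer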

Model architecture:

The architecture of the LLM plays a significant role in how it processes input data. Factors like how self‐attention is applied across input tokens determine the model's ability to capture long‐range dependencies in the data.

Scaling innovations:

Recent advancements enable the efficient scaling of LLMs. Techniques such as sparsely gated mixture‐of‐experts partition model layers into multiple expert groups, enhancing scalability. This technique uses a trainable gating portion of the neural network to conditionally route each input through only some experts, based on the actual contents of the input itself. For more information about this interesting advancement, refer to https://arxiv.org/pdf/1701.06538.pdf.

Additionally, attention mechanisms such as BigBird and Reformer reduce self‐attention complexity, enabling long sequences to be handled efficiently.
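To make the mixture‐of‐experts idea concrete, here is a heavily simplified PyTorch sketch of top‐1 token routing. Production MoE layers add load‐balancing losses and distributed expert placement, which are omitted here.

import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    """Minimal sparsely gated mixture-of-experts layer with top-1 routing."""

    def __init__(self, d_model=64, n_experts=4):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts)   # trainable router
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                        # x: (tokens, d_model)
        scores = self.gate(x).softmax(dim=-1)    # routing probabilities
        top_p, top_idx = scores.max(dim=-1)      # choose one expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = top_idx == i                  # tokens routed to expert i
            if mask.any():
                out[mask] = top_p[mask].unsqueeze(-1) * expert(x[mask])
        return out

moe = TinyMoE()
print(moe(torch.randn(10, 64)).shape)  # torch.Size([10, 64])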

NOTE