Keeping up with the AI revolution and its application in coding can be challenging, but with guidance from AI and ML expert Dr. Vincent Hall—who holds a PhD in machine learning and has extensive experience in licensed software development—this book helps both new and experienced coders to quickly adopt best practices and stay relevant in the field.
You’ll learn how to use LLMs such as ChatGPT and Gemini to produce efficient, explainable, and shareable code and discover techniques to maximize the potential of LLMs. The book focuses on integrated development environments (IDEs) and provides tips to avoid pitfalls, such as bias and unexplainable code, to accelerate your coding speed. You’ll master advanced coding applications with LLMs, including refactoring, debugging, and optimization, while examining ethical considerations, biases, and legal implications. You’ll also use cutting-edge tools for code generation, architecting, description, and testing to avoid legal hassles while advancing your career.
By the end of this book, you’ll be well-prepared for future innovations in AI-driven software development, with the ability to anticipate emerging LLM technologies and generate ideas that shape the future of development.
Coding with ChatGPT and Other LLMs
Navigate LLMs for effective coding, debugging, and AI-driven development
Dr. Vincent Austin Hall
Copyright © 2024 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
The author acknowledges the use of cutting-edge AI, such as ChatGPT, with the sole aim of enhancing the language and clarity within the book, thereby ensuring a smooth reading experience for readers. It’s important to note that the content itself has been crafted by the author and edited by a professional publishing team.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Group Product Manager: Niranjan Naikwadi
Publishing Product Manager: Nitin Nainani
Book Project Manager: Aparna Nair
Senior Editor: Joseph Sunil
Technical Editor: Rahul Limbachiya
Copy Editor: Safis Editing
Proofreader: Joseph Sunil
Indexer: Manju Arasan
Production Designer: Joshua Misquitta
Senior DevRel Marketing Executive: Vinishka Kalra
First published: November 2024
Production reference: 1061124
Published by Packt Publishing Ltd.
Grosvenor House
11 St Paul’s Square
Birmingham
B3 1RB, UK.
ISBN 978-1-80512-505-1
www.packtpub.com
Dr. Vincent Austin Hall is a computer science lecturer at Birmingham Newman University and CEO of Build Intellect Ltd, an AI consultancy. Build Intellect works closely with ABT News LTD, based in Reading, England. He holds a physics degree from the University of Leeds, an MSc in biology, chemistry, maths, and coding from Warwick, and a PhD in machine learning and chemistry, also from Warwick, where he developed licensed software for pharma applications. With experience in tech firms and academia, he’s worked on ML projects in the automotive and medtech sectors. He supervises dissertations at the University of Exeter, consults on AI strategies, coaches students and professionals, and shares insights through blogs and YouTube content.
I would like to thank my supportive and patient family: my excellent and wise partner Anna, our brilliant, different, and loving son Peter and our brilliant, inventive, and hilarious daughter Lara, for allowing me time to work on this book over many weekends and evenings and understanding that good things take long, hard work, and many iterations.
Thank you to Packt Publishing: Editor Joseph Sunil for making only good suggestions and improving my work; Book Project Manager, Aparna Nair for keeping the project progressing well and making sure everything got done; Publishing Product Manager, Nitin Nainani for managing and further direction; Priyanshi J for bringing me on board and suggesting this book in the first place; as well as the technical reviewers for helping Joseph and me to keep the book quality high.
Thanks to my business partner, Chief Chigbo Uzokwelu, CEO of ABT News Ltd, for lots of support in friendship and business: legal, sales, business communications, proofreading, and marketing.
Thanks to the reader for reading and learning, sharing what you've learned and helping others to upskill and create the best code, careers and solutions for Earth (and future populated worlds).
Parth Santpurkar is a senior software engineer with over a decade of industry experience, based in the San Francisco Bay Area. He’s a senior IEEE member, and his expertise and interests range from software engineering and distributed systems to machine learning and artificial intelligence.
Sougata Pal is a passionate technology specialist who performs the role of an enterprise architect, covering software architecture design, application scalability management, and team building and management. With over 15 years of experience, they have worked with different start-ups and large-scale enterprises to develop their business application infrastructure, enhancing their reach to customers. They have contributed to different open source projects on GitHub to empower the open source community. For the last couple of years, they have been experimenting with federated learning and cybersecurity algorithms to enhance the performance of cybersecurity processes by introducing concepts of federated learning.
This section lays the groundwork for understanding Large Language Models (LLMs) and their transformative potential across various fields. It introduces LLMs like ChatGPT, explaining how they work. We will explore different ways that LLMs are applied across industries, from customer service to content generation, and check out the unique capabilities of LLMs in software development.
This section covers the following chapters:
Chapter 1, What is ChatGPT and what are LLMs?
Chapter 2, Unleashing the Power of LLMs for Coding: A Paradigm Shift
Chapter 3, Code Refactoring, Debugging, and Optimization: A Practical Guide

The world has been strongly influenced by the recent advancements in AI, especially large language models (LLMs) such as ChatGPT and Gemini (formerly Bard). We’ve witnessed stories such as OpenAI reaching one million users in five days, huge tech company lay-offs, history-revising image scandals, more tech companies getting multi-trillion-dollar valuations (Microsoft and NVIDIA), a call for funding of $5–7 trillion for the next stage of technology, and talks of revolutions in how everything is done!
Yes, these are all because of new AI technologies, especially LLM tech.
LLMs are large in multiple ways: not just large training sets and large training costs but also large impacts on the world!
This book is about harnessing that power effectively, for your benefit, if you are a coder.
Coding has changed, and we must all keep up or else our skills will become redundant or outdated. In this book are tools needed by coders to quickly generate code and do it well, to comment, debug, document, and stay ethical and on the right side of the law.
If you’re a programmer or coder, this is for you. Software, especially AI/machine learning, is changing everything at ever-accelerating rates, so you’ll have to learn this stuff quickly, and then use it to create and understand future technologies.
I don’t want to delay you any longer, so let’s get into the first chapter.
In this chapter, we’ll cover some basics of ChatGPT, Gemini, and other LLMs, where they come from, who develops them, and what the architectures entail. We’ll introduce some organizations that use LLMs and their services. We’ll also briefly touch on some mathematics that go into LLMs. Lastly, we’ll check out some of the competition and applications of LLMs in the field.
This chapter covers the following topics:
Introduction to LLMs
Origins of LLMs
Early LLMs
Exploring modern LLMs
How transformers work
Applications of LLMs

ChatGPT is an LLM. LLMs can be used to answer questions and generate emails, marketing materials, blogs, video scripts, code, and even books that look a lot like they’ve been written by humans. However, you probably want to know about the technology.
Let’s start with what an LLM is.
LLMs are deep learning models, specifically, transformer networks or just “transformers.” Transformers certainly have transformed our culture!
An LLM is trained on huge amounts of text data, petabytes (thousands of terabytes) of data, and learns to predict the next word or words. Because of the way LLMs operate, they are not perfect at outputting text; they can state falsehoods with confidence, facts that are “hallucinated.”
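To make that next-word objective concrete, here is a minimal sketch using the Hugging Face transformers library and GPT-2, a small, openly downloadable ancestor of ChatGPT (the prompt and model choice are purely illustrative):

```python
# Minimal next-token prediction sketch with GPT-2 (illustrative only).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The capital of France is", return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (batch, sequence, vocabulary)

# The scores at the last position rank every vocabulary token as a
# candidate next word; the top-scoring token is the model's prediction.
next_token_id = int(logits[0, -1].argmax())
print(tokenizer.decode([next_token_id]))  # likely " Paris"
```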
ChatGPT is, as of the time of writing, the most popular and famous LLM, created and managed by OpenAI. OpenAI is a non-profit that controls a capped-profit subsidiary and is based in San Francisco [OpenAI_LP, OpenAIStructure].
ChatGPT is now widely used for multiple purposes by a huge number of people around the world. Of course, there’s GPT-4 and now GPT-4 Turbo, which are paid, more powerful, and do more things, as well as taking more text in prompts.
It’s called ChatGPT: Chat because that’s what you do with it, it’s a chatbot, and GPT is the technology and stands for generative pre-trained transformer. We will get more into that in the GPT lineage subsection.
A transformer is a type of neural network architecture, and it is the basis of the most successful LLMs today (2024). GPT is a Generative Pre-trained Transformer, and Gemini is also a transformer [ChatGPT, Gemini, Menon, HuggingFace]. OpenAI’s GPT-4 is a remarkable advancement in the field of AI. This model, the fourth iteration of the GPT series, introduced a new feature: the ability to understand images alongside text. This is a significant leap from its predecessors, which were primarily text-based models.
OpenAI also has an image generation AI, DALL-E, and an AI that can connect images and text and does image recognition, called CLIP (OpenAI_CLIP). The image generation capability of DALL-E is achieved by training the transformer model on image data. This means that the model has been exposed to a vast array of images during its training phase, enabling it to understand and generate visual content [OpenAI_DALL.E].
Furthermore, since images can be sequenced to form videos, DALL.E can also be considered a video generator. This opens up a plethora of possibilities for content creation, ranging from static images to dynamic videos. It’s a testament to the versatility and power of transformer models, and a glimpse into the future of AI capabilities.
In essence, tools from OpenAI are not just text generators but a comprehensive suite of content generators, capable of producing a diverse range of outputs. It’s called being multi-modal. This makes these tools invaluable in numerous applications, from content creation and graphic design to research and development. The evolution from GPT-3 to GPT-4 signifies a major milestone in AI development, pushing the boundaries of what AI models can achieve.
Earlier neural networks, with their ability to read sentences and predict the next word, could only read one word at a time; these were called recurrent neural networks (RNNs). RNNs attempted to mimic human-like sequential processing of words and sentences but faced challenges in handling long-term dependencies between words and sentences due to very limited memory capacity.
In 1925, the groundwork was laid by Wilhelm Lenz and Ernst Ising with their non-learning Ising model, considered an early RNN architecture [Brush, Gemini].
In 1972, Shun’ichi Amari made this architecture adaptive, paving the way for learning RNNs. This work was later popularized by John Hopfield in 1982 [Amari, Gemini].
Due to this, there has been a fair amount of research to find ways to stretch this memory to include more text and get more context. RNNs are not transformers, though. There are other recurrent architectures, including LSTMs, or long short-term memory neural networks, which are a more advanced version of RNNs, but we won’t go into that here [Brownlee_LLMs, Gemini]. LSTMs were invented by Hochreiter and Schmidhuber in 1997 [Wiki_LSTM, Hochreiter1997].
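To give a feel for that recurrent memory, here is a minimal PyTorch sketch of an LSTM; all sizes are arbitrary, illustrative choices:

```python
# Minimal LSTM sketch: the hidden and cell states are the "memory"
# carried from one word (time step) to the next.
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=16, hidden_size=32, batch_first=True)

# One sequence of 10 time steps, e.g., 10 word embeddings of size 16
sequence = torch.randn(1, 10, 16)

outputs, (hidden, cell) = lstm(sequence)
print(outputs.shape)  # torch.Size([1, 10, 32]): one output per time step
print(hidden.shape)   # torch.Size([1, 1, 32]): the final memory state
```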
There is another network called the convolutional neural network (CNN). Without going into much detail, CNNs are very good with images and lead the world in image recognition and similar jobs. CNNs (or ConvNets) were invented in 1980 by Kunihiko Fukushima and developed by Yann LeCun, but they only really became popular in the 2000s, when GPUs became available. Chellapilla et al. tested the speeds of training CNNs on CPUs and GPUs and found the network trained 4.1 times faster on GPUs [Fukushima1980, LeCun1989, Chellapilla2006]. Sometimes, your inventions take time to bear fruit, but keep inventing! CNNs use many layers or stages that do different mathematical operations on their inputs to look at them in different ways: learning local filters, pooling nearby regions of each image, zeroing negative numbers (ReLU), taking detail out at random (dropout layers), and other tricks.
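As a rough illustration of those stages, here is a minimal PyTorch sketch that chains the tricks just mentioned; the layer sizes are arbitrary assumptions for demonstration:

```python
# Minimal CNN sketch: convolution filters, ReLU (zeroing negatives),
# pooling nearby regions, and dropout (taking detail out at random).
import torch
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # learn local image filters
    nn.ReLU(),                                   # zero out negative numbers
    nn.MaxPool2d(2),                             # pool nearby regions
    nn.Dropout(0.25),                            # randomly drop detail
    nn.Flatten(),
    nn.Linear(16 * 16 * 16, 10),                 # score 10 possible classes
)

image_batch = torch.randn(1, 3, 32, 32)  # one 32x32 RGB image
print(cnn(image_batch).shape)            # torch.Size([1, 10])
```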
What was needed was a model with some form of memory to remember and also generate sentences and longer pieces of writing.
In 2017, Ashish Vaswani and others published a paper called Attention Is All You Need [Vaswani, 2017]. In this important paper, a transformer architecture based on attention mechanisms was proposed. In other words, this model didn’t use recurrence or convolutions, as RNNs and CNNs do; those methods have been very successful and popular AI architectures in their own right.
Compared to RNNs and CNNs, Vaswani’s Transformer trained faster and allowed for far more parallelization.
The Transformer was the benchmark for English-to-German translation and established a new state-of-the-art single model in the WMT 2014 English-to-French translation task. It also performed this feat after being trained for a small fraction of the training times of the next best existing models. Indeed, Transformers were a groundbreaking advancement in natural language processing [Vaswani, 2017].
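The core operation behind all of this is scaled dot-product attention, which the paper defines as Attention(Q, K, V) = softmax(QK^T / sqrt(d_k))V. Here is a minimal PyTorch sketch of that formula; the tensor shapes are illustrative:

```python
# Minimal sketch of scaled dot-product attention [Vaswani, 2017].
import math
import torch

def scaled_dot_product_attention(query, key, value):
    d_k = query.size(-1)
    # Every position scores every other position in one matrix multiply,
    # so the whole sequence is processed in parallel (no recurrence).
    scores = query @ key.transpose(-2, -1) / math.sqrt(d_k)
    weights = torch.softmax(scores, dim=-1)
    return weights @ value

x = torch.randn(5, 64)  # 5 tokens, each a 64-dimensional vector
print(scaled_dot_product_attention(x, x, x).shape)  # torch.Size([5, 64])
```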
Now that we have covered the origins of LLMs, we will check out some of the earliest LLMs that were created.
There are many LLMs today and they can be put into a family tree; see Figure 1.1. The figure shows the evolution from word2vec to the most advanced LLMs in 2023: GPT-4 and Gemini [Bard].
Figure 1.1: Family tree of LLMs from word2vec to GPT-4 and Bard, from Yang2023 with permission
So, that’s all of them, but, for now, we’ll look at the earlier LLMs that led to the most advanced technologies today. We’ll start with GPT.
The development of GPT is a constantly changing and iterative process, with each new model building upon the strengths and weaknesses of its ancestors. The GPT series, initiated by OpenAI, has undergone a great deal of evolution, leading to advancements in natural language processing (NLP) and understanding.
GPT-3, the third iteration, brought a significant leap in terms of size and complexity, with an impressive 175 billion parameters. This allowed it to generate pretty human-like text across a wide range of topics and subjects [Wiki_GPT3, ProjectPro].
As the GPT series progressed, OpenAI continued to refine and enhance the architecture. In subsequent iterations, GPT-4 and GPT-4 Turbo have further pushed back the boundaries of what these LLMs can achieve. The iterative development process focuses on increasing model size and improving fine-tuning capabilities, enabling more nuanced and contextually relevant outputs.
Further to this, there are more modalities, such as GPT-4 with vision and text-to-speech.
GPT model iteration is not solely about scaling up the number of parameters; it also involves addressing the limitations observed in earlier versions. Feedback from user interactions, research findings, and technological advancements contribute to the iterative nature of the GPT series. OpenAI is constantly working to reduce the amount of inaccurate information and incoherent outputs (hallucinations) that its chatbots produce. Also, each iteration of the chatbot takes on board the lessons learned from real-world applications and user feedback.
GPT models are trained and fine-tuned on very large, diverse datasets to make sure the chatbots can adapt to many different contexts, industries, and user requirements. The iterative development approach ensures that later GPT models are better equipped to understand and generate human-like text, making them extremely valuable tools for a huge number of applications, including content creation such as blogs, scripts for videos, and copywriting (writing the text in adverts) as well as conversational agents (chatbots and AI assistants).
The way GPT models are developed iteratively shows OpenAI’s commitment to continuous improvement and innovation in the field of LLMs, allowing even more sophisticated and capable models to be built from these models in the future.
Here are the dates for when the different versions of GPT were launched:
GPT was first launched in June 2018
GPT-2 was released in February 2019
GPT-3 in 2020
GPT-3.5 in 2022
ChatGPT in November 2022

There will be more on the GPT family later, in the GPT-4/GPT-4 Turbo section.
Here, we will detail the architecture of LLMs and how they operate.
To comprehend the roots and development of Bidirectional Encoder Representations from Transformers (BERT), we must know more about the intricate and fast-moving landscape of neural networks. Without hyperbole, BERT was a seriously important innovation in NLP, part of the ongoing evolution of AI. BERT was the state of the art for a wide range of NLP tasks in October 2018, when it was released [Gemini]. This included question answering, sentiment analysis, and text summarization.
BERT also paved the way for later R&D of LLMs; it played a pivotal role in LLM development. BERT, being open source, helped to speed up LLM advancement.
BERT takes some of its DNA from RNNs (mentioned in the Origins of LLMs section), the neural nets that loop back on themselves to create a kind of memory, although rather limited memory.
The invention of the first transformer architecture was key to the origin of BERT. The creation of BERT as a bidirectional encoder (these go backward and forward along a sentence) drew inspiration from the transformer’s attention-based mechanism, allowing it to capture contextual relationships between words in both directions within a sentence.
So, BERT’s attention is bidirectional (left-to-right and right-to-left context). At its creation, this was unique, and it enabled BERT to gain a more comprehensive understanding of nuanced language semantics.
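You can see that bidirectionality at work with a quick masked-word experiment. Here is a minimal sketch using the Hugging Face fill-mask pipeline and the public bert-base-uncased checkpoint (the example sentence is mine):

```python
# Minimal sketch: BERT guesses the blank using context on BOTH sides.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# "The" before the blank and "barked loudly" after it both inform the guess.
for prediction in fill_mask("The [MASK] barked loudly at the postman."):
    print(prediction["token_str"], round(prediction["score"], 3))
# Top completions will likely include "dog"
```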
While BERT’s foundations are in transformer architecture, its characteristics have evolved with further research and development, though it is not currently in development. Each iteration of BERT refined and expanded its capabilities.
The BERT LLM was a stage of the ongoing innovation in AI. BERT’s ability to understand language bidirectionally, drawing insights from both preceding and succeeding words, is part of the endeavors taken to achieve the creation of an AI with a sufficiently deep awareness of the intricacies of natural language.
Figure 1.2: Architecture of BERT, a bidirectional encoder (reproduced from GeekCultureBERT)
Understanding the ancestry of Language Model for Dialogue Applications (LaMDA) involves tracing the roots of its architectural design and the evolutionary path it followed in the landscape of NLP. LaMDA, like its counterparts, emerges from a family of models that have collectively revolutionized how machines comprehend and generate human-like text.
RNNs, mentioned in this chapter’s first section, play a pivotal role in LaMDA’s family tree.
The breakthrough came with the invention of transformer architectures, and LaMDA owes a significant debt to the transformative Attention Is All You Need paper [Vaswani, 2017]. This paper laid the groundwork for a novel approach, moving away from sequential processing to a more parallelized and attention-based mechanism.
The LaMDA LLM inherits its core architecture from the transformer family and was developed by Google. These models learn very well how words in a sentence relate to each other. This allows a transformer to have a richer understanding of language. This change from using traditional processing in sequence was a paradigm shift in NLP, enabling LaMDA to more effectively grasp nuanced interactions and dependencies within texts.
While the origins lie in the transformer architecture, LaMDA’s unique characteristics may have been fine-tuned and evolved through subsequent research and development efforts. LaMDA’s lineage is not just a linear progression but a family tree, a branching exploration of many possibilities, with each iteration refining and expanding its capabilities. In Figure 1.1, LaMDA is near ERNIE 3.0, Gopher, and PaLM on the right of the main, vertical blue branch.
Simply put, LaMDA is a product of ongoing innovation and refinement in the field of AI, standing on the shoulders of earlier models and research breakthroughs. Its ability to comprehend and generate language is deeply rooted in an evolutionary process of learning from vast amounts of text data, mimicking the way humans process and understand language on a grand, digital scale.
LaMDA was launched in May 2021.
LLaMA is the AI brainchild of Meta AI. It might not be one you’ve heard the most about but its lineage holds stories of innovation and evolution, tracing a fascinating path through the history of AI communication.
Like the other chatbot LLMs, LLaMA’s roots are also in transformer architectures. These models rely on intricate attention mechanisms, allowing them to analyze relationships between words, not just their sequence.
Trained on massive datasets of text and code, LLaMA learned to generate basic responses, translate languages, and even write different kinds of creative text formats.
However, like newborn foals, these early models were limited. They stumbled with complex contexts, lacked common-sense reasoning, and sometimes sputtered out nonsensical strings.
Yet their potential was undeniable. The ability to learn and adapt from data made them valuable tools for researchers. Meta AI nurtured these nascent models, carefully tweaking their architecture and feeding them richer datasets. They delved deeper into the understanding of human language, acquiring skills such as factual grounding, reasoning, and the ability to engage in multi-turn conversations (Wiki_llama).
The Llama family tree is not a linear progression but, rather, a family of multiple branches of exploration. Different versions explored specific avenues: Code Llama focused on code generation, while Megatron-Turing NLG 530 B was trained on filling in missing words, reading comprehension, and common-sense reasoning, among other things (CodeLlama 2023, Megatron-Turing 2022).
For an idea of how LLaMA fits into the evolutionary tree, see Figure 1.1 at the top left of the vertical blue branch, near Bard (Gemini).
Each experiment, each successful leap forward, contributed valuable DNA to future generations.
Why the name Megatron-Turing NLG 530 B? Megatron because it represents a powerful hardware and software framework; Turing to honor Alan Turing, a founding figure of AI and ML; NLG stands for natural language generation; and it has 530 billion parameters.
Meta AI continues to shepherd the Llama family, and the future promises more exciting developments.
Llama LLM was launched in February 2023, while Megatron-Turing NLG 530 B was released in January 2022.
Now that we have covered the origins and explored the early stages of LLMs, let us fast-forward and talk about modern LLMs in the next section.
After the explosive take-off of ChatGPT in late 2022, with 1 million active users in 5 days and 100 million active users in January 2023 (about 2 months), 2023 was a pretty hot year for LLMs, AI research, and the use of AI in general.
Most tech companies have worked on their own LLMs or transformer models to use and make publicly available. Many companies, organizations, and individuals (students included) have used LLMs for a multitude of tasks. OpenAI keeps updating its GPT family and Google keeps updating its Bard version. Bard became Gemini in February 2024, so all references to Bard have changed to Gemini. Many companies use ChatGPT or GPT-4 as the core of their offering, just creating a wrapper and selling it.
This might change as OpenAI keeps adding modalities (speech, image, etc.) to the GPTs and even a new marketplace platform where users can create and sell their own GPT agents right on OpenAI servers. This was launched in early January 2024 to paid users ($20/month before VAT). We’ll cover some of the latest LLMs that companies have worked on in the following sections.
GPT-4 Turbo, OpenAI’s latest hot chatbot, is another big upgrade. It’s the GPT-4 you know, but on steroids, with 10 times more memory and a newfound understanding of images.
If GPT-4 was a gifted writer, GPT-4 Turbo is a multimedia polymath. It can not only spin captivating stories and poems but also decipher images, paint vivid digital landscapes, and even caption photos with witty remarks. Its knowledge base is also refreshed more often than its predecessors’, keeping it sharper on current events.
But it’s not just about flashy tricks. Turbo is a stickler for facts. It can tap into external knowledge bases and employs sophisticated reasoning to make its responses more accurate and reliable. Biased or misleading outputs haven’t disappeared entirely, but Turbo strives for truth and clarity, making it a more trustworthy companion for learning and exploration.
The best part? OpenAI isn’t keeping this powerhouse locked away. They’ve crafted an API and developer tools, inviting programmers and innovators to customize Turbo for specific tasks and domains. This democratization of advanced language processing opens doors to a future where everyone, from artists to scientists, can harness the power of language models to create, analyze, and understand the world around them.
GPT-4 Turbo is widely considered one of the most capable LLMs available at the moment, showing us the breathtaking potential of this technology. It’s not just a language model; it’s a glimpse into a future where machines understand and interact with us like never before. So, buckle up! The future of language is here, and it’s powered by GPT-4 Turbo.
GPT-4 was launched in March 2023 and GPT-4 Turbo in November 2023 (Wiki_GPT4, OpenAI_GPT4Turbo, Gemini).
GPT-4o or GPT-4 omni was released in May 2024, and it can understand multiple formats of data. Omni is faster than previous models and can respond to speech in 0.32 seconds on average, similar to human response times, while Turbo takes about 5.4 seconds to respond in Voice Mode.
This is partially because Voice Mode with Turbo is a pipeline of three models: a simple model transcribes the audio into text, GPT-4 Turbo takes in the text, and a third model converts the text back into an audio response. Omni, by contrast, is a single model that understands audio, video, and text. The three-model pipeline is slower than omni, and a lot of information is lost to GPT-4 Turbo in transcription.
GPT-4o is much better than GPT-4 Turbo in non-English human languages.
The Omni API is also half the cost of Turbo (OpenAI-GPT-4o)!
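For coders, the simplest way to try GPT-4o is via OpenAI’s official Python library. This is a minimal sketch assuming the openai package (version 1.x) and an OPENAI_API_KEY environment variable; the prompt is illustrative:

```python
# Minimal GPT-4o call sketch (requires OPENAI_API_KEY to be set).
from openai import OpenAI

client = OpenAI()  # reads the API key from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a haiku about transformers."}],
)
print(response.choices[0].message.content)
```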
GPT-4o does very well on code generation versus Claude 3 Opus and Gemini 1.5 Pro. Claude is moderate, Gemini is judged to be very good, and GPT-4o is excellent [encord].
OpenAI has not released the architecture or full details of GPT-4; this is proprietary information for now, but we can piece together elements from similar work.
GPT-4 is reported to have 1.75 trillion parameters (1.75 million million) (MotiveX_Gemini).
The vision transformer will likely involve some encoder-decoder architecture: image and video inputs for the encoder, then the decoder will generate output such as text descriptions or captions as well as images (Gemini).
It will have an attention mechanism because “attention is all you need.”
The vision components will probably be multi-head, to process various aspects of the input simultaneously. There should also be positional encoding, image pre-processing layers, and modality fusion.
Modality fusion is where the vision capabilities are combined with the faculties to process text. From this, it would need to generate a unified understanding of the inputs or the scene given to it.
So, GPT-4 can understand images, and it’s believed that it uses a combination of Vision Transformer (ViT) and Flamingo visual language models.
Figure 1.3 shows the architecture of ViT (reproduced from Wagh).
Figure 1.3: This is what the internal workings of ViT involve (reproduced from Wagh)
So, the inner workings of GPT-4 that handle vision processing likely involve visual transformers as shown in the preceding figure, along with the text processors in the How an LLM processes a sentence subsection.
You can find out more about ViT here: https://github.com/lucidrains/vit-pytorch.
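As a taste of that library, here is a sketch of building a ViT, following the repository’s README at the time of writing (the hyperparameters are the README’s example values, not GPT-4’s actual ones, which are unknown):

```python
# Minimal ViT sketch using the lucidrains/vit-pytorch library
# (pip install vit-pytorch); hyperparameters are illustrative.
import torch
from vit_pytorch import ViT

vit = ViT(
    image_size=256,   # input images are 256x256
    patch_size=32,    # split into 32x32 patches, treated like "words"
    num_classes=1000,
    dim=1024,
    depth=6,          # number of transformer layers
    heads=16,         # attention heads
    mlp_dim=2048,
)

image = torch.randn(1, 3, 256, 256)
print(vit(image).shape)  # torch.Size([1, 1000]): class scores
```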
The latest official LLaMA, LLaMA-2, is capable of holding complicated conversations, generating various creative text formats, and even adapting its responses to specific user personalities.
OpenLLaMA is an open source version of LLaMA released by Open LM Research (Watson 2023, OpenLMR, Gemini). OpenLLaMA has several versions, each trained on different datasets but the training process was very similar to the original LLaMA. Model weights can be found on the HuggingFace Hub and accessed without the need for any additional permission. The HuggingFace page for Open LLaMA is here: https://huggingface.co/docs/transformers/en/model_doc/open-llama.
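Here is a hedged sketch of loading one of those checkpoints with transformers; the model ID openlm-research/open_llama_3b is one of the published sizes (check the Hub for current names), and downloading several gigabytes of weights is required:

```python
# Minimal OpenLLaMA generation sketch; the model ID is one published size.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openlm-research/open_llama_3b"
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

inputs = tokenizer("The best way to learn to code is", return_tensors="pt")
output_ids = model.generate(inputs.input_ids, max_new_tokens=20)
print(tokenizer.decode(output_ids[0]))
```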
OpenLLaMA models serve as benchmarks for LLM research. Their open source nature makes it possible to compare them with other models. This is made easier because there are PyTorch and TensorFlow formats available.

LLaMA-2 was released in July 2023.
OpenLLaMA was released in June 2023.
In early 2024, the rumors are that LLaMA-3 will be released this year.

Google’s Gemini is a chatbot LLM with access to the internet; it just requires a Google login. Technically, Gemini is the face, and the brain is whatever model Google slots in.
Previously, Bard was powered by PaLM 2.
As of writing (early February 2024), Bard has been renamed Gemini: because Bard came to be powered by the Gemini Pro model, the product name changed to match. There are three versions of the model: Nano, Pro, and Ultra. Nano is for mobile devices. There may soon be a paid version.
Bard, as it then was, was released in March 2023 (Wiki_Gemini).
Gemini has 142.4 million users, 62.6% of which are in the USA (AnswerIQ).
Gemini is one of the LLMs and AIs developed and used by Google/Alphabet. Let’s take a peek under the hood to understand what makes Gemini tick!
Gemini is trained on a vast library of the world’s books, articles, and internet chatter. Its original brain, LaMDA, was trained on the 1.56-trillion-word Infiniset dataset, about 750 GB of data, and has 137 billion parameters, which are the neural network weights (ChatGPT has 175 billion parameters/weights) (ProjectPro).
In November 2023, Bard got an upgrade and started to be powered by Gemini, a new AI system (SkillLeapAI). Previously, Bard was powered by LaMDA from March 2023, then PaLM 2 from May 2023.
There are three models, Gemini Nano, Gemini Pro, and Gemini Ultra. As of 19th January 2024, Gemini is powered by Gemini Ultra, which was launched in December 2023.
Figure 1.4 shows the architecture of Gemini (GeminiTeam).
Figure 1.4: Bard/Gemini architecture, from the DeepMind GeminiTeam (GeminiTeam)
Gemini can deal with combinations of text, images, audio, and video inputs, which are represented as different colors here. Outputs can be text and images combined.
The transition to Gemini Ultra marks a significant leap in Gemini’s capabilities, offering higher performance, greater efficiency, and a wider range of potential applications (Gemini). Bard/Gemini Ultra has a complex architecture that is like a sophisticated language-processing factory, with each component playing a crucial role in understanding your questions and crafting the perfect response.
The key component is the transformer decoder, the brain of the operation. It analyzes the incoming text, dissecting each word’s meaning and its connection to others. It’s like a skilled translator, deciphering the message you send and preparing to respond fluently.
The Gemini Ultra multimodal encoder can handle more than just text. Images, audio, and other data types can be processed, providing a richer context for the decoder. This allows Gemini to interpret complex situations, such as describing an image you send or composing music based on your mood.
To polish the decoder’s output, pre-activation and post-activation transformers come into play. These additional layers refine and smoothen the response, ensuring it’s clear, grammatically correct, and reads like natural, human language. The factual grounding module reduces hallucination by anchoring responses in the real world. Just like a reliable teacher, it aims to keep Gemini’s information accurate and unbiased, grounding its creativity in a strong foundation of truth. Beyond basic understanding, Gemini Ultra also has reasoning abilities. It can answer complex questions, draw logical conclusions, and even solve problems.
The Gemini implementation also has a handy link to Google Search to help users fact-check its responses. At the bottom of the output, above the input window, Google enables you to double-check the response.
Figure 1.5: Gemini’s Google search button to fact-check the output it gives you
Click this, and it runs a Google search, outputting some search results and a guide to what you’re seeing.
Figure 1.6: Google search based on its output
Figure 1.7 shows what the highlighting means.
Figure 1.7: Understanding the results of the Google search to help fact-check
On your Gemini screen, you’ll see various passages highlighted in brown or green. Green highlighting means Google found search results that agree with the statement, brown highlighting means the sources disagree, and no highlighting means there wasn’t enough information to confirm it either way.
This is just a simplified glimpse into Gemini Ultra’s architecture and functioning. With its massive parameter count, self-attention mechanisms, and fine-tuning capabilities, it’s a constantly evolving language maestro, pushing the boundaries of what LLMs can achieve.
Amazon has been developing an enormous new LLM, reportedly a hulking beast dwarfing even OpenAI’s GPT-4 in sheer size. But this isn’t just a power contest. Olympus, as it is called, aims for something more: a significant leap in coherence, reasoning, and factual accuracy. Amazon’s chatbot, Metis, is powered by Olympus: https://happyfutureai.com/amazons-metis-a-new-ai-chatbot-powered-by-olympus-llm/.
With no half-baked ideas, Olympus digs deep, thinks logically, and double-checks its facts before uttering a word. Amazon is purportedly working to reduce bias and misinformation. This LLM strives for high levels of wisdom and reliability.
It’s not just about bragging rights for Amazon. Olympus represents a potential turning point for language models.
The aim is to be able to tackle complex tasks with pinpoint accuracy, grasp subtle nuances of meaning, and engage in intelligent, fact-based conversations with other AI.
Olympus will, hopefully, be a more thoughtful companion capable of deeper understanding and insightful exchange.
Olympus may not be ready to join your book club just yet, but its story is worth watching. Hopefully, Olympus will be a needed advancement for LLMs: one that doesn’t hallucinate, produces only truth, and changes what LLMs can do.
Amazon Olympus should have around two trillion parameters (weights and biases) (Life_Achritecture).
Amazon Olympus is expected in the second half of 2024 but not much information has come out since November 2023.
Now that we have introduced many of the modern LLMs, let’s look at how they work, including using an example piece of text.
Moving on to the general transformers, Figure 1.8 shows the structure of a Transformer:
Figure 1.8: Architecture of a Transformer: an encoder for the inputs and a decoder for the outputs (reproduced from Zahere)
You can see that it has an encoder and a decoder. The encoder learns the patterns in the data and the decoder tries to recreate them.
The encoder has multiple neural network layers. In transformers, each layer uses self-attention, allowing the encoder to understand how the different parts of the sentence fit together and understand the context.
Here is a quick version of the transformer process:
Encoder network:
- Uses multiple layers of neural networks.
- Each layer employs self-attention to understand relationships between sentence parts and context.
- Creates a compressed representation of the input.

Decoder network:
- Utilizes the encoder’s representation for generating new outputs.
- Employs multiple layers with cross-attention for information exchange with the encoder.
- Generates meaningful outputs such as translations, summaries, or answers based on input.

Encoder-decoder partnership:
- Combined, they power the transformer for various tasks with high accuracy and flexibility.
- For example, Microsoft Bing leverages GPT-4, a transformer model, to understand user intent and context beyond keywords for delivering relevant search results.

Beyond keywords:
- Bing transforms from a search engine to an AI-powered copilot using GPT-4.
- It interprets questions and requests by analyzing context and intent, not just keywords.
- For example, instead of only providing ingredient lists, it recommends personalized recipes considering dietary needs and skill levels.

From links to understanding:
- Bing evolves beyond finding links to comprehending user needs and delivering relevant, helpful information.
Next is the detailed version of the Transformer process.
The encoder produces a compressed representation of the input. This allows the decoder to not only consider its own outputs but also look back at the encoder’s representation, which contains a representation of the whole input sequence for guidance. This is used by the decoder for each step of its output generation.
The decoder uses output from the encoder to generate a new output sequence. Because of Transformers, modern LLMs can hold entire sentences or paragraphs in their attention, not just one word at a time like RNNs.
Again, this section has lots of layers but, this time, there is cross-attention.
This back-and-forth conversation between the decoder and the encoder’s compressed knowledge empowers the decoder to generate meaningful and relevant outputs, such as translating a sentence to another language, summarizing a paragraph, or answering a question based on the input.
Together, the encoder and decoder form the powerhouse of the transformer, enabling it to perform a wide range of tasks with remarkable accuracy and flexibility.
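PyTorch ships this encoder-decoder pairing as a ready-made module. Here is a minimal sketch using nn.Transformer with illustrative shapes (the module’s default layout is sequence-first):

```python
# Minimal encoder-decoder sketch with PyTorch's built-in Transformer.
import torch
import torch.nn as nn

transformer = nn.Transformer(d_model=512, nhead=8,
                             num_encoder_layers=6, num_decoder_layers=6)

source = torch.randn(10, 1, 512)  # e.g., 10 source-sentence embeddings
target = torch.randn(7, 1, 512)   # e.g., 7 target tokens produced so far

# The encoder self-attends over the source; at every layer, the decoder
# cross-attends to the encoder's compressed representation.
output = transformer(source, target)
print(output.shape)  # torch.Size([7, 1, 512])
```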
Microsoft’s Bing search engine uses GPT-4 to deliver more relevant search results, understanding your intent and context beyond just keywords.
Bing has gone from a search engine to an AI-powered copilot with the help of GPT-4. This powerful language model acts as Bing’s brain, understanding your questions and requests not just through keywords, but by analyzing the context and intent.
You can, for example, ask for a recipe instead of just ingredients; GPT-4 scours the web, considers your dietary needs and skill level, and then presents a personalized selection. It’s like having a knowledgeable friend helping you navigate the vast ocean of information. So, Bing isn’t just about finding links anymore; it’s about understanding what you truly need and delivering it in a way that’s relevant and helpful (https://www.bing.com/).
The whole process of getting a paragraph into an LLM goes like this:
1. Cleaning
2. Tokenization
3. Word-to-number conversion (words given indices: 1, 2, 3, 4…)
4. Numbers are turned into vectors
5. Contextual embedding
6. Context vectors are formed
7. Attention vectors are formed and fed into final blocks
8. Subsequent words are predicted

(ChatGPT, Gemini, Panuganty, Aakanksha).
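Here is a hedged sketch of the first few of those steps, using a BERT tokenizer and encoder from Hugging Face purely for demonstration (the sentence and model choice are illustrative assumptions):

```python
# Sketch of tokenization -> indices -> contextual vectors (steps 1-6).
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

text = "Transformers changed natural language processing."
encoded = tokenizer(text, return_tensors="pt")
print(encoded.input_ids)  # words converted to numbers (indices)

with torch.no_grad():
    embeddings = model(**encoded).last_hidden_state
print(embeddings.shape)   # one context vector per token
```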
With this framework in your subconscious, we can go