Description

Stable Diffusion is a game-changing AI tool that enables you to create stunning images with code. The author, a seasoned Microsoft applied data scientist and contributor to the Hugging Face Diffusers library, leverages his 15+ years of experience to help you master Stable Diffusion by understanding the underlying concepts and techniques.
You’ll be introduced to Stable Diffusion, grasp the theory behind diffusion models, set up your environment, and generate your first image using diffusers. You'll optimize performance, leverage custom models, and integrate community-shared resources like LoRAs, textual inversion, and ControlNet to enhance your creations. Covering techniques such as face restoration, image upscaling, and image restoration, you’ll focus on overcoming prompt limitations, scheduled prompt parsing, and weighted prompts to create a fully customized and industry-level Stable Diffusion app. This book also looks into real-world applications in medical imaging, remote sensing, and photo enhancement. Finally, you'll gain insights into extracting generation data, ensuring data persistence, and leveraging AI models like BLIP for image description extraction.
By the end of this book, you'll be able to use Python to generate and edit images and leverage solutions to build Stable Diffusion apps for your business and users.




Using Stable Diffusion with Python

Leverage Python to control and automate high-quality AI image generation using Stable Diffusion

Andrew Zhu (Shudong Zhu)

Using Stable Diffusion with Python

Copyright © 2024 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Group Product Manager: Niranjan Naikwadi

Publishing Product Manager: Sanjana Gupta

Senior Editor: Tazeen Shaikh

Senior Content Development Editor: Joseph Sunil

Technical Editor: Rahul Limbachiya

Copy Editor: Safis Editing

Project Coordinator: Shambhavi Mishra

Proofreader: Tazeen Shaikh

Indexer: Tejal Daruwale Soni

Production Designer: Alishon Mendonca

Marketing Coordinator: Vinishka Kalra

First published: June 2024

Production reference: 1170524

Published by Packt Publishing Ltd.

Grosvenor House

11 St Paul’s Square

Birmingham

B3 1RB, UK

ISBN 978-1-83508-637-7

www.packtpub.com

To my beloved wife, Yinhua Fan, and our precious sons, Charles Zhu and Daniel Zhu.

You are the spark that ignites my creativity, the fuel that drives my passion, and the love that sustains me throughout this journey. Without your unwavering support, encouragement, and inspiration, this book would not have been possible.

– Andrew Zhu

Foreword

Artificial intelligence has ushered in a new era of creativity, with generative models offering a glimpse of capabilities that once seemed futuristic. Stable Diffusion stands out as an innovative leap forward, blending technical sophistication with practical application in a way that empowers creators across diverse domains.

Andrew Zhu's book is a comprehensive resource for understanding Stable Diffusion's technical underpinnings. He provides an in-depth exploration of its foundational principles, contrasts it with alternative generative models, and demonstrates how to apply it to varied creative fields.

Technologies like Stable Diffusion are catalysts for creativity, offering accelerated workflows for content refinement, editing, and generation. They harness cutting-edge optimization techniques to produce detailed and unique imagery, setting a new standard in the efficiency of image creation and enabling high-quality results that were once only attainable through meticulous manual effort. Whether you're building data-driven applications or experimenting with imaginative visual storytelling, the model's robust and scalable architecture is designed to seamlessly integrate into diverse production environments.

The pace of AI advancements, especially in deep learning, is unparalleled. We're witnessing unprecedented growth as new architectures and optimization techniques continuously redefine the state-of-the-art. Stable Diffusion stands as a testament to this rapid evolution, providing practical solutions that enable independent creators and organizations alike to quickly translate their ideas into reality, often reducing what once took weeks to mere hours.

These are thrilling times to be exploring Stable Diffusion. Andrew’s carefully curated insights will undoubtedly inspire innovative projects that harness the true power of generative AI. Dive into this book with curiosity and enthusiasm, and you'll embark on a journey that will not only inform but ignite new creative possibilities.

Andrew has done groundbreaking work in understanding the nuances of generative models and creating practical methodologies that push the boundaries of what's possible. His book stands as a testament to this ingenuity, providing insights that will guide both enthusiasts and professionals toward the next era of creativity and technological progress.

Enjoy the read! The adventure is just beginning.

– Matt Fisher, Co-Founder and CTO, Dahlia Labs

Contributors

About the author

Andrew Zhu (Shudong Zhu) is a seasoned Microsoft applied data scientist with over 15 years of experience in the tech industry. Renowned for his exceptional ability to distill complex machine learning and AI concepts into engaging, informative narratives, Andrew regularly contributes to esteemed publications such as Towards Data Science. His previous book, Microsoft Workflow Foundation 4.0 Cookbook, earned a commendable 4.5-star average rating on Amazon.

As a contributor to the popular Hugging Face Diffusers library, a leading Stable Diffusion Python library and a primary focus of this book, Andrew brings unparalleled expertise to the table. Currently, he leads the AI department at a stealth start-up company, leveraging his extensive research background and proficiency in generative AI to transform the online shopping experience and pioneer the future of AI in retail.

Outside of his professional pursuits, Andrew resides in Washington State, USA, with his loving family, including his two sons.

About the reviewers

Krishnan Raghavan is an IT professional with over 20 years of experience in the areas of software development and delivery excellence across multiple domains and technologies, ranging from C++ to Java, Python, Angular, Golang, and data warehousing.

When not working, Krishnan likes to spend time with his wife and daughter, as well as reading fiction, nonfiction, and technical books and participating in hackathons. Krishnan tries to give back to the community; he is a part of the GDG – Pune volunteer group, helping the team to organize events.

You can connect with Krishnan at [email protected].

I would like to thank my wife, Anita, and daughter, Ananya, for giving me the time and space to review this book.

Swagata Ashwani is a data professional with over seven years of experience in the healthcare, retail, and platform integration industries. She is an avid blogger and writes about state-of-the-art developments in the AI space. She is particularly interested in Natural Language Processing (NLP) and focuses on researching how to make NLP models work in a practical setting. She is also a podcast host and chapter lead of Women in Data, and loves to advocate for women in technology roles and the responsible use of AI in this fast-paced era. In her spare time, she loves to play her guitar, sip masala chai, and find new spots for doing yoga.

Table of Contents

Preface

Part 1 – A Whirlwind of Stable Diffusion

1

Introducing Stable Diffusion

Evolution of the Diffusion model

Before Transformer and Attention

Transformer transforms machine learning

CLIP from OpenAI makes a big difference

Generate images

DALL-E 2 and Stable Diffusion

Why Stable Diffusion

Which Stable Diffusion to use

Why this book

References

2

Setting Up the Environment for Stable Diffusion

Hardware requirements to run Stable Diffusion

GPU

System memory

Storage

Software requirements

CUDA installation

Installing Python for Windows, Linux, and macOS

Installing PyTorch

Running a Stable Diffusion pipeline

Using Google Colaboratory

Using Google Colab to run a Stable Diffusion pipeline

Summary

References

3

Generating Images Using Stable Diffusion

Logging in to Hugging Face

Generating an image

Generation seed

Sampling scheduler

Changing a model

Guidance scale

Summary

References

4

Understanding the Theory Behind Diffusion Models

Understanding the image-to-noise process

A more efficient forward diffusion process

The noise-to-image training process

The noise-to-image sampling process

Understanding Classifier Guidance denoising

Summary

References

5

Understanding How Stable Diffusion Works

Stable Diffusion in latent space

Generating latent vectors using diffusers

Generating text embeddings using CLIP

Initializing time step embeddings

Initializing the Stable Diffusion UNet

Implementing a text-to-image Stable Diffusion inference pipeline

Implementing a text-guided image-to-image Stable Diffusion inference pipeline

Summary

References

Additional reading

6

Using Stable Diffusion Models

Technical requirements

Loading the Diffusers model

Loading model checkpoints from safetensors and ckpt files

Using ckpt and safetensors files with Diffusers

Turning off the model safety checker

Converting the checkpoint model file to the Diffusers format

Using Stable Diffusion XL

Summary

References

Part 2 – Improving Diffusers with Custom Features

7

Optimizing Performance and VRAM Usage

Setting the baseline

Optimization solution 1 – using the float16 or bfloat16 data type

Optimization solution 2 – enabling VAE tiling

Optimization solution 3 – enabling Xformers or using PyTorch 2.0

Optimization solution 4 – enabling sequential CPU offload

Optimization solution 5 – enabling model CPU offload

Optimization solution 6 – Token Merging (ToMe)

Summary

References

8

Using Community-Shared LoRAs

Technical requirements

How does LoRA work?

Using LoRA with Diffusers

Applying a LoRA weight during loading

Diving into the internal structure of LoRA

Finding the A and B weight matrices from the LoRA file

Finding the corresponding checkpoint model layer name

Updating the checkpoint model weights

Making a function to load LoRA

Why LoRA works

Summary

References

9

Using Textual Inversion

Diffusers inference using TI

How TI works

Building a custom TI loader

TI in the pt file format

TI in bin file format

Detailed steps to build a TI loader

Putting all of the code together

Summary

References

10

Overcoming 77-Token Limitations and Enabling Prompt Weighting

Understanding the 77-token limitation

Overcoming the 77-token limitation

Putting all the code together into a function

Enabling long prompts with weighting

Verifying the work

Overcoming the 77-token limitation using community pipelines

Summary

References

11

Image Restore and Super-Resolution

Understanding the terminologies

Upscaling images using Img2img diffusion

One-step super-resolution

Multiple-step super-resolution

A super-resolution result comparison

Img-to-Img limitations

ControlNet Tile image upscaling

Steps to use ControlNet Tile to upscale an image

The ControlNet Tile upscaling result

Additional ControlNet Tile upscaling samples

Summary

References

12

Scheduled Prompt Parsing

Technical requirements

Using the Compel package

Building a custom scheduled prompt pipeline

A scheduled prompt parser

Filling in the missing steps

A Stable Diffusion pipeline supporting scheduled prompts

Summary

References

Part 3 – Advanced Topics

13

Generating Images with ControlNet

What is ControlNet and how is it different?

Usage of ControlNet

Using multiple ControlNets in one pipeline

How ControlNet works

Further usage

More ControlNets with SD

SDXL ControlNets

Summary

References

14

Generating Video Using Stable Diffusion

Technical requirements

The principles of text-to-video generation

Practical applications of AnimateDiff

Utilizing Motion LoRA to control animation motion

Summary

References

15

Generating Image Descriptions Using BLIP-2 and LLaVA

Technical requirements

BLIP-2 – Bootstrapping Language-Image Pre-training

How BLIP-2 works

Using BLIP-2 to generate descriptions

LLaVA – Large Language and Vision Assistant

How LLaVA works

Installing LLaVA

Using LLaVA to generate image descriptions

Summary

References

16

Exploring Stable Diffusion XL

What’s new in SDXL?

The VAE of SDXL

The UNet of SDXL

Two text encoders in SDXL

The two-stage design

Using SDXL

Using SDXL community models

Using SDXL image-to-image to enhance an image

Using SDXL LoRA models

Using SDXL with an unlimited prompt

Summary

References

17

Building Optimized Prompts for Stable Diffusion

What makes a good prompt?

Be clear and specific

Be descriptive

Using consistent terminology

Reference artworks and styles

Incorporate negative prompts

Iterate and refine

Using LLMs to generate better prompts

Summary

References

Part 4 – Building Stable Diffusion into an Application

18

Applications – Object Editing and Style Transferring

Editing images using Stable Diffusion

Replacing image background content

Removing the image background

Object and style transferring

Loading up a Stable Diffusion pipeline with IP-Adapter

Transferring style

Summary

References

19

Generation Data Persistence

Exploring and understanding the PNG file structure

Saving extra text data in a PNG image file

PNG extra data storage limitation

Summary

References

20

Creating Interactive User Interfaces

Introducing Gradio

Getting started with Gradio

Gradio fundamentals

Gradio Blocks

Inputs and outputs

Building a progress bar

Building a Stable Diffusion text-to-image pipeline with Gradio

Summary

References

21

Diffusion Model Transfer Learning

Technical requirements

Training a neural network model with PyTorch

Preparing the training data

Preparing for training

Training a model

Training a model with Hugging Face’s Accelerate

Applying Hugging Face’s Accelerate

Putting code together

Training a model with multiple GPUs using Accelerate

Training a Stable Diffusion V1.5 LoRA

Defining training hyperparameters

Preparing the Stable Diffusion components

Loading the training data

Defining the training components

Training a Stable Diffusion V1.5 LoRA

Kicking off the training

Verifying the result

Summary

References

22

Exploring Beyond Stable Diffusion

What sets this AI wave apart

The enduring value of mathematics and programming

Staying current with AI innovations

Cultivating responsible, ethical, private, and secure AI

Our evolving relationship with AI

Summary

References

Index

Other Books You May Enjoy

Part 1 – A Whirlwind of Stable Diffusion

Welcome to the fascinating world of Stable Diffusion, a rapidly evolving field that has revolutionized the way we approach image generation and manipulation. In the first part of our journey, we’ll embark on a comprehensive exploration of the fundamentals, laying the groundwork for a deep understanding of this powerful technology.

Over the next six chapters, we’ll delve into the core concepts, principles, and applications of Stable Diffusion, providing a solid foundation for further experimentation and innovation. We’ll begin by introducing the basics of Stable Diffusion, followed by a hands-on guide to setting up your environment for success. You’ll then learn how to generate stunning images using Stable Diffusion, before diving deeper into the theoretical underpinnings of diffusion models and the intricacies of how Stable Diffusion works its magic.

By the end of this part, you’ll possess a broad understanding of Stable Diffusion, from its underlying mechanics to practical applications, empowering you to harness its potential and create remarkable visual content. So, let’s dive in and discover the wonders of Stable Diffusion!

This part contains the following chapters:

Chapter 1, Introducing Stable Diffusion
Chapter 2, Setting Up the Environment for Stable Diffusion
Chapter 3, Generating Images Using Stable Diffusion
Chapter 4, Understanding the Theory Behind Diffusion Models
Chapter 5, Understanding How Stable Diffusion Works
Chapter 6, Using Stable Diffusion Models

1

Introducing Stable Diffusion

Stable Diffusion is a deep learning model that utilizes diffusion processes to generate high-quality artwork from guided instructions and images.

In this chapter, we will introduce you to AI image generation technology, namely Stable Diffusion, and see how it evolved into what it is now.

Unlike other deep learning image generation models, such as OpenAI’s DALL-E 2, Stable Diffusion works by starting with a random-noise latent tensor and then gradually adding detailed information to it. The amount of detail that is added is determined by a diffusion process, governed by a mathematical equation (we will delve into the details in Chapter 5). In the final stage, the model decodes the latent tensor into the pixel image.

Since its creation in 2022, Stable Diffusion has been used widely to generate impressive images. For example, it can generate images of people, animals, objects, and scenes that are indistinguishable from real photographs. Images are generated using specific instructions, such as A cat running on the moon’s surface or a photograph of an astronaut riding a horse.

Here is a sample of a prompt to use with Stable Diffusion to generate an image using the given description:

"a photograph of an astronaut riding a horse".

Stable Diffusion will generate an image like the following:

Figure 1.1: A photograph of an astronaut riding a horse, generated by Stable Diffusion

This image didn’t exist before I hit the Enter button. It was created collaboratively by me and Stable Diffusion. Stable Diffusion not only understands the descriptions we give it, but also adds more detail to the image.

Apart from text-to-image generation, Stable Diffusion also facilitates editing photos using natural language. To illustrate, consider the preceding image again. We can replace the space background with a blue sky and mountains using an automatically generated mask and prompts.

The background prompt can be used to generate the background mask, and the blue sky and mountains prompt is used to guide Stable Diffusion to transform the initial image into the following:

Figure 1.2: Replace the background with a blue sky and mountains

No mouse-clicking or dragging is required, and there's no need for additional paid software such as Photoshop. You can achieve this using pure Python together with Stable Diffusion. Stable Diffusion can perform many other tasks using only Python code, which will be covered later in this book.
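To make the idea concrete, the following is a hedged sketch of the inpainting step only, assuming the background mask has already been produced (text-guided mask generation and the full workflow are covered later in the book). The model ID and file names are illustrative:

import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting",
    torch_dtype=torch.float16
).to("cuda")

init_image = Image.open("astronaut_riding_horse.png").convert("RGB")
# White pixels mark the background region to be repainted; black pixels are kept
mask_image = Image.open("background_mask.png").convert("RGB")

result = pipe(
    prompt="blue sky and mountains",
    image=init_image,
    mask_image=mask_image
).images[0]
result.save("astronaut_blue_sky.png")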

Stable Diffusion is a powerful tool that has the potential to revolutionize the way we create and interact with images. It can be used to create realistic images for movies, video games, and other applications. It can also be used to generate personalized images for marketing, advertising, and decoration.

Here are some of the key features of Stable Diffusion:

It can generate high-quality images from text descriptions
It is based on a diffusion process, which is a more stable and reliable way to generate images than other methods
Many massive, publicly accessible pre-trained models are available (10,000+), and the number keeps growing
New research and applications keep building on Stable Diffusion
It is open source and can be used by anyone

Before we proceed, let me provide a brief introduction to the evolution of the Diffusion model in recent years.

Evolution of the Diffusion model

Diffusion models didn’t appear out of nowhere; like Rome, they weren’t built in a day. To give you a high-level, bird’s-eye view of this technology, in this section we will discuss the overall evolution of the Diffusion model in recent years.

Before Transformer and Attention

Not too long ago, Convolutional Neural Networks (CNNs) and Residual Neural Networks (ResNets) dominated the field of computer vision in machine learning.

CNNs and ResNets have proven to be highly effective in tasks such as guided object detection and face recognition. These models have been widely adopted across various industries, including self-driving cars and AI-driven agriculture.

However, there is a significant drawback to CNNs and ResNets: they can only recognize objects that are part of their training set. To detect a completely new object, a new category label must be added to the training dataset, followed by retraining or fine-tuning the pre-trained models.

This limitation stems from the models themselves, as well as the constraints imposed by hardware and the availability of training data at that time.

Transformer transforms machine learning

The Transformer model, developed by Google, has revolutionized the field of computer vision, starting with its impact on Natural Language Processing (NLP).

Unlike traditional approaches that rely on predefined labels to calculate loss and update neural network weights through backpropagation, the Transformer model, along with the Attention mechanism, introduced a pioneering concept. They utilize the training data itself for both training and labeling purposes.

Let’s consider the following sentence as an example:

“Stable Diffusion can generate images using text”

Let’s say we input the sequence of words into the neural network, excluding the last word text:

“Stable Diffusion can generate images using”

Using this input, the model predicts the next word based on its current weights. Let’s say it predicts apple. The encoded embedding of the word apple sits far from that of text in vector space, much like two numbers with a large gap between them. This gap can serve as the loss value, which is then backpropagated to update the weights.

By repeating this process millions or even billions of times during training and updating, the model’s weights gradually learn to produce the next reasonable words in a sentence.

Machine learning models can now learn a wide range of tasks with a properly designed loss function.
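As a toy illustration of the idea just described – not how production language models actually compute their loss (they typically use cross-entropy over a vocabulary) – the distance between the predicted word’s embedding and the target word’s embedding can stand in for the loss:

import torch
import torch.nn.functional as F

# Tiny, made-up word embeddings; real models learn high-dimensional vectors
embeddings = {
    "text":  torch.tensor([0.9, 0.1, 0.0]),
    "apple": torch.tensor([0.0, 0.2, 0.9]),
}

predicted = embeddings["apple"]  # the model's (wrong) guess for the next word
target = embeddings["text"]      # the word that actually follows in the sentence

# A large gap between prediction and target means a large loss,
# which backpropagation uses to push the weights toward a better prediction
loss = F.mse_loss(predicted, target)
print(loss.item())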

CLIP from OpenAI makes a big difference

Researchers and engineers quickly recognized the potential of the Transformer model, as mentioned in the concluding remarks of the well-known machine learning paper titled Attention Is All You Need [2]. The author states the following:

We are excited about the future of Attention-based models and plan to apply them to other tasks. We plan to extend the Transformer to problems involving input and output modalities other than text and to investigate local, restricted Attention mechanisms to efficiently handle large inputs and outputs such as images, audio, and video.

If you have read the paper and grasped the remarkable capabilities of Transformer- and Attention-based models, you might also be inspired to reimagine your own work and harness this extraordinary power.

Researchers from OpenAI grasped this power and created a model called CLIP [1] that uses the Attention mechanism and Transformer model architecture to train an image classification model. The model has the ability to classify a wide range of images with no need for labeled data. It is the first large-scale image classification model trained on 400 million image-text pairs extracted from the internet.

Although there were similar efforts prior to OpenAI’s CLIP model, the results were not deemed satisfactory according to the authors of the CLIP paper [1]:

A crucial difference between these weakly supervised models and recent explorations of learning image representations directly from natural language is scale.

Indeed, scale plays a pivotal role in unlocking the remarkable superpower of universal image recognition. While other models utilized around 200,000 images, the CLIP team trained their model on a staggering 400,000,000 image-text pairs collected from the public internet.

The results are astonishing. CLIP enables image recognition and segmentation without the limitations of predefined labels. It can detect objects that previous models struggled with. CLIP has brought about a significant change through its large-scale model. Given the immense weight of CLIP, researchers have pondered whether it could also be employed for image generation from text.
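To see what this kind of zero-shot recognition looks like in practice, here is a minimal, hedged sketch using an openly available CLIP checkpoint via the Hugging Face transformers library. The model ID, image file, and candidate labels are illustrative:

from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")
labels = ["a photo of a cat", "a photo of a dog", "a photo of an apple"]

# CLIP scores the image against arbitrary text labels - no retraining required
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))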

Generate images

Using only CLIP, we still cannot generate a realistic image based on a text description. For instance, if we ask CLIP to draw an apple, the model merges various types of apples, different shapes, colors, backgrounds, and so on. CLIP might generate an apple that is half green and half red, which might not be what we intended.

You may be familiar with Generative Adversarial Networks (GANs), which are capable of generating highly photorealistic images. However, text prompts cannot be utilized in the generation process. GANs have become a sophisticated solution for image processing tasks such as face restoration and image upscaling. Nevertheless, a new innovative approach was needed to leverage models for image generation based on guided descriptions or prompts.

In June 2020, a paper titled Denoising Diffusion Probabilistic Models [3] by Jonathan Ho et al. introduced a diffusion-based probabilistic model for image generation. The term diffusion is borrowed from thermodynamics. The original meaning of diffusion is the movement of particles from a region of high concentration to a region of low concentration. This idea of diffusion inspired machine learning researchers to apply it to denoising and sampling processes. In other words, we can start with a noisy image and gradually refine it by removing noise. The denoising process gradually transforms an image with high levels of noise into a clearer version of the original image. Therefore, this generative model is referred to as a denoising diffusion probabilistic model.

The idea behind this approach is ingenious. For any given image, normally distributed noise is added to it over a limited number of steps, gradually transforming the original into an image of pure noise. What if we train a model that can reverse this diffusion process, guided by the CLIP model? Surprisingly, this approach works [4].
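The forward (noising) direction has a convenient closed form, which Chapter 4 derives in detail. As a small sketch with an illustrative noise schedule, any image tensor can be pushed to an arbitrary noise level t in a single step:

import torch

# Linear beta schedule over T steps (values are illustrative)
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

def add_noise(x0, t):
    # x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise
    noise = torch.randn_like(x0)
    return alpha_bars[t].sqrt() * x0 + (1.0 - alpha_bars[t]).sqrt() * noise

x0 = torch.rand(3, 64, 64)   # a stand-in "image" tensor in [0, 1]
x_t = add_noise(x0, t=999)   # at t close to T, this is almost pure Gaussian noise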

DALL-E 2 and Stable Diffusion

In April 2022, OpenAI released DALL-E 2, accompanied by its paper titled Hierarchical Text-Conditional Image Generation with CLIP Latents [4]. DALL-E 2 garnered significant attention worldwide. It generated a massive collection of astonishing images that spread across social networks and mainstream media. People were not only amazed by the quality of the generated images but also by its ability to create images that had never existed before. DALL-E 2 was effectively producing works of art.

A few months earlier, in December 2021, a paper titled High-Resolution Image Synthesis with Latent Diffusion Models [5] had been published by the CompVis group, introducing another diffusion-based model for text-guided image generation. Building upon CompVis’s work, researchers and engineers from CompVis, Stability AI, and LAION collaborated to release an open source counterpart of DALL-E 2, called Stable Diffusion, in August 2022.

Why Stable Diffusion

While DALL-E 2 and other commercial image generation models such as Midjourney can produce remarkable images without requiring complex environment setups or hardware preparation, these models are closed-source. Consequently, users have limited control over the generation process, cannot use their own customized models, and are unable to add custom functions to the platform.

On the other hand, Stable Diffusion is an open source model released under the CreativeML Open RAIL-M license. Users not only have the freedom to utilize the model but can also read the source code, add features, and benefit from the countless custom models shared by the community.

Which Stable Diffusion to use

When we say Stable Diffusion, which Stable Diffusion are we really referring to? Here’s a list of the different Stable Diffusion tools and the differences between them:

Stable Diffusion GitHub repo (https://github.com/CompVis/stable-diffusion): This is the original implementation of Stable Diffusion from CompVis, contributed to by many great engineers and researchers. It is a PyTorch implementation that can be used to train and generate images, text, and other creative content. The library is now less active at the time of writing in 2023. Its README page also recommends users use Diffusers from Hugging Face to use and train Diffusion models.

Diffusers from Hugging Face: Diffusers is a library for training and using diffusion models developed by Hugging Face. It is the go-to library for state-of-the-art, pre-trained diffusion models for generating images, audio, and even the 3D structures of molecules. The library is well maintained and being actively developed at the time of writing. New code is added to its GitHub repository almost every day.

Stable Diffusion WebUI from AUTOMATIC1111: This might be the most popular web-based application currently that allows users to generate images and text using Stable Diffusion. It provides a GUI interface that makes it easy to experiment with different settings and parameters.

InvokeAI: InvokeAI was originally developed as a fork of the Stable Diffusion project, but it has since evolved into its own unique platform. InvokeAI offers a number of features that make it a powerful tool for creatives.

ComfyUI: ComfyUI is a node-based UI that utilizes Stable Diffusion. It allows users to construct tailored workflows, including image post-processing and conversions. It is a potent and adaptable graphical user interface for Stable Diffusion, characterized by its node-based design.

In this book, when I use Stable Diffusion, I am referring to the Stable Diffusion model, not the GUI tools just listed. The focus of this book will be on using Stable Diffusion with plain Python. Our example code will use Diffusers’ pipelines and will also leverage code from Stable Diffusion WebUI, open source code accompanying academic papers, and other community sources.

Why this book

While the Stable Diffusion GUI tools can generate fantastic images driven by the Diffusion model, their usability is limited. The presence of dozens of knobs (and more sliders and buttons are being added) and specialized terms sometimes makes generating high-quality images a guessing game. On the other hand, the open source Diffusers package from Hugging Face gives users full control over Stable Diffusion using Python. However, it lacks many key features, such as loading custom LoRAs and textual inversions, utilizing community-shared models/checkpoints, scheduled and weighted prompts, unlimited prompt tokens, and high-resolution image fixing and upscaling (the Diffusers package does keep improving over time, however).

This book aims to help you understand all the complex terms and knobs from the internal view of the Diffusion model. The book will also assist you in overcoming the limitations of Diffusers and implementing the missing functions and advanced features to create a fully customized Stable Diffusion application.

Considering the rapid pace of AI technology evolution, this book also aims to enable you to quickly adapt to the upcoming changes.

By the end of this book, you will not only be able to use Python to generate and edit images but also leverage the solutions provided in the book to build Stable Diffusion applications for your business and users.

Let’s start the journey.

References

1. Learning Transferable Visual Models From Natural Language Supervision: https://arxiv.org/abs/2103.00020
2. Attention Is All You Need: https://arxiv.org/abs/1706.03762
3. Denoising Diffusion Probabilistic Models: https://arxiv.org/abs/2006.11239
4. Hierarchical Text-Conditional Image Generation with CLIP Latents: https://arxiv.org/abs/2204.06125v1
5. High-Resolution Image Synthesis with Latent Diffusion Models: https://arxiv.org/abs/2112.10752
6. DALL-E 2: https://openai.com/dall-e-2

2

Setting Up the Environment for Stable Diffusion

Welcome to Chapter 2. In this chapter, we will be focusing on setting up the environment to run Stable Diffusion. We will cover all the necessary steps and aspects to ensure a seamless experience while working with Stable Diffusion models. Our primary goal is to help you understand the importance of each component and how they contribute to the overall process.

The contents of this chapter are as follows:

Introduction to the hardware requirements to run Stable Diffusion
Detailed steps to install the required software dependencies: CUDA from NVIDIA, Python, a Python virtual environment (optional but recommended), and PyTorch
Alternative options for users without a GPU, such as Google Colab and Apple MacBook with silicon CPU (M series)
Troubleshooting common issues during the setup process
Tips and best practices for maintaining a stable environment

We will begin by providing an overview of Stable Diffusion, its significance, and its applications in various fields. This will help you gain a better understanding of the core concept and its importance.

Next, we will dive into the step-by-step installation process for each dependency, including CUDA, Python, and PyTorch. We will also discuss the benefits of using a Python virtual environment and guide you through setting one up.

For those who do not have access to a machine with a GPU, we will explore alternative options such as Google Colab. We will provide a comprehensive guide to using these services and discuss the trade-offs associated with them.

Finally, we will address common issues that may arise during the setup process and provide troubleshooting tips. Additionally, we will share best practices for maintaining a stable environment to ensure a smooth experience while working with Stable Diffusion models.

By the end of this chapter, you will have a solid foundation for setting up and maintaining an environment tailored for Stable Diffusion, allowing you to focus on building and experimenting with your models efficiently.

Hardware requirements to run Stable Diffusion

This section discusses the hardware requirements for running a Stable Diffusion model. This book covers Stable Diffusion v1.5 and Stable Diffusion XL (SDXL), which are also the two most widely used models at the time of writing.

Stable Diffusion v1.5, released in October 2022, is considered a general-purpose model, and can be used interchangeably with v1.4. On the other hand, SDXL, which was released in July 2023, is known for its ability to handle higher resolutions more effectively compared to Stable Diffusion v1.5. It can generate images with larger dimensions without compromising on quality.

Essentially, Stable Diffusion is a set of models that includes the following:

Tokenizer: This tokenizes a text prompt into a sequence of tokens.
Text Encoder: The Stable Diffusion text encoder is a special Transformer language model – specifically, the text encoder of a CLIP model. In SDXL, a larger OpenCLIP [6] text encoder is also used to encode the tokens into text embeddings.
Variational Autoencoder (VAE): This encodes images into a latent space and decodes them back into images.
UNet: This is where the denoising process happens. The UNet structure is employed to comprehend the steps involved in the noising/denoising cycle. It accepts certain elements such as noise, time step data, and a conditioning signal (for instance, a representation of a text description), and forecasts noise residuals that can be utilized in the denoising process.

All of these components except the tokenizer carry neural network weights. While a CPU can handle training and inference in theory, a physical machine with a GPU or another parallel computing device provides by far the best experience for learning and running Stable Diffusion models.
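Once the environment described in the rest of this chapter is ready, you can see these components for yourself by loading a pipeline with the Diffusers library. A minimal sketch follows; the model ID is an example, and the first download is several gigabytes:

from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

# The pipeline bundles the sub-models described above
print(type(pipe.tokenizer).__name__)      # CLIPTokenizer
print(type(pipe.text_encoder).__name__)   # CLIPTextModel
print(type(pipe.vae).__name__)            # AutoencoderKL
print(type(pipe.unet).__name__)           # UNet2DConditionModel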

GPU

In theory, Stable Diffusion models can run on both GPU and CPU. In reality, PyTorch-based models work best on an NVIDIA GPU with CUDA.

Stable Diffusion requires a GPU with at least 4 GB of VRAM. From my own experience, a 4 GB GPU will only let you generate 512x512 images, and even then generation can take a long time. A GPU with at least 8 GB of VRAM offers a relatively pleasant learning and usage experience. The larger the VRAM, the better.

The code in this book was tested on an NVIDIA RTX 3070 Ti with 8 GB of VRAM and an RTX 3090 with 24 GB of VRAM.
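Once PyTorch is installed (covered later in this chapter), a quick way to confirm which GPU Python can see and how much VRAM it offers is the following short check. This is not one of the book’s setup steps, just a common sanity check:

import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, VRAM: {props.total_memory / 1024**3:.1f} GB")
else:
    print("No CUDA GPU detected; Stable Diffusion will run very slowly on CPU.")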

System memory

A lot of data is transferred between the GPU and CPU, and a single Stable Diffusion model can easily take up to 6 GB of RAM. Prepare at least 16 GB of system RAM; 32 GB is better – and the more, the better, especially when you work with multiple models.

Storage

Do prepare a large drive. By default, the Hugging Face package will download model data to a cache folder located on the system drive. If you only have 256 GB or 512 GB of storage, you will find it quickly running out. A 1 TB NVMe SSD is recommended, although 2 TB or more will be even better.
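If your system drive is small, one common workaround (not covered in the paragraph above) is to redirect the Hugging Face cache to a larger drive via the HF_HOME environment variable; the paths below are illustrative:

# Windows (Command Prompt): persist the setting for future sessions
setx HF_HOME "D:\hf_cache"

# Linux/macOS (bash/zsh): add this line to ~/.bashrc or ~/.zshrc
export HF_HOME="/mnt/bigdrive/hf_cache"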

Software requirements

Now that the hardware is prepared, Stable Diffusion requires additional software to support its execution and provide better control using Python. This section provides the steps to prepare the software environment.

CUDA installation

If you are using Microsoft Windows, please install Microsoft Visual Studio (VS) [5] first. VS will install all other dependent packages and binary files for CUDA. You can simply choose the latest Community version of VS for free.

Now, go to the NVIDIA CUDA download page [1] to get the CUDA installation file. The following screenshot shows an example of selecting CUDA for Windows 11:

Figure 2.1: Selecting the CUDA installation download file for Windows

Download the CUDA installation file, then double-click this file to install CUDA like any other Windows application.

If you are using a Linux operating system, installing CUDA for Linux is slightly different. You can execute the Bash script provided by NVIDIA to automate the installation. Here are the detailed steps:

1. It is better to uninstall all NVIDIA drivers first to ensure minimum errors, so if you have NVIDIA’s driver already installed, use the following commands to uninstall it:

sudo apt-get purge nvidia*
sudo apt-get autoremove

Then, reboot your system:

sudo reboot

2. Install GCC. The GNU Compiler Collection (GCC) is a set of compilers for various programming languages such as C, C++, Objective-C, Fortran, Ada, and others. It is an open source project developed by the GNU Project and is widely used for compiling and building software on Unix-like operating systems, including Linux. Without GCC installed, we will get errors during the CUDA installation. Install it with the following command:

sudo apt install gcc

3. Select the right CUDA version for your system on the CUDA download page [2]. The following screenshot shows an example of selecting CUDA for Ubuntu 22.04:

Figure 2.2: Selecting the CUDA installation download file for Linux

After your selection, the page will show you the command scripts that handle the entire installation process. Here is one example:

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-ubuntu2204.pin
sudo mv cuda-ubuntu2204.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/12.1.1/local_installers/cuda-repo-ubuntu2204-12-1-local_12.1.1-530.30.02-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu2204-12-1-local_12.1.1-530.30.02-1_amd64.deb
sudo cp /var/cuda-repo-ubuntu2204-12-1-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cuda

Note

The script may have been updated by the time you read this book. To avoid errors and potential installation failures, I would suggest opening the page and using the script that reflects your selection.
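Although not listed among the steps above, a quick sanity check after the installation finishes is to run the standard NVIDIA command-line tools, which ship with the driver and the CUDA toolkit:

nvidia-smi       # shows the detected GPU, driver version, and the CUDA version it supports
nvcc --version   # confirms the CUDA compiler toolchain is installed and on the PATH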

Installing Python for Windows, Linux, and macOS

We will first install Python for Windows.

Installing Python for Windows

You can visit https://www.python.org/ and download Python 3.9 or Python 3.10 to install it.

After years of manually downloading and clicking through the installation process, I found that using a package manager is quite useful to automate the installation. With a package manager, you write a script once, save it, and then the next time you need to install the software, all you have to do is run the same script in a terminal window. One of the best package managers for Windows is Chocolatey (https://chocolatey.org/).

Once you have Chocolatey installed, use the following command to install Python 3.10.6:

choco install python --version=3.10.6

Create a Python virtual environment:

pip install --upgrade --user pip
pip install virtualenv
python -m virtualenv venv_win_p310
venv_win_p310\Scripts\activate
python -m ensurepip
python -m pip install --upgrade pip

We will move on to the steps to install Python for Linux.

Installing Python for Linux

Let’s now install Python for Linux (Ubuntu). Follow these steps:

1. Install the required packages:

sudo apt-get install software-properties-common
sudo add-apt-repository ppa:deadsnakes/ppa
sudo apt-get update
sudo apt-get install python3.10
sudo apt-get install python3.10-dev
sudo apt-get install python3.10-distutils

2. Install pip:

curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
python3.10 get-pip.py

3. Create a Python virtual environment:

python3.10 -m pip install --user virtualenv
python3.10 -m virtualenv venv_ubuntu_p310
. venv_ubuntu_p310/bin/activate

Installing Python for macOS

If you are using a Mac with Apple silicon inside (an Apple M-series CPU), there is a high chance that you already have Python installed. You can test whether you have Python installed on your Mac with the following command:

python3 --version

If your machine doesn’t have a Python interpreter yet, you can install it with one simple command using Homebrew [7] like this: