LLVM was built to bridge the gap between the theoretical knowledge found in compiler textbooks and the practical demands of compiler development. With a modular codebase and advanced tools, LLVM empowers developers to build compilers with ease. This book serves as a practical introduction to LLVM, guiding you progressively through complex scenarios and ensuring that you navigate the challenges of building and working with compilers like a pro.
The book starts by showing you how to configure, build, and install LLVM libraries, tools, and external projects. You’ll then be introduced to LLVM's design, unraveling its applications in each compiler stage: frontend, optimizer, and backend. Using a real programming language subset, you'll build a frontend, generate LLVM IR, optimize it through the pipeline, and generate machine code. Advanced chapters extend your expertise, covering topics such as extending LLVM with a new pass, using LLVM tools for debugging, and enhancing the quality of your code. You'll also focus on just-in-time compilation issues and the current state of JIT-compilation support with LLVM. Finally, you’ll develop a new backend for LLVM, gaining insights into target description and how instruction selection works.
By the end of this book, you'll have hands-on experience with the LLVM compiler development framework through real-world examples and source code snippets.
Learn LLVM 17
A beginner’s guide to learning LLVM compiler tools and core libraries with C++
Kai Nacke
Amy Kwan
BIRMINGHAM—MUMBAI
Copyright © 2024 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Group Product Manager: Kunal Sawant
Publishing Product Manager: Teny Thomas
Book Project Manager: Prajakta Naik
Senior Editor: Ruvika Rao and Nithya Sadanandan
Technical Editor: Jubit Pincy
Copy Editor: Safis Editing
Indexer: Pratik Shirodkar
Production Designer: Vijay Kamble
DevRel Marketing Coordinator: Shrinidhi Manoharan
Business Development Executive: Kriti Sharma
First published: April 2021
Second edition: January 2024
Production reference: 1271223
Published by Packt Publishing Ltd.
Grosvenor House
11 St Paul’s Square
Birmingham
B3 1RB, UK.
ISBN 978-1-83763-134-6
www.packtpub.com
Writing a book takes time and energy. Without the support and understanding of my wife, Tanya, and my daughter Polina, this book would not have been possible. Thank you both for always encouraging me!
Because of some personal challenges, this project was at risk, and I am grateful to Amy for joining me as an author. Without her, the book would not be as good as it is now.
Once again, the team at Packt not only provided guidance on my writing but also showed an understanding of my slow writing, and always motivated me to carry on. I owe them a great thank you.
- Kai Nacke
2023 has been a very transformative year for me, and contributing my knowledge of LLVM to this book has been one of the reasons why this year has been so significant. I never would have thought that I would be approached by Kai to embark on this exciting journey to share LLVM 17 with you all! Thank you to Kai, for his technical mentorship and guidance, the team at Packt, and, of course, to my family and close loved ones for providing me with the support and motivation in writing this book.
- Amy Kwan
Kai Nacke is a professional IT architect currently residing in Toronto, Canada. He holds a diploma in computer science from the Technical University of Dortmund, Germany, and his diploma thesis on universal hash functions was recognized as the best of the semester.
With over 20 years of experience in the IT industry, Kai has extensive expertise in the development and architecture of business and enterprise applications. In his current role, he evolves an LLVM/clang-based compiler.
For several years, Kai served as the maintainer of LDC, the LLVM-based D compiler. He is the author of D Web Development and Learn LLVM 12, both published by Packt. In the past, he was a speaker in the LLVM developer room at the Free and Open Source Software Developers’ European Meeting (FOSDEM).
Amy Kwan is a compiler developer currently residing in Toronto, Canada. Originally from the Canadian prairies, Amy holds a Bachelor of Science in Computer Science from the University of Saskatchewan. In her current role, she leverages LLVM technology as a backend compiler developer. Previously, Amy was a speaker at the 2022 LLVM Developer Conference alongside Kai Nacke.
Akash Kothari is a Research Assistant at the Illinois LLVM Compiler Research Lab. He earned his Ph.D. in Computer Science from the University of Illinois at Urbana-Champaign. Specializing in performance engineering, program synthesis, and formal semantics and verification, Akash’s interests extend to exploring the history of computing and programming systems.
Shuo Niu, a Master of Engineering in computer engineering, is a dynamic force in the realm of compiler technology. With five prolific years at Intel PSG specializing in FPGA HLD compilers, he led innovations in the compiler middle-end optimizer. His expertise in developing cutting-edge features has empowered users to achieve remarkable performance enhancements on FPGA boards.
Constructing a compiler is a complex and fascinating task. The LLVM project provides reusable components for your compiler, and the LLVM core libraries implement a world-class optimizing code generator, which translates a source language-independent intermediate representation into machine code for all popular CPU architectures. The compilers for many programming languages already take advantage of LLVM technology.
This book teaches you how to implement your own compiler and how to use LLVM to achieve it. You will learn how the frontend of a compiler turns source code into an abstract syntax tree, and how to generate Intermediate Representation (IR) from it. Furthermore, you will also explore adding an optimization pipeline to your compiler, which allows you to compile the IR to performant machine code.
The LLVM framework can be extended in several ways, and you will learn how to add new passes, and even a completely new backend to LLVM. Advanced topics such as compiling for a different CPU architecture and extending clang and the clang static analyzer with your own plugins and checkers are also covered. This book follows a practical approach and is packed with example source code, which makes it easy to apply the gained knowledge within your own projects.
Learn LLVM 17 now features a new chapter dedicated to introducing the concept and syntax of the TableGen language used within LLVM, which readers can leverage to define classes, records, and an entire LLVM backend. Furthermore, this book also places an emphasis on backend development, discussing various new backend concepts that can be implemented for an LLVM backend, such as implementing the GlobalISel instruction selection framework and developing machine function passes.
This book is for compiler developers, enthusiasts, and engineers who are interested in learning about the LLVM framework. It is also useful for C++ software engineers looking to use compiler-based tools for code analysis and improvement, as well as casual users of LLVM libraries who want to gain more knowledge of LLVM essentials. Intermediate-level experience with C++ programming is required to understand the concepts covered in this book.
Chapter 1, Installing LLVM, explains how to set up and use your development environment. At the end of the chapter, you will have compiled the LLVM libraries and learned how to customize the build process.
Chapter 2, The Structure of a Compiler, gives you an overview of the components of a compiler. At the end of the chapter, you will have implemented your first compiler producing LLVM IR.
Chapter 3, Turning the Source File into an Abstract Syntax Tree, teaches you in detail how to implement the frontend of a compiler. You will create your own frontend for a small programming language, ending with the construction of an abstract syntax tree.
Chapter 4, Basics of IR Code Generation, shows you how to generate LLVM IR from an abstract syntax tree. At the end of the chapter, you will have implemented a compiler for the example language, emitting assembly text or object code files as a result.
Chapter 5, IR Generation for High-Level Language Constructs, illustrates how you translate source language features commonly found in high-level programming languages to LLVM IR. You will learn about the translation of aggregate data types, the various options to implement class inheritance and virtual functions, and how to comply with the application binary interface of your system.
Chapter 6, Advanced IR Generation, shows you how to generate LLVM IR for exception-handling statements in the source language. You will also learn how to add metadata for type-based alias analysis, and how to add debug information to the generated LLVM IR, and you will extend your compiler-generated metadata.
Chapter 7, Optimizing IR, explains the LLVM pass manager. You will implement your own pass, both as part of LLVM and as a plugin, and you will learn how to add your new pass to the optimizing pass pipeline.
Chapter 8, The TableGen Language, introduces LLVM’s own domain-specific language called TableGen. This language is used to reduce the coding effort of the developer, and you will learn about the different ways you can define data in the TableGen language, and how it can be leveraged in the backend.
Chapter 9, JIT Compilation, discusses how you can use LLVM to implement a just-in-time (JIT) compiler. By the end of the chapter, you will have implemented your own JIT compiler for LLVM IR in two different ways.
Chapter 10, Debugging Using LLVM Tools, explores the details of various libraries and components of LLVM, which help you identify bugs in your application. You will use the sanitizers to identify buffer overflows and other bugs. With the libFuzzer library, you will test functions with random data as input, and XRay will help you find performance bottlenecks. You will use the clang static analyzer to identify bugs at the source level, and you will learn that you can add your own checker to the analyzer. You will also learn how to extend clang with your own plugin.
Chapter 11, The Target Description, explains how you can add support for a new CPU architecture. This chapter discusses the necessary and optional steps like defining registers and instructions, developing instruction selection, and supporting the assembler and disassembler.
Chapter 12, Instruction Selection, demonstrates two different approaches to instruction selection, specifically explaining how SelectionDAG and GlobalISel work and showing how to implement these functionalities in a target, based on the example from the previous chapter. In addition, you will learn how to debug and test instruction selection.
Chapter 13, Beyond Instruction Selection, explains how you complete the backend implementation by exploring concepts beyond instruction selection. This includes adding new machine passes to implement target-specific tasks and points you to advanced topics that are not necessary for a simple backend but may be interesting for highly optimizing backends, such as cross-compilation to another CPU architecture.
You need a computer running Linux, Windows, Mac OS X, or FreeBSD, with the development toolchain installed for the operating system. Please see the following overview of the required tools. All tools should be in the search path of your shell.

Software/Hardware covered in the book (OS requirements: Linux (any), Windows, Mac OS X, or FreeBSD):

A C/C++ compiler: gcc 7.1.0 or later, clang 5.0 or later, Apple clang 10.0 or later, Visual Studio 2019 16.7 or later
CMake 3.20.0 or later
Ninja 1.11.1
Python 3.6 or later
Git 2.39.1 or later
To create the flame graph in Chapter 10, Debugging Using LLVM Tools, you need to install the scripts from https://github.com/brendangregg/FlameGraph. To run the script, you also need a recent version of Perl installed, and to view the graph you need a web browser capable of displaying SVG files, which all modern browsers do. To see the Chrome Trace Viewer visualization in the same chapter, you need to have the Chrome browser installed.
If you are using the digital version of this book, we advise you to type the code yourself or access the code via the GitHub repository (link available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.
You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Learn-LLVM-17. In case there’s an update to the code, it will be updated on the existing GitHub repository.
We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
There are a number of text conventions used throughout this book.
Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: “Most notably, the LLVM core libraries are in the llvm subdirectory.”
A block of code is set as follows:
#include "llvm/IR/IRPrintingPasses.h" #include "llvm/IR/LegacyPassManager.h" #include "llvm/Support/ToolOutputFile.h"When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:
switch (Kind) {
  // Many more cases
  case m88k: return "m88k";
}

Bold: Indicates a new term, an important word, or words that you see onscreen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: “For development on OS X, it is best to install Xcode from the Apple store.”
Tips or important notes
Appear like this.
Feedback from our readers is always welcome.
General feedback: If you have questions about any aspect of this book, mention the book title in the subject of your message and email us at [email protected].
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata, selecting your book, clicking on the Errata Submission Form link, and entering the details.
Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.
Once you’ve read Learn LLVM 17, we’d love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.
Your review is important to us and the tech community and will help us make sure we’re delivering excellent quality content.
Thanks for purchasing this book!
Do you like to read on the go but are unable to carry your print books everywhere?
Is your eBook purchase not compatible with the device of your choice?
Don’t worry, now with every Packt book you get a DRM-free PDF version of that book at no cost.
Read anywhere, any place, on any device. Search, copy, and paste code from your favorite technical books directly into your application.
The perks don’t stop there, you can get exclusive access to discounts, newsletters, and great free content in your inbox daily
Follow these simple steps to get the benefits:
Scan the QR code or visit the link below:

https://packt.link/free-ebook/9781837631346
Submit your proof of purchase. That’s it! We’ll send your free PDF and other benefits to your email directly.

In this section, you will learn how to compile LLVM by yourself and tailor the build to your needs. You will understand how LLVM projects are organized, and you will create your first project utilizing LLVM. Finally, you will explore the overall structure of a compiler, while creating a small compiler yourself.
This section comprises the following chapters:
Chapter 1, Installing LLVM
Chapter 2, The Structure of a Compiler

In order to learn how to work with LLVM, it is best to begin by compiling LLVM from source. LLVM is an umbrella project and the GitHub repository contains the source of all projects belonging to LLVM. Each LLVM project is in a top-level directory of the repository. Besides cloning the repository, your system must also have all tools installed that are required by the build system. In this chapter, you will learn about the following topics:
Getting the prerequisites ready, which will show you how to set up your build system
Cloning the repository and building from source, which will cover how to get the LLVM source code, and how to compile and install the LLVM core libraries and clang with CMake and Ninja
Customizing the build process, which will talk about the various possibilities for influencing the build process

You can install LLVM binaries from various sources. If you are using Linux, then your distribution contains the LLVM libraries. Why bother compiling LLVM yourself?
First, not all install packages contain all the files required for developing with LLVM. Compiling and installing LLVM yourself prevents this problem. Another reason stems from the fact that LLVM is highly customizable. With building LLVM, you learn how you can customize LLVM, and this will enable you to diagnose problems that may arise if you bring your LLVM application to another platform. And last, in the third part of this book, you will extend LLVM itself, and for this, you need the skill of building LLVM yourself.
However, it is perfectly fine to avoid compiling LLVM for the first steps. If you want to go on this route, then you only need to install the prerequisites as described in the next section.
Note
Many Linux distributions split LLVM into several packages. Please make sure that you install the development package. For example, on Ubuntu, you need to install the llvm-dev package. Please also make sure that you install LLVM 17. For other versions, the examples in this book may require changes.
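As an illustration only (package names and available versions differ between distributions and releases), installing the LLVM 17 development packages on a recent Ubuntu might look like the following; on older releases, the packages come from the apt.llvm.org repository instead:

$ sudo apt install llvm-17 llvm-17-dev clang-17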
To work with LLVM, your development system should run a common operating system such as Linux, FreeBSD, macOS, or Windows. You can build LLVM and clang in different modes. A build with debug symbols enabled can take up to 30 GB of space. The required disk space depends heavily on the chosen build options. For example, building only the LLVM core libraries in release mode, targeting only one platform, requires about 2 GB of free disk space, which is the bare minimum needed.
To reduce compile times, a fast CPU (such as a quad-core CPU with a 2.5 GHz clock speed) and a fast SSD are also helpful. It is even possible to build LLVM on a small device such as a Raspberry Pi – it only takes a lot of time. The examples within this book were developed on a laptop with an Intel quad-core CPU running at a 2.7 GHz clock speed, with 40 GB RAM and 2.5 TB SSD disk space. This system is well suited for the development task.
Your development system must have some prerequisite software installed. Let’s review the minimal required version of these software packages.
To check out the source from GitHub, you need Git (https://git-scm.com/). There is no requirement for a specific version. The GitHub help pages recommend using at least version 1.17.10. Due to known security issues found in the past, it is recommended to use the latest available version, which is 2.39.1 at the time of writing.
The LLVM project uses CMake (https://cmake.org/) as the build file generator. At least the 3.20.0 version is required. CMake can generate build files for various build systems. In this book, Ninja (https://ninja-build.org/) is used because it is fast and available on all platforms. The latest version, 1.11.1, is recommended.
Obviously, you also need a C/C++ compiler. The LLVM projects are written in modern C++, based on the C++17 standard. A conforming compiler and standard library are required. The following compilers are known to work with LLVM 17:
gcc 7.1.0 or later
clang 5.0 or later
Apple clang 10.0 or later
Visual Studio 2019 16.7 or later

Tip
Please be aware that with further development of the LLVM project, the requirements for the compiler are most likely to change. In general, you should use the latest compiler version available for your system.
Python (https://python.org/) is used during the generation of the build files and for running the test suite. It should be at least the 3.8 version.
Although not covered in this book, there can be reasons that you need to use Make instead of Ninja. In this case, you need to use GNU Make (https://www.gnu.org/software/make/) version 3.79 or later. The usage of both build tools is very similar. It is sufficient to replace ninja in each command with make for the scenarios described below.
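For illustration, and assuming the build steps shown later in this chapter, generating Make-based build files would look like the following; the later cmake --build commands then work unchanged, because CMake invokes whichever build tool was chosen at generation time:

$ cmake -G "Unix Makefiles" -DCMAKE_BUILD_TYPE=Release \
  -DLLVM_ENABLE_PROJECTS=clang -B build -S llvm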
LLVM also depends on the zlib library (https://www.zlib.net/). You should have at least version 1.2.3.4 installed. As usual, we recommend using the latest version, 1.2.13.
To install the prerequisite software, the easiest way is to use the package manager from your operating system. In the following sections, the commands required to install the software are shown for the most popular operating systems.
Ubuntu 22.04 uses the apt package manager. Most of the basic utilities are already installed; only the development tools are missing. To install all packages at once, you type the following:
$ sudo apt -y install gcc g++ git cmake ninja-build zlib1g-dev

The package manager of Fedora 37 and RedHat Enterprise Linux 9 is called dnf. Like Ubuntu, most of the basic utilities are already installed. To install all packages at once, you type the following:
$ sudo dnf -y install gcc gcc-c++ git cmake ninja-build \
  zlib-devel

On FreeBSD 13 or later, you have to use the pkg package manager. FreeBSD differs from Linux-based systems in that the clang compiler is already installed. To install all other packages at once, you type the following:
$ sudo pkg install -y git cmake ninja zlib-ng

For development on OS X, it is best to install Xcode from the Apple store. While the Xcode IDE is not used in this book, it comes with the required C/C++ compilers and supporting utilities. For installation of the other tools, the package manager Homebrew (https://brew.sh/) can be used. To install all packages at once, you type the following:
$ brew install git cmake ninja zlib

Like OS X, Windows does not come with a package manager. For the C/C++ compiler, you need to download Visual Studio Community 2022 (https://visualstudio.microsoft.com/vs/community/), which is free for personal use. Please make sure that you install the workload named Desktop Development with C++. You can use the package manager Scoop (https://scoop.sh/) to install the other packages. After installing Scoop as described on the website, you open x64 Native Tools Command Prompt for VS 2022 from your Windows menu. To install the required packages, you type the following:
$ scoop install git cmake ninja python gzip bzip2 coreutils
$ scoop bucket add extras
$ scoop install zlib

Please watch the output from Scoop closely. For the Python and zlib packages, it advises adding some registry keys. These entries are needed so that other software can find these packages. To add the registry keys, you’d best copy and paste the output from Scoop, which looks like the following:
$ %HOMEPATH%\scoop\apps\python\current\install-pep-514.reg
$ %HOMEPATH%\scoop\apps\zlib\current\register.reg

After each command, a message window from the registry editor will pop up asking whether you really want to import those registry keys. You need to click on Yes to finish the import. Now all prerequisites are installed.
For all examples in this book, you must use the x64 Native Tools Command Prompt for VS 2022. Using this command prompt, the compiler is automatically added to the search path.
Tip
The LLVM code base is very large. To comfortably navigate the source, we recommend using an IDE that allows you to jump to the definition of classes and search through the source. We find Visual Studio Code (https://code.visualstudio.com/download), which is an extensible cross-platform IDE, very comfortable to use. However, this is no requirement for following the examples in this book.
With the build tools ready, you can now check out all LLVM projects from GitHub and build LLVM. This process is essentially the same on all platforms:
Configure Git.
Clone the repository.
Create the build directory.
Generate the build system files.
Finally, build and install LLVM.

Let’s begin with configuring Git.
The LLVM project uses Git for version control. If you have not used Git before, then you should do some basic configuration of Git first before continuing: setting the username and email address. Both pieces of information are used when you commit changes.
You can check whether you already have an email and username configured in Git with the following commands:
$ git config user.email
$ git config user.name

The preceding commands will output the respective email and username that you already have set when using Git. However, in the event that you are setting the username and email for the first time, the following commands can be entered for first-time configuration. In the following commands, you can simply replace Jane with your name and [email protected] with your email:
$ git config --global user.email "[email protected]"
$ git config --global user.name "Jane"

These commands change the global Git configuration. Inside a Git repository, you can locally overwrite those values by not specifying the --global option.
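For example, to override only the username for a single repository, you can run the same command without --global from inside that repository (the name here is just a placeholder):

$ git config user.name "Jane"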
By default, Git uses the vi editor for commit messages. If you prefer another editor, then you can change the configuration in a similar way. To use the nano editor, you type the following:
$ git config --global core.editor nano

For more information about Git, please see the Git Version Control Cookbook (https://www.packtpub.com/product/git-version-control-cookbook-second-edition/9781789137545).
Now you are ready to clone the LLVM repository from GitHub.
The command to clone the repository is essentially the same on all platforms. Only on Windows, it is recommended to turn off the auto-translation of line endings.
On all non-Windows platforms, you type the following command to clone the repository:
$ git clone https://github.com/llvm/llvm-project.git

Only on Windows, add the option to disable auto-translation of line endings. Here, you type the following:
$ git clone --config core.autocrlf=false \
  https://github.com/llvm/llvm-project.git

This Git command clones the latest source code from GitHub into a local directory named llvm-project. Now change the current directory into the new llvm-project directory with the following command:
$ cd llvm-project

Inside the directory are all LLVM projects, each one in its own directory. Most notably, the LLVM core libraries are in the llvm subdirectory. The LLVM project uses branches for subsequent release development (“release/17.x”) and tags (“llvmorg-17.0.1”) to mark a certain release. With the preceding clone command, you get the current development state. This book uses LLVM 17. To check out the first release of LLVM 17 into a branch called llvm-17, you type the following:
$ git checkout -b llvm-17 llvmorg-17.0.1

With the previous steps, you cloned the whole repository and created a branch from a tag. This is the most flexible approach.
Git also allows you to clone only a branch or a tag (including history). With git clone --branch release/17.x https://github.com/llvm/llvm-project, you only clone the release/17.x branch and its history. You then have the latest state of the LLVM 17 release branch, so you only need to create a branch from the release tag like before if you need the exact release version. With the additional --depth=1 option, which is known as a shallow clone with Git, you prevent the cloning of the history, too. This saves time and space but obviously limits what you can do locally, including checking out a branch based on the release tags.
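Combining both options gives the smallest possible checkout. As a sketch (adjust the branch name to the release you are interested in), the command looks like this:

$ git clone --depth=1 --branch release/17.x \
  https://github.com/llvm/llvm-project.git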
Unlike many other projects, LLVM does not support inline builds and requires a separate build directory. Most easily, this is created inside the llvm-project directory, which is your current directory. Let us name the build directory, build, for simplicity. Here, the commands for Unix and Windows systems differ. On a Unix-like system, you use the following:
$ mkdir build

And on Windows, use the following:
$ md build

Now you are ready to create the build system files with the CMake tool inside this directory.
In order to generate build system files to compile LLVM and clang using Ninja, you run the following:
$ cmake -G Ninja -DCMAKE_BUILD_TYPE=Release \
  -DLLVM_ENABLE_PROJECTS=clang -B build -S llvm

The -G option tells CMake for which system to generate build files. Often-used values for that option are as follows:
Ninja – for the Ninja build system
Unix Makefiles – for GNU Make
Visual Studio 17 2022 – for Visual Studio and MS Build
Xcode – for Xcode projects

With the -B option, you tell CMake the path of the build directory. Similarly, you specify the source directory with the -S option. The generation process can be influenced by setting various variables with the -D option. Usually, they are prefixed with CMAKE_ (if defined by CMake) or LLVM_ (if defined by LLVM).
As mentioned previously, we are also interested in compiling clang alongside LLVM. The LLVM_ENABLE_PROJECTS=clang variable setting allows CMake to generate the build files for clang in addition to LLVM. Furthermore, the CMAKE_BUILD_TYPE=Release variable tells CMake that it should generate build files for a release build.
The default value for the -G option depends on your platform, and the default value for the build type depends on the toolchain. However, you can define your own preference with environment variables. The CMAKE_GENERATOR variable controls the generator, and the CMAKE_BUILD_TYPE variable specifies the build type. If you use bash or a similar shell, then you can set the variables with the following:
$ export CMAKE_GENERATOR=Ninja
$ export CMAKE_BUILD_TYPE=Release

If you are using the Windows command prompt instead, then you set the variables with the following:
$ set CMAKE_GENERATOR=Ninja
$ set CMAKE_BUILD_TYPE=Release

With these settings, the command to create the build system files becomes the following, which is easier to type:
$ cmake -DLLVM_ENABLE_PROJECTS=clang -B build -S llvm

You will find more about CMake variables in the Customizing the build process section.
After the build files are generated, LLVM and clang can be compiled with the following:
$ cmake --build build

This command runs Ninja under the hood because we told CMake to generate Ninja files in the configuration step. However, if you generate build files for a system such as Visual Studio, which supports multiple build configurations, then you need to specify the configuration to use for the build with the --config option. Depending on the hardware resources, this command runs for between 15 minutes (server with lots of CPU cores, memory, and fast storage) and several hours (dual-core Windows notebook with limited memory).
By default, Ninja utilizes all available CPU cores. This is good for the speed of compilation but may prevent other tasks from running; for example, on a Windows-based notebook, it is almost impossible to surf the internet while Ninja is running. Fortunately, you can limit the resource usage with the -j option.
Let’s assume you have four CPU cores available and Ninja should only use two (because you have parallel tasks to run); you then use this command for compilation:
$ cmake --build build -j2

After compilation is finished, a best practice is to run the test suite to check whether everything works as expected:
$ cmake --build build --target check-all

Again, the runtime of this command varies widely with the available hardware resources. The check-all Ninja target runs all test cases. Targets are generated for each directory containing test cases. Using check-llvm instead of check-all runs the LLVM tests but not the clang tests; check-llvm-codegen runs only the tests in the CodeGen directory from LLVM (that is, the llvm/test/CodeGen directory).
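For example, to run only the code generation tests mentioned above, you name that target instead of check-all:

$ cmake --build build --target check-llvm-codegen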
You can also do a quick manual check. One of the LLVM applications is llc, the LLVM compiler. If you run it with the -version option, it shows the LLVM version, the host CPU, and all supported architectures:
$ build/bin/llc --version

If you have trouble getting LLVM compiled, then you should consult the Common Problems section of the Getting Started with the LLVM System documentation (https://releases.llvm.org/17.0.1/docs/GettingStarted.html#common-problems) for solutions to typical problems.
As the last step, you can install the binaries:
$ cmake --install build

On a Unix-like system, the install directory is /usr/local. On Windows, C:\Program Files\LLVM is used. This can be changed, of course. The next section explains how.
The CMake system uses a project description in the CMakeLists.txt file. The top-level file is in the llvm directory, llvm/CMakeLists.txt. Other directories also have CMakeLists.txt files, which are recursively included during the generation process.
Based on the information provided in the project description, CMake checks which compilers are installed, detects libraries and symbols, and creates the build system files, for example, build.ninja or Makefile (depending on the chosen generator). It is also possible to define reusable modules, for example, a function to detect whether LLVM is installed. These scripts are placed in the special cmake directory (llvm/cmake), which is searched automatically during the generation process.
The build process can be customized with the definition of CMake variables. The command-line option -D is used to set a variable to a value. The variables are used in the CMake scripts. Variables defined by CMake itself are almost always prefixed with CMAKE_ and these variables can be used in all projects. Variables defined by LLVM are prefixed with LLVM_ but they can only be used if the project definition includes the use of LLVM.
Some variables are initialized with the value of environment variables. Most notable are CC and CXX, which define the C and C++ compilers to be used for building. CMake tries to locate a C and a C++ compiler automatically, using the current shell search path. It picks the first compiler found. If you have several compilers installed, for example, gcc and clang or different versions of clang, then this might not be the compiler you want for building LLVM.
Suppose you would like to use clang17 as a C compiler and clang++17 as a C++ compiler. Then, you can invoke CMake in a Unix shell in the following way:
$ CC=clang17 CXX=clang++17 cmake -B build -S llvm

This sets the value of the environment variables only for the invocation of cmake. If necessary, you can specify an absolute path for the compiler executables.
CC is the default value of the CMAKE_C_COMPILER CMake variable, and CXX is the default value of the CMAKE_CXX_COMPILER CMake variable. Instead of using the environment variables, you can set the CMake variables directly. This is equivalent to the preceding call:
$ cmake -DCMAKE_C_COMPILER=clang17 \
  -DCMAKE_CXX_COMPILER=clang++17 -B build -S llvm

Other useful variables defined by CMake are as follows:
Variable name
Purpose
CMAKE_INSTALL_PREFIX
This is a path prefix that is prepended to every path during installation. The default is /usr/local on Unix and C:\Program Files\<Project> on Windows. To install LLVM in the /opt/llvm directory, you specify -DCMAKE_INSTALL_PREFIX=/opt/llvm. The binaries are copied to /opt/llvm/bin, library files to /opt/llvm/lib, and so on.
CMAKE_BUILD_TYPE
Different types of build require different settings. For example, a debug build needs to specify options to generate debug symbols and usually link against debug versions of system libraries. In contrast, a release build uses optimization flags and links against production versions of libraries. This variable is only used for build systems that can only handle one build type, for example, Ninja or Make. For IDE build systems, all variants are generated and you have to use the mechanism of the IDE to switch between build types. Possible values are as follows:
DEBUG: build with debug symbols
RELEASE: build with optimization for speed
RELWITHDEBINFO: release build with debug symbols
MINSIZEREL: build with optimization for size
The default build type is taken from the CMAKE_BUILD_TYPE environment variable. If this variable is not set, then the default depends on the used toolchain and is often empty. In order to generate build files for a release build, you specify -DCMAKE_BUILD_TYPE=RELEASE.
CMAKE_C_FLAGS
CMAKE_CXX_FLAGS
These are extra flags used when compiling C and C++ source files. The initial values are taken from the CFLAGS and CXXFLAGS environment variables, which can be used as an alternative.
CMAKE_MODULE_PATH
This specifies additional directories that are searched for CMake modules. The specified directories are searched before the default ones. The value is a semicolon-separated list of directories.
PYTHON_EXECUTABLE
If the Python interpreter is not found or if the wrong one is picked in case you have installed multiple versions, you can set this variable to the path of the Python binary. This variable only has an effect if the Python module of CMake is included (which is the case for LLVM).
Table 1.1 - Additional useful variables provided by CMake
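To illustrate how these variables combine, the following configuration (using the /opt/llvm prefix from the table as an example) creates release build files, builds, and installs into the chosen directory:

$ cmake -G Ninja -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_INSTALL_PREFIX=/opt/llvm -B build -S llvm
$ cmake --build build
$ cmake --install build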
CMake provides built-in help for variables. The --help-variable var option prints help for the var variable. For instance, you can type the following to get help for CMAKE_BUILD_TYPE:
$ cmake --help-variable CMAKE_BUILD_TYPE

You can also list all variables with the following command:
$ cmake --help-variable-list

This list is very long. You may want to pipe the output to more or a similar program.
The build configuration variables defined by LLVM work in the same way as those defined by CMake except that there is no built-in help. The most useful variables are found in the following tables, where they are divided into variables that are useful for first-time users installing LLVM, and also variables for more advanced LLVM users.
Variable name
Purpose
LLVM_TARGETS_TO_BUILD
LLVM supports code generation for different CPU architectures. By default, all these targets are built. Use this variable to specify the list of targets to build, separated by semicolons. The current targets are AArch64, AMDGPU, ARM, AVR, BPF, Hexagon, Lanai, LoongArch, Mips, MSP430, NVPTX, PowerPC, RISCV, Sparc, SystemZ, VE, WebAssembly, X86, and XCore. all can be used as shorthand for all targets. The names are case-sensitive. To only enable the PowerPC and the System Z target, you specify -DLLVM_TARGETS_TO_BUILD="PowerPC;SystemZ".
LLVM_EXPERIMENTAL_TARGETS_TO_BUILD
In addition to the official targets, the LLVM source tree also contains experimental targets. These targets are under development and often do not yet support the full functionality of a backend. The current list of experimental targets is ARC, CSKY, DirectX, M68k, SPIRV, and Xtensa. To build the M68k target, you specify -D LLVM_EXPERIMENTAL_TARGETS_TO_BUILD=M68k.
LLVM_ENABLE_PROJECTS
This is a list of the projects you want to build, separated by semicolons. The source for the projects must be on the same level as the llvm directory (side-by-side layout). The current list is bolt, clang, clang-tools-extra, compiler-rt, cross-project-tests, libc, libclc, lld, lldb, mlir, openmp, polly, and pstl. all can be used as shorthand for all projects in this list. Additionally, you can specify the flang project here. Due to some special build requirements, it is not yet part of the all list.
To build clang and bolt together with LLVM, you specify -DLLVM_ENABLE_PROJECTS="clang;bolt".
Table 1.2 - Useful variables for first-time LLVM users
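As an example of using these variables to shorten the build, the following configuration (the chosen target and project are only examples) builds clang together with LLVM, but only the X86 code generator:

$ cmake -G Ninja -DCMAKE_BUILD_TYPE=Release \
  -DLLVM_TARGETS_TO_BUILD=X86 \
  -DLLVM_ENABLE_PROJECTS=clang -B build -S llvm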
LLVM_ENABLE_ASSERTIONS
If set to ON, then assertion checks are enabled. These checks help to find errors and are very useful during development. The default value is ON for a DEBUG build and otherwise OFF. To turn assertion checks on (e.g. for a RELEASE build), you specify -DLLVM_ENABLE_ASSERTIONS=ON.
LLVM_ENABLE_EXPENSIVE_CHECKS
This enables some expensive checks that can really slow down compilation speed or consume large amounts of memory. The default value is OFF. To turn these checks on, you specify -DLLVM_ENABLE_EXPENSIVE_CHECKS=ON.
LLVM_APPEND_VC_REV
LLVM tools such as llc display, besides other information, the LLVM version they are based on if the --version command-line option is given. This version information is based on the LLVM_REVISION C macro. By default, not only the LLVM version but also the current Git hash is part of the version information. This is handy in case you are following the development of the main branch because it makes clear on which Git commit the tool is based. If not needed, then this can be turned off with -DLLVM_APPEND_VC_REV=OFF.
LLVM_ENABLE_THREADS
LLVM automatically includes thread support if a threading library is detected (usually the pthreads library). Further, LLVM assumes in this case that the compiler supports TLS (thread-local storage). If you don’t want thread support or your compiler does not support TLS, then you can turn it off with -DLLVM_ENABLE_THREADS=OFF.
LLVM_ENABLE_EH
The LLVM projects do not use C++ exception handling and therefore turn exception support off by default. This setting can be incompatible with other libraries your project is linking with. If needed, you can enable exception support by specifying -DLLVM_ENABLE_EH=ON.
LLVM_ENABLE_RTTI
LLVM uses a lightweight, self-built system for runtime type information. The generation of C++ RTTI is turned off by default. Like the exception handling support, this may be incompatible with other libraries. To turn generation of C++ RTTI on, you specify -DLLVM_ENABLE_RTTI=ON.
LLVM_ENABLE_WARNINGS
Compiling LLVM should generate no warning messages if possible. The option to print warning messages is therefore turned on by default. To turn it off, you specify -DLLVM_ENABLE_WARNINGS=OFF.
LLVM_ENABLE_PEDANTIC
The LLVM source should be C/C++ language standard-conforming; hence, pedantic checking of the source is enabled by default. If possible, compiler-specific extensions are also disabled. To reverse this setting, you specify -DLLVM_ENABLE_PEDANTIC=OFF.
LLVM_ENABLE_WERROR
If set to ON, then all warnings are treated as errors – the compilation aborts as soon as warnings are found. It helps to find all remaining warnings in the source. By default, it is turned off. To turn it on, you specify -DLLVM_ENABLE_WERROR=ON.
LLVM_OPTIMIZED_TABLEGEN
Usually, the tablegen tool is built with the same options as all other parts of LLVM. At the same time, tablegen is used to generate large parts of the code generator. As a result, tablegen is much slower in a debug build, increasing the compile time noticeably. If this option is set to ON, then tablegen is compiled with optimization turned on even for a debug build, possibly reducing compile time. The default is OFF. To turn it on, you specify -DLLVM_OPTIMIZED_TABLEGEN=ON.
LLVM_USE_SPLIT_DWARF
If the build compiler is gcc or clang, then turning on this option will instruct the compiler to generate the DWARF debug information in a separate file. The reduced size of the object files reduces the link time of debug builds significantly. The default is OFF. To turn it on, you specify -DLLVM_USE_SPLIT_DWARF=ON.
Table 1.3 - Useful variables for advanced LLVM users
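As a sketch of how the advanced variables can be combined (the selection here is only an example), a debug build that compiles tablegen with optimization and uses split DWARF debug information could be configured like this:

$ cmake -G Ninja -DCMAKE_BUILD_TYPE=Debug \
  -DLLVM_OPTIMIZED_TABLEGEN=ON \
  -DLLVM_USE_SPLIT_DWARF=ON -B build -S llvm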
Note
LLVM defines many more CMake variables. You can find the complete list in the LLVM documentation about CMake (https://releases.llvm.org/17.0.1/docs/CMake.html#llvm-specific-variables). The preceding list contains only the ones you are most likely to need.
In this chapter, you prepared your development machine to compile LLVM. You cloned the GitHub repository and compiled your own version of LLVM and clang. The build process can be customized with CMake variables. You learned about useful variables and how to change them. Equipped with this knowledge, you can tweak LLVM to your needs.
In the next section, we will be taking a closer look at the structure of a compiler. We will be exploring the different components found inside the compiler, as well as different types of analyses that occur in it – specifically, the lexical, syntactical, and semantic analyses. Finally, we will also briefly touch on interfacing with an LLVM backend for code generation.
Compiler technology is a well-studied field of computer science. The high-level task is to translate a source language into machine code. Typically, this task is divided into three parts: the frontend, the middle end, and the backend. The frontend deals mainly with the source language, while the middle end performs transformations to improve the code and the backend is responsible for the generation of machine code. Since the LLVM core libraries provide the middle end and the backend, we will focus on the frontend within this chapter.
In this chapter, you will cover the following sections and topics:
Building blocks of a compiler, in which you will learn about the components typically found in a compiler
An arithmetic expression language, which will introduce you to an example language and show how grammar is used to define a language
Lexical analysis, which discusses how to implement a lexer for the language
Syntactical analysis, which covers the construction of a parser from the grammar
Semantic analysis, in which you will learn how a semantic check can be implemented
Code generation with the LLVM backend, which discusses how to interface with the LLVM backend and glue all the preceding phases together to create a complete compiler

Since computers became available, thousands of programming languages have been developed. It turns out that all compilers must solve the same tasks and that the implementation of a compiler is best structured according to these tasks. At a high level, there are three components. The frontend turns the source code into an intermediate representation (IR). Then the middle end performs transformations on the IR, with the goal of either improving performance or reducing the size of the code. Finally, the backend produces machine code from the IR. The LLVM core libraries provide a middle end consisting of very sophisticated transformations and backends for all popular platforms. Furthermore, the LLVM core libraries also define an intermediate representation used as input for the middle end and the backend. This design has the advantage that you only need to care about the frontend for the programming language you want to implement.
The input for the frontend is the source code, usually a text file. To make sense of it, the frontend first identifies the words of the language, such as numbers and identifiers, which are usually called tokens. This step is performed by the lexer. Next, the syntactical structure formed by the tokens is analyzed. The so-called parser performs this step, and the result is the abstract syntax tree (AST). Last, the frontend needs to check that the rules of the programming language are obeyed, which is done by the semantic analyzer. If no errors were detected, then the AST is transformed into IR and handed over to the middle end.
In the following sections, we will construct a compiler for an expression language, which produces LLVM IR from its input. The LLVM llc static compiler, representing the backend, can then be used to compile the IR into object code. It all begins with defining the language. Keep in mind that all of the C++ implementation files for this chapter will be contained in a directory called src/.
Arithmetic expressions are a part of every programming language. Here is an example of an arithmetic expression calculation language called calc. The calc expressions are compiled into an application that evaluates the following expression:
with a, b: a * (4 + b)

The used variables in the expression must be declared with the with keyword. This program is compiled into an application that asks the user for the values of the a and b variables and prints the result.
Examples are always welcome but, as a compiler writer, you need a more thorough specification than this for implementation and testing. The vehicle for the syntax of the programming language is the grammar.
The elements of a language, for example, keywords, identifiers, strings, numbers, and operators, are called tokens. In this sense, a program is a sequence of tokens, and the grammar specifies which sequences are valid.
Usually, grammar is written in the extended Backus-Naur form (EBNF). A rule in grammar has a left and a right side. The left side is just a single symbol called non-terminal. The right side of a rule consists of non-terminals, tokens, and meta-symbols for alternatives and repetitions. Let’s have a look at the grammar of the calc language:
calc : ("with" ident ("," ident)* ":")? expr ; expr : term (( "+" | "-" ) term)* ; term : factor (( "*" | "/") factor)* ; factor : ident | number | "(" expr ")" ; ident : ([a-zAZ])+ ; number : ([0-9])+ ;In the first line, calc is a non-terminal. If not otherwise stated, then the first non-terminal of a grammar is the start symbol. The colon (:) is the separator between the left and the right side of the rule. Here, "with", "," and ":" are tokens that represent this string. Parentheses are used for grouping. A group can be optional or repeated. A question mark (?) after the closing parenthesis denotes an optional group. A star * denotes zero or more repetitions and a plus + denotes one or more repetitions. Ident and expr are non-terminals. For each of them, another rule exists. The semicolon (;) marks the end of a rule. The pipe |, in the second line, denotes an alternative. And last, the brackets [ ], in the last two lines, denote a character class. The valid characters are written inside the brackets. For example, the character class [a-zA-Z] matches an upper- or lower-case letter, and ([a-zA-Z])+ matches one or more of these letters. This corresponds to a regular expression.
Such grammar may look like a theoretical toy, but it is of value to the compiler writer. First, all the tokens are defined, which is needed to create the lexical analyzer. The rules of the grammar can be translated into the parser. And of course, if questions arise about whether the parser works correctly, then the grammar serves as a good specification.
However, grammar does not define all aspects of a programming language. The meaning – the semantics – of the syntax must also be defined. Formalisms for this purpose were developed, too, but very often, they are specified in plain text, as they were usually drawn up at the initial introduction of the language.
Equipped with this knowledge, the next two sections show how the lexical analysis turns the input into a sequence of tokens and how the grammar is coded in C++ for the syntactical analysis.
As already seen in the example in the previous section, a programming language consists of many elements such as keywords, identifiers, numbers, operators, and so on. The task of the lexical analyzer is to take the textual input and create a sequence of tokens from it. The calc language consists of the tokens with, :, +, -, *, /, (, ), and regular expressions ([a-zA-Z])+ (an identifier) and ([0-9])+ (a number). We assign a unique number to each token to make the handling of tokens easier.
The implementation of a lexical analyzer is often called Lexer. Let’s create a header file called Lexer.h and get started with the definition of Token. It begins with the usual header guard and the inclusion of the required headers:
#ifndef LEXER_H
#define LEXER_H

#include "llvm/ADT/StringRef.h"
#include "llvm/Support/MemoryBuffer.h"
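To give an impression of where this header is heading, here is a minimal sketch of how the token kinds could be enumerated, one enumerator per token of the calc language described above; the exact names and the surrounding class layout are illustrative and will be refined as the lexer is developed in the following pages:

class Lexer;

class Token {
  friend class Lexer;

public:
  // One unique value per token of the calc language.
  enum TokenKind : unsigned short {
    eoi,      // end of input
    unknown,  // character that does not belong to any token
    ident,
    number,
    comma,
    colon,
    plus,
    minus,
    star,
    slash,
    l_paren,
    r_paren,
    KW_with
  };
};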