LLVM was built to bridge the gap between compiler textbooks and actual compiler development. It provides a modular codebase and advanced tools that help developers build compilers easily. This book provides a practical introduction to LLVM, gradually helping you navigate complex scenarios with ease when building and working with compilers.
You’ll start by configuring, building, and installing LLVM libraries, tools, and external projects. Next, the book will introduce you to LLVM design and how it works in practice during each LLVM compiler stage: frontend, optimizer, and backend. Using a subset of a real programming language as an example, you will then learn how to develop a frontend and generate LLVM IR, hand it over to the optimization pipeline, and generate machine code from it. Later chapters will show you how to extend LLVM with a new pass and how instruction selection in LLVM works. You’ll also focus on Just-in-Time compilation issues and the current state of JIT-compilation support that LLVM provides, before finally going on to understand how to develop a new backend for LLVM.
By the end of this LLVM book, you will have gained real-world experience in working with the LLVM compiler development framework with the help of hands-on examples and source code snippets.
A beginner's guide to learning LLVM compiler tools and core libraries with C++
Kai Nacke
BIRMINGHAM—MUMBAI
Copyright © 2021 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Group Product Manager: Aaron Lazar
Publishing Product Manager: Shweta Bairoliya
Senior Editor: Ruvika Rao
Content Development Editor: Nithya Sadanandan
Technical Editor: Gaurav Gala
Copy Editor: Safis Editing
Project Coordinator: Deeksha Thakkar
Proofreader: Safis Editing
Indexer: Manju Arasan
Production Designer: Roshan Kawale
First published: April 2021
Production reference: 1290421
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.
978-1-83921-350-2
www.packt.com
Writing a book is a challenging task, especially when you are planning to move to Canada and suddenly a pandemic hits the world and changes everything. The team at Packt not only provided guidance on writing, but also showed understanding for my slow writing, and always motivated me to carry on. I owe them a great thank you.
Without the support of my family, this book would not have been possible. Thanks for putting all that faith in me!
Kai Nacke is a professional IT architect currently living in Toronto, Canada. He holds a diploma in computer science from the Technical University of Dortmund, Germany. His diploma thesis about universal hash functions was recognized as the best of the semester.
He has been working in the IT industry for more than 20 years and has great experience in the development and architecture of business and enterprise applications. In his current role, he evolves an LLVM/Clang-based compiler.
For some years, he was the maintainer of LDC, the LLVM-based D compiler. He is the author of D Web Development, published by Packt. In the past, he was also a speaker in the LLVM developer room at the Free and Open Source Software Developers' European Meeting (FOSDEM).
Suyog Sarda is a professional software engineer and an open source enthusiast. He focuses on compiler development and compiler tools. He is an active contributor to the LLVM open source community. Suyog was also involved in code performance improvements for the ARM and x86 architectures. He has been a part of the compiler team for the Tizen project. His interest in compiler development lies more in code optimization and vectorization. Previously, he has authored a book on LLVM, titled LLVM Cookbook, published by Packt. Apart from compilers, Suyog is also interested in Linux kernel development. He published a technical paper titled Secure Co-resident Virtualization in Multicore Systems by VM Pinning and Page Coloring at the IEEE Proceedings of the 2012 International Conference on Cloud Computing, Technologies, Applications, and Management at the Birla Institute of Technology, Dubai. He earned a bachelor's degree in computer technology from the College of Engineering, Pune, India.
Constructing a compiler is a complex and fascinating task. The LLVM project provides reusable components for your compiler. The LLVM core libraries implement a world-class optimizing code generator, which translates a source language-independent intermediate representation into machine code for all popular CPU architectures. The compilers for many programming languages already take advantage of LLVM technology.
This book teaches you how to implement your own compiler and how to use LLVM to achieve it. You will learn how the frontend of a compiler turns source code into an abstract syntax tree, and how to generate intermediate representation (IR) from it. Adding an optimization pipeline to your compiler, you can compile the IR to performant machine code.
The LLVM framework can be extended in several ways, and you will learn how to add new passes, new machine instructions, and even a completely new backend to LLVM. Advanced topics such as compiling for a different CPU architecture and extending clang and the clang static analyzer with your own plugins and checkers are also covered. This book follows a practical approach and is packed with example source code, which makes it easy to apply the knowledge gained in your own projects.
This book is for compiler developers, enthusiasts, and engineers who are new to LLVM and are interested in learning about the LLVM framework. It is also useful for C++ software engineers looking to use compiler-based tools for code analysis and improvement, as well as casual users of LLVM libraries who want to gain more knowledge of LLVM essentials. Intermediate-level experience with C++ programming is mandatory to understand the concepts covered in this book more effectively.
Chapter 1, Installing LLVM, explains how to set up and use your development environment. At the end of the chapter, you will have compiled the LLVM libraries and learned how to customize the build process.
Chapter 2, Touring the LLVM Source, introduces you to the various LLVM projects and discusses the common directory layout shared by all projects. You will create your first project using the LLVM core libraries, and you will also compile it for a different CPU architecture.
Chapter 3, The Structure of a Compiler, gives you an overview of the components of a compiler. At the end of the chapter, you will have implemented your first compiler producing LLVM IR.
Chapter 4, Turning the Source File into an Abstract Syntax Tree, teaches you in detail how to implement the frontend of a compiler. You will create your own frontend for a small programming language, ending with the construction of an abstract syntax tree.
Chapter 5, Basics of IR Generation, shows you how to generate LLVM IR from an abstract syntax tree. At the end of the chapter, you will have implemented a compiler for the example language, emitting assembly text or object code files as a result.
Chapter 6, IR Generation for High-Level Language Constructs, illustrates how you translate source language features commonly found in high-level programming languages to LLVM IR. You will learn about the translation of aggregate data types, the various options to implement class inheritance and virtual functions, and how to comply with the application binary interface of your system.
Chapter 7, Advanced IR Generation, shows you how to generate LLVM IR for exception-handling statements in the source language. You will also learn how to add metadata for type-based alias analysis, and how to add debug information to the generated LLVM IR, and you will extend your compiler-generated metadata.
Chapter 8, Optimizing IR, explains the LLVM pass manager. You will implement your own pass, both as part of LLVM and as a plugin, and you will learn how to add your new pass to the optimizing pass pipeline.
Chapter 9, Instruction Selection, shows how LLVM lowers IR to machine instructions. You will learn how instructions are defined in LLVM, and you will add a new machine instruction to LLVM so that instruction selection takes the new instruction into account.
Chapter 10, JIT Compilation, discusses how you can use LLVM to implement a just-in-time (JIT) compiler. By the end of the chapter, you will have implemented your own JIT compiler for LLVM IR in two different ways.
Chapter 11, Debugging Using LLVM Tools, explores the details of various libraries and components of LLVM that help you identify bugs in your application. You will use the sanitizers to identify buffer overflows and other bugs. With the libFuzzer library, you will test functions with random data as input, and XRay will help you find performance bottlenecks. You will use the clang static analyzer to identify bugs at the source level, and you will learn that you can add your own checker to the analyzer. You will also learn how to extend clang with your own plugin.
Chapter 12, Creating Your Own Backend, explains how you can add a new backend to LLVM. You will implement all the necessary classes, and at the end of the chapter you will compile LLVM IR to yet another CPU architecture.
You need a computer running Linux, Windows, macOS, or FreeBSD, with the development toolchain installed for the operating system. Please see the table for the required tools. All tools should be in the search path of your shell.
To view the DAG visualization in Chapter 9, Instruction Selection, you must have the Graphviz software from https://graphviz.org/ installed. By default, the generated image is in PDF format, and you need a PDF viewer to show it.
To create the flame graph in Chapter 11, Debugging Using LLVM Tools, you need to install the scripts from https://github.com/brendangregg/FlameGraph. To run the script, you also need a recent version of Perl installed, and to view the graph you need a web browser capable of displaying SVG files, which all modern browsers do. To see the Chrome Trace Viewer visualization in the same chapter, you need to have the Chrome browser installed.
If you are using the digital version of this book, we advise you to type the code yourself or access the code via the GitHub repository (link available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.
You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Learn-LLVM-12. In case there's an update to the code, it will be updated on the existing GitHub repository.
We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
Code in Action videos for this book can be viewed at https://bit.ly/3nllhED
We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: https://static.packt-cdn.com/downloads/9781839213502_ColorImages.pdf.
There are a number of text conventions used throughout this book.
Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "Most notably, the LLVM core libraries are in the llvm subdirectory."
A block of code is set as follows:
#include "llvm/IR/IRPrintingPasses.h"
#include "llvm/IR/LegacyPassManager.h"
#include "llvm/Support/ToolOutputFile.h"
When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:
switch (Kind) {
// Many more cases
case m88k: return "m88k";
}
Bold: Indicates a new term, an important word, or words that you see onscreen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: "Select System info from the Administration panel."
Tips or important notes
Appear like this.
Feedback from our readers is always welcome.
General feedback: If you have questions about any aspect of this book, mention the book title in the subject of your message and email us at [email protected].
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata, selecting your book, clicking on the Errata Submission Form link, and entering the details.
Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.
Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!
For more information about Packt, please visit packt.com.
In this section, you will learn how to compile LLVM by yourself, and how you can tailor the build to your needs. You will understand how LLVM projects are organized, and you will create your first project utilizing LLVM. You will also learn how to compile LLVM and applications using LLVM for a different CPU architecture. Finally, you will explore the overall structure of a compiler, while creating a small compiler yourself.
This section comprises the following chapters:
Chapter 1, Installing LLVM
Chapter 2, Touring the LLVM Source
Chapter 3, The Structure of a Compiler

To learn how to work with LLVM, it is best to begin by compiling LLVM from source. LLVM is an umbrella project, and its GitHub repository contains the sources for all the projects that belong to LLVM. Each LLVM project is in a top-level directory of the repository. Besides cloning the repository, your system must also have all tools that are required by the build system installed.
In this chapter, you will learn about the following topics:
Getting the prerequisites ready, which will show you how to set up your build system.
Building with CMake, which will cover how to compile and install the LLVM core libraries and Clang with CMake and Ninja.
Customizing the build process, which will talk about the various ways we can influence the build process.

To work with LLVM, your development system must run a common operating system such as Linux, FreeBSD, macOS, or Windows. Building LLVM and Clang with debug symbols enabled easily needs tens of gigabytes of disk space, so be sure that your system has plenty of disk space available – in this scenario, you should have 30 GB of free space.
The required disk space depends heavily on the chosen build options. For example, building only the LLVM core libraries in release mode, while targeting only one platform, requires about 2 GB of free disk space, which is the bare minimum needed. To reduce compile times, a fast CPU (such as a quad-core CPU with a 2.5 GHz clock speed) and a fast SSD are also helpful.
It is even possible to build LLVM on a small device such as a Raspberry Pi – it just takes a lot of time to do so. I developed the examples in this book on a laptop with an Intel quad-core CPU running at 2.7 GHz clock speed, with 40 GB RAM and 2.5 TB SSD disk space. This system is well-suited for the development task at hand.
Your development system must have some prerequisite software installed. Let's review the minimal required versions of these software packages.
Note
Linux distributions often contain more recent versions that can be used. The version numbers are suitable for LLVM 12. Later versions of LLVM may require more recent versions of the packages mentioned here.
To check out the source from GitHub, you need git (https://git-scm.com/). There is no requirement for a specific version. The GitHub help pages recommend using at least version 1.7.10.
The LLVM project uses CMake (https://cmake.org/) as the build file generator. At least version 3.13.4 is required. CMake can generate build files for various build systems. In this book, Ninja (https://ninja-build.org/) is used because it is fast and available on all platforms. The latest version, 1.9.0, is recommended.
Obviously, you also need a C/C++ compiler. The LLVM projects are written in modern C++, based on the C++14 standard. A conforming compiler and standard library are required. The following compilers are known to work with LLVM 12:
gcc 5.1.0 or later
Clang 3.5 or later
Apple Clang 6.0 or later
Visual Studio 2017 or later

Please be aware that with further development of the LLVM project, the requirements for the compiler are most likely to change. At the time of writing, there are discussions about moving to C++17 and dropping Visual Studio 2017 support. In general, you should use the latest compiler version available for your system.
Python (https://python.org/) is used to generate the build files and to run the test suite. It should be at least version 3.6.
Although not covered in this book, there may be reasons why you need to use Make instead of Ninja. In this case, you need to use GNU Make (https://www.gnu.org/software/make/) version 3.79 or later. The usage of both build tools is very similar. It is sufficient to replace ninja in each command with make for the scenarios described here.
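Before installing anything, you can quickly check which of these tools are already present on your system and in which versions. The following sketch simply probes the shell search path; the tool list mirrors the requirements above (on your system, the compiler may be named gcc/g++ or clang/clang++ instead of the generic cc/c++):

```shell
# Probe the search path for each required tool and print its version,
# or NOT FOUND if the tool is missing.
for tool in git cmake ninja python3 cc c++; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: $("$tool" --version 2>&1 | head -n 1)"
  else
    echo "$tool: NOT FOUND"
  fi
done
```

Any tool reported as NOT FOUND can then be installed with the package manager commands shown in the following sections.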
To install the prerequisite software, the easiest thing to do is use the package manager from your operating system. In the following sections, the commands you must enter to install the software for the most popular operating systems are shown.
Ubuntu 20.04 uses the APT package manager. Most of the basic utilities are already installed; only the development tools are missing. To install all the packages at once, type the following:
$ sudo apt install -y gcc g++ git cmake ninja-build
The package manager for Fedora 33 and RedHat Enterprise Linux 8.3 is called DNF. Like Ubuntu, most of the basic utilities are already installed. To install all the packages at once, type the following:
$ sudo dnf install -y gcc gcc-c++ git cmake ninja-build
On FreeBSD 12 or later, you must use the PKG package manager. FreeBSD differs from Linux-based systems in that Clang is the preferred compiler. To install all the packages at once, type the following:
$ sudo pkg install -y clang git cmake ninja
For development on macOS, it is best to install Xcode from the App Store. While the Xcode IDE is not used in this book, it comes with the required C/C++ compilers and supporting utilities. To install the other tools, you can use the Homebrew package manager (https://brew.sh/). To install all the packages at once, type the following:
$ brew install git cmake ninja
Like macOS, Windows does not come with a package manager. The easiest way to install all the software is to use the Chocolatey (https://chocolatey.org/) package manager. To install all the packages at once, type the following:
$ choco install visualstudio2019buildtools cmake ninja git \
  gzip bzip2 gnuwin32-coreutils.install
Please note that this only installs the build tools from Visual Studio 2019. If you would like to get the Community Edition (which includes the IDE), then you must install the visualstudio2019community package instead of visualstudio2019buildtools. Part of the Visual Studio 2019 installation is the x64 Native Tools Command Prompt for VS 2019. When you use this command prompt, the compiler is automatically added to the search path.
The LLVM project uses Git for version control. If you have not used Git before, then you should do some basic configuration of Git first before continuing; that is, setting a username and email address. Both pieces of information are used if you commit changes. In the following commands, replace Jane with your name and [email protected] with your email:
$ git config --global user.email "[email protected]"
$ git config --global user.name "Jane"
By default, Git uses the vi editor for commit messages. If you would prefer using another editor, then you can change the configuration in a similar way. To use the nano editor, type the following:
$ git config --global core.editor nano
For more information about git, please see the Git Version Control Cookbook - Second Edition by Packt Publishing (https://www.packtpub.com/product/git-version-control-cookbook/9781782168454).
With the build tools ready, you can now check out all the LLVM projects from GitHub. The command for doing this is essentially the same on all platforms. However, on Windows, it is recommended to turn off auto-translation for line endings.
Let's review this process in three parts: cloning the repository, creating a build directory, and generating the build system files.
On all non-Windows platforms, type in the following command to clone the repository:
$ git clone https://github.com/llvm/llvm-project.git
On Windows, you must add the option to disable line endings from being auto-translated. Here, type the following:
$ git clone --config core.autocrlf=false \
  https://github.com/llvm/llvm-project.git
This git command clones the latest source code from GitHub into a local directory named llvm-project. Now, change the current directory to the new llvm-project directory with the following command:
$ cd llvm-project
Inside the directory is all the LLVM projects, each in its own directory. Most notably, the LLVM core libraries are in the llvm subdirectory. The LLVM project uses branches for subsequent release development ("release/12.x") and tags ("llvmorg-12.0.0") to mark a certain release. With the preceding clone command, you get the current development state. This book uses LLVM 12. To check out the first release of LLVM 12, type the following:
$ git checkout llvmorg-12.0.0
With this, you have cloned the whole repository and checked out a tag. This is the most flexible approach.
Git also allows you to clone only a branch or a tag (including history). With git clone --branch llvmorg-12.0.0 https://github.com/llvm/llvm-project.git, you check out the same tag as we did previously, but only the history for this tag is cloned. With the additional --depth=1 option, you prevent the history from being cloned too. This saves time and space but obviously limits what you can do locally.
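If you want to see the effect of a shallow clone without downloading the LLVM repository, the following self-contained sketch creates a throwaway local repository with two commits and a tag, and then clones it with --branch and --depth=1. The repository name, tag, and paths here are made up for illustration; for LLVM, you would use the GitHub URL instead:

```shell
# Build a tiny throwaway repository with two commits and a tag.
tmp=$(mktemp -d)
git init -q "$tmp/upstream"
cd "$tmp/upstream"
git -c user.email=a@b -c user.name=a commit -q --allow-empty -m "first"
git -c user.email=a@b -c user.name=a commit -q --allow-empty -m "second"
git tag llvmorg-12.0.0

# Clone only the tagged state, without the history before it.
cd "$tmp"
git clone -q --branch llvmorg-12.0.0 --depth=1 "file://$tmp/upstream" shallow

# Only a single commit was fetched:
git -C shallow rev-list --count HEAD
```

The shallow clone contains exactly one commit, which demonstrates the space savings, and also why commands that need history (such as git log across the full project, or git bisect) will not work in such a clone.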
The next step is to create a build directory.
Unlike many other projects, LLVM does not support in-source builds and requires a separate build directory. This can easily be created inside the llvm-project directory. Change into this directory with the following command:
$ cd llvm-project
Then, create a build directory called build for simplicity. Here, the commands for Unix and Windows systems differ. On Unix-like systems, you should use the following command:
$ mkdir build
On Windows, you should use the following command:
$ md build
Then, change into the build directory:
$ cd build
Now, you are ready to create the build system files with the CMake tool inside this directory.
To generate the build system files that will compile LLVM and Clang using Ninja, run the following command:
$ cmake -G Ninja -DLLVM_ENABLE_PROJECTS=clang ../llvm
Tip
On Windows, the backslash character, \, is the directory name separator. CMake automatically translates the Unix separator, /, into the Windows one.
The -G option tells CMake which system to generate build files for. The most often used options are as follows:
Ninja: For the Ninja build system
Unix Makefiles: For GNU Make
Visual Studio 15 2017 and Visual Studio 16 2019: For Visual Studio and MS Build
Xcode: For Xcode projects

The generation process can be influenced by setting various variables with the -D option. Usually, they are prefixed with CMAKE_ (if defined by CMake) or LLVM_ (if defined by LLVM). With the LLVM_ENABLE_PROJECTS=clang variable setting, CMake generates build files for Clang in addition to LLVM. The last part of the command tells CMake where to find the LLVM core library source. More on that in the next section.
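The generator choice and -D variables combine freely in a single invocation. As an illustration (the chosen build type is just an example; the variables themselves are explained later in this chapter), the following command selects the Ninja generator, adds Clang to the build, and requests a release build:

```shell
$ cmake -G Ninja -DLLVM_ENABLE_PROJECTS=clang \
    -DCMAKE_BUILD_TYPE=Release ../llvm
```
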
Once the build files have been generated, LLVM and Clang can be compiled with the following command:
$ ninja
Depending on the hardware resources, this command takes between 15 minutes (a server with lots of CPU cores and memory and fast storage) and several hours (dual-core Windows notebook with limited memory) to run. By default, Ninja utilizes all available CPU cores. This is good for compilation speed but may prevent other tasks from running. For example, on a Windows-based notebook, it is almost impossible to surf the internet while Ninja is running. Fortunately, you can limit resource usage with the -j option.
Let's assume you have four CPU cores available and that Ninja should only use two (because you have parallel tasks to run). Here, you should use the following command for compilation:
$ ninja -j2
Once compilation is finished, a best practice is to run the test suite to check if everything works as expected:
$ ninja check-all
Again, the runtime of this command varies widely due to the available hardware resources. The Ninja check-all target runs all test cases. Targets are generated for each directory containing test cases. Using check-llvm instead of check-all runs the LLVM tests but not the Clang tests; check-llvm-codegen only runs the tests in the CodeGen directory from LLVM (that is, the llvm/test/CodeGen directory).
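For instance, the two narrower targets just mentioned would be invoked like this:

```shell
$ ninja check-llvm
$ ninja check-llvm-codegen
```
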
You can also do a quick manual check. One of the LLVM applications you will be using is llc, the LLVM compiler. If you run it with the -version option, it shows the LLVM version, its host CPU, and all its supported architectures:
$ bin/llc -version
If you have trouble getting LLVM compiled, then you should consult the Common Problems section of the Getting Started with the LLVM System documentation (https://llvm.org/docs/GettingStarted.html#common-problems) for solutions to typical problems.
Finally, install the binaries:
$ ninja install
On a Unix-like system, the install directory is /usr/local. On Windows, C:\Program Files\LLVM is used. This can be changed, of course. The next section explains how.
The CMake system uses a project description in the CMakeLists.txt file. The top-level file is in the llvm directory; that is, llvm/CMakeLists.txt. Other directories also contain CMakeLists.txt files, which are recursively included during the build-file generation.
Based on the information provided in the project description, CMake checks which compilers have been installed, detects libraries and symbols, and creates the build system files, such as build.ninja or Makefile (depending on the chosen generator). It is also possible to define reusable modules, such as a function to detect if LLVM is installed. These scripts are placed in the special cmake directory (llvm/cmake), which is searched automatically during the generation process.
The build process can be customized by defining CMake variables. The -D command-line option is used to set a variable to a value. These variables are used in CMake scripts. Variables defined by CMake itself are almost always prefixed with CMAKE_, and these variables can be used in all projects. Variables defined by LLVM are prefixed with LLVM_ but they can only be used if the project definition includes the use of LLVM.
Some variables are initialized with the values of environment variables. The most notable are CC and CXX, which define the C and C++ compilers to be used for building. CMake tries to locate a C and a C++ compiler automatically, using the current shell search path. It picks the first compiler that's found. If you have several compilers installed, such as gcc and Clang or different versions of Clang, then this might not be the compiler you want for building LLVM.
Suppose you would like to use clang9 as the C compiler and clang++9 as the C++ compiler. Here, you can invoke CMake in a Unix shell in the following way:
$ CC=clang9 CXX=clang++9 cmake ../llvm
This sets the value of the environment variables for the invocation of cmake. If necessary, you can specify an absolute path for the compiler executables.
CC is the default value of the CMAKE_C_COMPILER CMake variable, while CXX is the default value of the CMAKE_CXX_COMPILER CMake variable. Instead of using the environment variables, you can set the CMake variables directly. This is equivalent to the preceding call:
$ cmake -DCMAKE_C_COMPILER=clang9 \
    -DCMAKE_CXX_COMPILER=clang++9 ../llvm
Other useful variables defined by CMake are as follows:
CMAKE_INSTALL_PREFIX: A path prefix that is prepended to every path during installation. The default is /usr/local on Unix and C:\Program Files\<Project> on Windows. To install LLVM in the /opt/llvm directory, you must specify -DCMAKE_INSTALL_PREFIX=/opt/llvm. The binaries are copied to /opt/llvm/bin, the library files are copied to /opt/llvm/lib, and so on.

CMAKE_BUILD_TYPE: Different types of builds require different settings. For example, a debug build needs to specify options for generating debug symbols and usually links against debug versions of system libraries. In contrast, a release build uses optimization flags and links against production versions of libraries. This variable is only used for build systems that can only handle one build type, such as Ninja or Make. For IDE build systems, all variants are generated, and you must use the mechanism of the IDE to switch between build types. Some possible values are as follows:

DEBUG: Build with debug symbols
RELEASE: Build with optimization for speed
RELWITHDEBINFO: Release build with debug symbols
MINSIZEREL: Build with optimization for size

The default build type is DEBUG. To generate build files for a release build, you must specify -DCMAKE_BUILD_TYPE=RELEASE.
CMAKE_C_FLAGS and CMAKE_CXX_FLAGS: These are extra flags that are used when compiling C and C++ source files. The initial values are taken from the CFLAGS and CXXFLAGS environment variables, which can be used as alternatives.

CMAKE_MODULE_PATH: Specifies additional directories that are searched for CMake modules. The specified directories are searched before the default ones. The value is a semicolon-separated list of directories.

PYTHON_EXECUTABLE: If the Python interpreter is not found, or if the wrong one is picked when you have installed multiple versions of it, you can set this variable to the path of the Python binary. This variable only takes effect if the Python module of CMake is included (which is the case for LLVM).

CMake provides built-in help for variables. The --help-variable var option prints help for the var variable. For instance, you can type the following to get help for CMAKE_BUILD_TYPE:
$ cmake --help-variable CMAKE_BUILD_TYPE
You can also list all the variables with the following command:
$ cmake --help-variable-list
This list is very long. You may want to pipe the output to more or a similar program.
The variables defined by LLVM work in the same way as those defined by CMake, except that there is no built-in help. The most useful variables are as follows:
LLVM_TARGETS_TO_BUILD: LLVM supports code generation for different CPU architectures. By default, all these targets are built. Use this variable to specify the list of targets to build, separated by semicolons. The current targets are AArch64, AMDGPU, ARM, BPF, Hexagon, Lanai, Mips, MSP430, NVPTX, PowerPC, RISCV, Sparc, SystemZ, WebAssembly, X86, and XCore. all can be used as shorthand for all targets. The names are case-sensitive. To enable only the PowerPC and SystemZ targets, you must specify -DLLVM_TARGETS_TO_BUILD="PowerPC;SystemZ".

LLVM_ENABLE_PROJECTS: This is a list of the projects you want to build, separated by semicolons. The source for the projects must be at the same level as the llvm directory (side-by-side layout). The current list is clang, clang-tools-extra, compiler-rt, debuginfo-tests, libc, libclc, libcxx, libcxxabi, libunwind, lld, lldb, llgo, mlir, openmp, parallel-libs, polly, and pstl. all can be used as shorthand for all the projects in this list. To build Clang and llgo together with LLVM, you must specify -DLLVM_ENABLE_PROJECTS="clang;llgo".

LLVM_ENABLE_ASSERTIONS: If set to ON, then assertion checks are enabled. These checks help find errors and are very useful during development. The default value is ON for a DEBUG build and OFF otherwise. To turn assertion checks on (for example, for a RELEASE build), you must specify -DLLVM_ENABLE_ASSERTIONS=ON.

LLVM_ENABLE_EXPENSIVE_CHECKS: This enables some expensive checks that can really slow down compilation or consume large amounts of memory. The default value is OFF. To turn these checks on, you must specify -DLLVM_ENABLE_EXPENSIVE_CHECKS=ON.

LLVM_APPEND_VC_REV: LLVM tools such as llc display the LLVM version they are based on, besides other information, if the --version command-line option is provided. This version information is based on the LLVM_REVISION C macro. By default, not only the LLVM version but also the Git hash of the latest commit is part of the version information.
This is handy in case you are following the development of the master branch because it makes it clear which Git commit the tool is based on. If this isn't required, then it can be turned off with -DLLVM_APPEND_VC_REV=OFF.

LLVM_ENABLE_THREADS: LLVM automatically includes thread support if a threading library is detected (usually, the pthreads library). Furthermore, in this case, LLVM assumes that the compiler supports thread-local storage (TLS). If you don't want thread support or your compiler does not support TLS, then you can turn it off with -DLLVM_ENABLE_THREADS=OFF.

LLVM_ENABLE_EH: The LLVM projects do not use C++ exception handling, so they turn exception support off by default. This setting can be incompatible with other libraries your project is linking with. If needed, you can enable exception support by specifying -DLLVM_ENABLE_EH=ON.

LLVM_ENABLE_RTTI: LLVM uses a lightweight, self-built system for runtime type information. Generating C++ RTTI is turned off by default. Like the exception handling support, this may be incompatible with other libraries. To turn on generation of C++ RTTI, you must specify -DLLVM_ENABLE_RTTI=ON.

LLVM_ENABLE_WARNINGS: Compiling LLVM should generate no warning messages if possible. Due to this, the option to print warning messages is turned on by default. To turn it off, you must specify -DLLVM_ENABLE_WARNINGS=OFF.

LLVM_ENABLE_PEDANTIC: The LLVM source should be C/C++ language standard-conforming; hence, pedantic checking of the source is enabled by default. If possible, compiler-specific extensions are also disabled. To reverse this setting, you must specify -DLLVM_ENABLE_PEDANTIC=OFF.

LLVM_ENABLE_WERROR: If set to ON, then all warnings are treated as errors: the compilation aborts as soon as a warning is found. It helps to find all the remaining warnings in the source. By default, it is turned off.
To turn it on, you must specify -DLLVM_ENABLE_WERROR=ON.

LLVM_OPTIMIZED_TABLEGEN: Usually, the tablegen tool is built with the same options as the other parts of LLVM. At the same time, tablegen is used to generate large parts of the code generator. As a result, tablegen is much slower in a debug build, increasing the compile time noticeably. If this option is set to ON, then tablegen is compiled with optimization turned on, even for a debug build, possibly reducing compile time. The default is OFF. To turn this on, you must specify -DLLVM_OPTIMIZED_TABLEGEN=ON.

LLVM_USE_SPLIT_DWARF: If the build compiler is gcc or Clang, then turning on this option instructs the compiler to generate the DWARF debug information in a separate file. The reduced size of the object files reduces the link time of debug builds significantly. The default is OFF. To turn this on, you must specify -DLLVM_USE_SPLIT_DWARF=ON.

LLVM defines many more CMake variables. You can find the complete list in the LLVM documentation of CMake (https://releases.llvm.org/12.0.0/docs/CMake.html#llvm-specific-variables). The preceding list only contains the ones you are likely to need.
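Putting several of these variables together, a typical configuration command could look as follows. This is a sketch: the build directory location, the source path, and the choice of variables are assumptions, not requirements.

```shell
# Release build of LLVM and Clang for the X86 target only, installed
# under /opt/llvm; assertions stay enabled and tablegen is built
# with optimization to speed up the build
$ cmake -G Ninja ../llvm-project/llvm \
    -DCMAKE_BUILD_TYPE=RELEASE \
    -DCMAKE_INSTALL_PREFIX=/opt/llvm \
    -DLLVM_TARGETS_TO_BUILD=X86 \
    -DLLVM_ENABLE_PROJECTS=clang \
    -DLLVM_ENABLE_ASSERTIONS=ON \
    -DLLVM_OPTIMIZED_TABLEGEN=ON
```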
In this chapter, you prepared your development machine to compile LLVM. You cloned the LLVM GitHub repository and compiled your own versions of LLVM and Clang. The build process can be customized with CMake variables. You also learned about useful variables and how to change them. Equipped with this knowledge, you can tweak LLVM for your needs.
In the next chapter, we will take a closer look at the contents of the LLVM mono repository. You will learn which projects are in it and how the projects are structured. You will then use this information to create your own project using LLVM libraries. Finally, you will learn how to compile LLVM for a different CPU architecture.
The LLVM mono repository contains all the projects under the llvm-project root directory. All projects follow a common source layout. To use LLVM effectively, it is good to know what is available and where to find it. In this chapter, you will learn about the following:
The contents of the LLVM mono repository, covering the most important top-level projects

The layout of an LLVM project, showing the common source layout used by all projects

How to create your own projects using LLVM libraries, covering all the ways you can use LLVM in your own projects

How to target a different CPU architecture, showing the steps required to cross-compile to another system

The code files for the chapter are available at https://github.com/PacktPublishing/Learn-LLVM-12/tree/master/Chapter02/tinylang
You can find the code in action videos at https://bit.ly/3nllhED
In Chapter 1, Installing LLVM, you cloned the LLVM mono repository. This repository contains all LLVM top-level projects. They can be grouped as follows:
LLVM core libraries and additions

Compilers and tools

Runtime libraries

In the next sections, we will take a closer look at these groups.
The LLVM core libraries are in the llvm directory. This project provides a set of libraries with optimizers and code generation for well-known CPUs. It also provides tools based on these libraries. The LLVM static compiler llc takes a file written in LLVM intermediate representation (IR) as input and compiles it into either bitcode, assembler output, or a binary object file. Tools such as llvm-objdump and llvm-dwarfdump let you inspect object files, and those such as llvm-ar let you create an archive file from a set of object files. It also includes tools that help with the development of LLVM itself. For example, the bugpoint tool helps to find a minimal test case for a crash inside LLVM. llvm-mc is the machine code playground: this tool assembles and disassembles machine instructions and also outputs the encoding, which is a great help when adding new instructions.
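To get a feel for these tools, the following commands show llc and llvm-mc in action. The file names are hypothetical, and the exact output varies with the LLVM version and host target.

```shell
# Compile an LLVM IR file to assembler output for the host target
$ llc hello.ll -o hello.s

# Assemble a single x86-64 instruction and print its byte encoding
$ echo 'addl %ebx, %eax' | llvm-mc --triple=x86_64 --show-encoding
```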
The LLVM core libraries are written in C++. Additionally, a C interface and bindings for Go, OCaml, and Python are provided.
The Polly project, located in the polly directory, adds another set of optimizations to LLVM. It is based on a mathematical representation called the polyhedral model. With this approach, complex optimizations such as loops optimized for cache locality are possible.
The MLIR project aims to provide a multi-level intermediate representation for LLVM. The LLVM IR is already at a low level, and certain information from the source language is lost during IR generation in the compiler. The idea of MLIR is to make the LLVM IR extensible and capture this information in a domain-specific representation. You will find the source in the mlir directory.
A complete C/C++/Objective-C/Objective-C++ compiler named clang (http://clang.llvm.org/) is part of the LLVM project. The source is located in the clang directory. It provides a set of libraries for lexing, parsing, semantic analysis, and generation of LLVM IR from C, C++, Objective-C, and Objective-C++ source files. The small tool clang is the compiler driver, based on these libraries. Another useful tool is clang-format, which can format C/C++ source files and source fragments according to rules provided by the user.
Clang aims to be compatible with GCC, the GNU C/C++ compiler, and CL, the Microsoft C/C++ compiler.
Additional tools for C/C++ are provided by the clang-tools-extra project in the directory of the same name. Most notable here is clang-tidy, which is a lint-style checker for C/C++. clang-tidy uses the clang libraries to parse the source code and checks the source with static analysis. The tool can catch more potential errors than the compiler, at the expense of more runtime.
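Both tools are easy to try out on a single file. The file name and the selection of checks in the following commands are just examples:

```shell
# Reformat a source file in place according to the LLVM coding style
$ clang-format -i -style=LLVM main.cpp

# Run a subset of clang-tidy checks; arguments after -- are passed
# to the compiler front end
$ clang-tidy -checks='bugprone-*,modernize-*' main.cpp -- -std=c++14
```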
Llgo is a compiler for the Go programming language, located in the llgo directory. It is written in Go and uses the Go bindings of the LLVM core libraries to interface with LLVM. Llgo aims to be compatible with the reference compiler (https://golang.org/), but currently, the only supported target is 64-bit x86 Linux. The project seems unmaintained and may be removed in the future.
The object files created by a compiler must be linked together with runtime libraries to form an executable. This is the job of lld (http://lld.llvm.org/), the LLVM linker that is located in the lld directory. The linker supports the ELF, COFF, Mach-O, and WebAssembly formats.
No compiler toolset is complete without a debugger! The LLVM debugger is called lldb (http://lldb.llvm.org/) and is located in the directory of the same name. The interface is similar to GDB, the GNU debugger, and the tool supports C, C++, and Objective-C out of the box. The debugger is extensible so support for other programming languages can be added easily.
In addition to a compiler, runtime libraries are required for complete programming language support. All the listed projects are located in the top-level directory in a directory of the same name:
The compiler-rt project provides programming language-independent support libraries. It includes generic functions, such as a 64-bit division for 32-bit i386, various sanitizers, the fuzzing library, and the profiling library.

The libunwind library provides helper functions for stack unwinding based on the DWARF standard. This is usually used for implementing exception handling of languages such as C++. The library is written in C and the functions are not tied to a specific exception handling model.

The libcxxabi library implements C++ exception handling on top of libunwind and provides the standard C++ functions for it.

libcxx is an implementation of the C++ standard library, including iostreams and the STL. In addition, the pstl project provides a parallel version of the STL algorithms.

libclc is the runtime library for OpenCL. OpenCL is a standard for heterogeneous parallel computing and helps with moving computational tasks to graphics cards.

libc aims to provide a complete C library. This project is still in its early stages.

Support for the OpenMP API is provided by the openmp project. OpenMP helps with multithreaded programming and can, for instance, parallelize loops based on annotations in the source.

Even though this is a long list of projects, the good news is that all projects are structured similarly. We will look at the general directory layout in the next section.
All LLVM projects follow the same idea of directory layout. To understand the idea, let's compare LLVM with GCC, the GNU Compiler Collection. GCC has provided mature compilers for decades for almost every system you can imagine. But, except for the compilers, there are no tools that take advantage of the code. The reason is that it is not designed for reuse. This is different with LLVM.
Every functionality has a clearly defined API and is put in a library of its own. The clang project has (among others) a library to lex a C/C++ source file into a token stream. The parser library turns this token stream into an abstract syntax tree (also backed by a library). Semantic analysis, code generation, and even the compiler driver are provided as a library. The well-known clang tool is only a small application linked against these libraries.
The advantage is obvious: when you want to build a tool that requires the abstract syntax tree (AST) of a C++ file, then you can reuse the functionality from these libraries to construct the AST. Semantic analysis and code generation are not required and you do not link against these libraries. This principle is followed by all LLVM projects, including the core libraries!
Each project has a similar organization. Because CMake is used for build file generation, each project has a CMakeLists.txt file that describes the building of the projects. If additional CMake modules or support files are required, then they are stored in the cmake subdirectory, with modules placed in cmake/modules.
Libraries and tools are mostly written in C++. Source files are placed under the lib directory and header files under the include directory. Because a project typically consists of several libraries, there are directories for each library in the lib directory. If necessary, this repeats. For example, inside the llvm/lib directory is the Target directory, which holds the code for the target-specific lowering. Besides some source files, there are again subdirectories for each target that are again compiled into libraries. Each of these directories has a CMakeLists.txt file that describes how to build the library and which subdirectories also contain source.
The include directory has an additional level. To make the names of the include files unique, the path name includes the project name, which is the first subdirectory under include. Only in this folder is the structure from the lib directory repeated.
The source of applications is inside the tools and utils directories. In the utils directory are internal applications that are used during compilation or testing. They are usually not part of a user installation. The tools directory contains applications for the end user. In both directories, each application has its own subdirectory. As with the lib directory, each subdirectory that contains source has a CMakeLists.txt file.
Correct code generation is a must for a compiler. This can only be achieved with a good test suite. The unittest directory contains unit tests that use the Google Test framework. This is mainly used for single functions and isolated functionality that can't be tested otherwise. In the test directory are the LIT tests. These tests use the llvm-lit utility to execute tests. llvm-lit scans a file for shell commands and executes them. The file contains the source code used as input for the test, for example, LLVM IR. Embedded in the file are commands to compile it, executed by llvm-lit. The output of this step is then verified, often with the help of the FileCheck utility. This utility reads check statements from one file and matches them against another file. The LIT tests themselves are in subdirectories under the test directory, loosely following the structure of the lib directory.
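To illustrate the mechanism, a minimal LIT test might look like the following sketch. The RUN line is executed by llvm-lit, and FileCheck matches the CHECK lines against the output of opt. The function and the checks are hypothetical examples, not taken from the LLVM test suite:

```llvm
; RUN: opt -S -instcombine %s | FileCheck %s

; instcombine should fold away the addition of zero,
; so the function body reduces to a plain return.
define i32 @add_zero(i32 %x) {
; CHECK-LABEL: @add_zero(
; CHECK: ret i32 %x
  %r = add i32 %x, 0
  ret i32 %r
}
```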
Documentation (usually as reStructuredText) is placed in the docs directory. If a project provides examples, they are in the examples directory.
Depending on the needs of the project, there can be other directories too. Most notably, some projects that provide runtime libraries place the source code in a src directory and use the lib directory for library export definitions. The compiler-rt and libclc projects contain architecture-dependent code. This is always placed in a subdirectory named after the target architecture (for example, i386 or ptx).
In summary, the general layout of a project that provides a sample library and has a driver tool looks like this:
Figure 2.1 – General project directory layout
Our own project will follow this organization, too.
Based on the information in the previous section, you can now create your own project using LLVM libraries. The following sections introduce a small language called Tiny. The project will be called tinylang. Here the structure for such a project is defined. Even though the tool in this section is only a Hello, world application, its structure has all the parts required for a real-world compiler.
The first question is if the tinylang project should be built together with LLVM (like clang), or if it should be a standalone project that just uses the LLVM libraries. In the former case, it is also necessary to decide where to create the project.
Let's first assume that tinylang should be built together with LLVM. There are different options for where to place the project. The first solution is to create a subdirectory for the project inside the llvm-projects directory. All projects in this directory are picked up and built as part of building LLVM. Before the side-by-side project layout was created, this was the standard way to build, for example, clang.
A second option is to place the tinylang project in the top-level directory. Because it is not an official LLVM project, the CMake script does not know about it. When running cmake, you need to specify -DLLVM_ENABLE_PROJECTS=tinylang to include the project in the build.
And the third option is to place the project directory somewhere else, outside the llvm-project directory. Of course, you need to tell CMake about this location. If the location is /src/tinylang, for example, then you need to specify -DLLVM_ENABLE_PROJECTS=tinylang -DLLVM_EXTERNAL_TINYLANG_SOURCE_DIR=/src/tinylang.
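The latter two variants translate into configuration commands along the following lines. This is a sketch: the generator choice and the relative source path are assumptions about your setup.

```shell
# Variant 2: tinylang placed inside the llvm-project top-level directory
$ cmake -G Ninja -DLLVM_ENABLE_PROJECTS=tinylang ../llvm-project/llvm

# Variant 3: tinylang located outside the llvm-project directory,
# here at /src/tinylang as in the text
$ cmake -G Ninja \
    -DLLVM_ENABLE_PROJECTS=tinylang \
    -DLLVM_EXTERNAL_TINYLANG_SOURCE_DIR=/src/tinylang \
    ../llvm-project/llvm
```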
If you want to build the project as a standalone project, then it needs to find the LLVM libraries. This is done in the CMakeLists.txt file, which is discussed later in this section.