Notes/Primer on Clang Compiler Frontend (1) : Introduction and Architecture

Posted on Aug 5, 2024


These are my notes on chapters 1 and 2 of Clang Compiler Frontend by Ivan Murashko. The book teaches the fundamentals of LLVM to C++ engineers who want to learn about compilers in order to improve their code quality and overall development process. (I’ve referenced this book extensively, and a lot of the snippets here are from it. I’d highly recommend buying it for a deeper dive: https://www.amazon.com/Clang-Compiler-Frontend-Understand-internals/dp/1837630984)

We will explore Chapters 1 and 2, which describe the basic steps required to set up the development environment and touch upon Clang's architecture.

Chapter 1 (Introduction):

The Environment Setup is quite straightforward. You need a Unix-based OS, CLI Git, and Build Tools (CMake for Project Configuration, Ninja for Building the Project itself).
CMake is the primary build system for LLVM, replacing autoconf because of its better support for cross-platform building.
Ninja is a fast build system that does the minimum amount of work required: it keeps track of the dependencies between build targets and only builds or rebuilds targets that are out of date, so we are going to use it in conjunction with CMake!
Get the LLVM source code using git clone https://github.com/llvm/llvm-project.git and install CMake using brew install cmake. If you also want a prebuilt LLVM, install it using brew install llvm.
I’m going to skip over the history of the LLVM project, which is quite interesting in and of itself.

lld is the LLVM Linker tool. llvm and clang-tools-extra contain libraries and extra tools that we will use in the upcoming chapters. clang is the clang driver and frontend.

LLVM Projects (clang or llvm) include an include and a lib folder. include contains the header files while lib contains the implementation files.
Each LLVM Project folder contains a variety of tests. We will discuss those (particularly LLVM Integrated Tester) in Chapter 4.
clang-tools-extra contains Clangd (a language server that provides navigation info for IDEs) and Clang-tidy (a lint framework). Clang-format (a code formatting tool) lives in the clang tree itself, under clang/tools.

Chapter 1 (Building):

We will start with an overview of the build process and finish with building LLDB (The debugging tool) as a concrete example.
First, create a build folder and run the minimal configuration command from it, adding -G Ninja. More advanced configurations are available, but I will go with the minimal one here.

If you want to save space and time, use the command below with the following changes: -DLLVM_TARGETS_TO_BUILD (set it to your CPU architecture), -DLLVM_USE_SPLIT_DWARF (splits debug information into separate files, saving space and memory during the build process), and -DCMAKE_INSTALL_PREFIX (specifies the installation folder).
Note: The Debug build might be slow. A good compromise between debuggability and performance is Release with Debug Info (RelWithDebInfo) as your CMAKE_BUILD_TYPE.

A list of options can be found here: LLVM CMake Options.

We will jump straight into the Test Project (Syntax Check with a Clang tool): We will create a simple Clang tool that runs the compiler and checks the syntax for the provided source file. The project is an out-of-tree LLVM Project, meaning it will use LLVM but will be located outside the LLVM Source Tree.
Several actions are required to create the project. The commands have to be run from the build folder we created earlier. Make sure to change LLVM_TARGETS_TO_BUILD and LLVM_USE_LINKER if applicable.

The command I’m using is:
cmake -G Ninja -DCMAKE_BUILD_TYPE=Debug -DCMAKE_INSTALL_PREFIX=../install -DLLVM_TARGETS_TO_BUILD="ARM;X86;AArch64" -DLLVM_ENABLE_PROJECTS="clang" -DLLVM_USE_SPLIT_DWARF=ON -DBUILD_SHARED_LIBS=ON ../llvm

Then run ninja install.
Now you need to create two files: The Project Configuration Code CMakeLists.txt, and The Project Source Code SyntaxCheck.cpp. (You can find both files on the GitHub repository). Note: In the CMakeLists.txt you need to set the C++ standard to something other than C++98, which Apple’s Clang defaults to.

Now create a new directory called build, cd into build, and then run:
cmake -G Ninja ..
ninja

You should have syntax-check in the build folder now!
Use syntax-check --help to use the tool and play with it. You can pass it a cpp file, and if the file is valid, it will terminate successfully.

Chapter 2 (Clang Architecture and Compiler Workflow):

Compilers translate programs from one form to another, but they are also large software systems in their own right. In this chapter we explore the typical workflow of a compiler and Clang’s role in that workflow.

As we can see here, the compiler takes the source code (a cpp file, for example) and any compilation flags (we will explore a few of them in this chapter) and produces target code in the form of an object file.

The target code is machine code, which differs from one architecture to another, so modularizing the compilation process was a design goal, and that’s what using Clang and LLVM allows us to do!

Here is an overview of the Source code transformation process by a compiler

The source code (e.g. a cpp file) is transformed into IR by the frontend. The middle-end performs various optimizations on the IR and passes the optimized code to the backend, which generates the target code. The compile options (e.g. passing in a specific library for the linker to include) are used by all three phases as settings for the code transformations.
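The three phases can be sketched as three functions that hand a shared IR to each other. This is a toy model, not LLVM: the Instr struct, the Options struct, the "<int> + <int>" source language, and the mov/add output syntax are all hypothetical and exist only for this illustration.

```cpp
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical three-address IR for illustration only; it is not LLVM IR.
struct Instr {
  std::string op;  // "const" or "add"
  int dest;        // virtual register that is written
  int lhs;         // constant value for "const", source register for "add"
  int rhs;         // second source register for "add", unused for "const"
};
using IR = std::vector<Instr>;

// Stand-in for compile options that a phase can consult.
struct Options {
  bool optimize = true;
};

// Frontend: parses source of the form "<int> + <int>" into IR.
// Returns an empty IR if the source does not match that form.
IR frontend(const std::string &source) {
  std::istringstream in(source);
  int a = 0, b = 0;
  char plus = 0;
  in >> a >> plus >> b;
  if (!in || plus != '+')
    return {};
  return {{"const", 0, a, 0}, {"const", 1, b, 0}, {"add", 2, 0, 1}};
}

// Middle-end: folds an add of two known constants into a single constant.
// It does not remove the now-unused constants (no dead code elimination).
IR middleEnd(const IR &input, const Options &options) {
  if (!options.optimize)
    return input;
  std::map<int, int> constants;  // register -> known constant value
  IR output;
  for (const Instr &instr : input) {
    if (instr.op == "const") {
      constants[instr.dest] = instr.lhs;
      output.push_back(instr);
    } else if (instr.op == "add" && constants.count(instr.lhs) &&
               constants.count(instr.rhs)) {
      int folded = constants[instr.lhs] + constants[instr.rhs];
      constants[instr.dest] = folded;
      output.push_back({"const", instr.dest, folded, 0});
    } else {
      output.push_back(instr);
    }
  }
  return output;
}

// Backend: prints the IR as made-up assembly text.
std::string backend(const IR &ir) {
  std::ostringstream out;
  for (const Instr &instr : ir) {
    if (instr.op == "const")
      out << "mov r" << instr.dest << ", " << instr.lhs << "\n";
    else
      out << "add r" << instr.dest << ", r" << instr.lhs << ", r"
          << instr.rhs << "\n";
  }
  return out.str();
}

// The whole pipeline: frontend -> middle-end -> backend.
std::string compile(const std::string &source, const Options &options) {
  return backend(middleEnd(frontend(source), options));
}
```

Calling compile("2 + 3", Options{}) yields three mov lines, the last one loading the folded value 5 into r2. With optimize set to false, the last line is an add of r0 and r1 instead. Only the middle function changes behavior, which is the point of keeping the phases separate.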

The Frontend:

The primary goal of the frontend is to transform the source code into an Intermediate Representation (IR). The Lexer first converts the code into a set of tokens, then passes them to the Parser, which converts the tokens into a special structure called the Abstract Syntax Tree (AST). The final component, the Code Generator (Codegen), traverses the AST and generates the IR from it.

We are going to be using the Code below to demonstrate the workings of the Frontend.
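A function along these lines fits the rest of the walkthrough, which later refers to a max function containing the keywords int and return and the identifiers max, a, and b. The exact body is an assumption reconstructed from those references.

```cpp
// max.cpp: small example used to illustrate the frontend stages.
// The body is an assumed reconstruction; only the function name, the
// parameter names, and the int/return keywords are implied by the text.
int max(int a, int b) {
  if (a > b)
    return a;
  return b;
}
```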

The Lexer:

The Lexer first converts the input source into a stream of tokens, each token representing an “object” in the source.
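To make the idea of a token stream concrete, here is a minimal sketch of a lexer. It is a toy written for this note and has nothing to do with Clang's implementation: the TokenKind categories, the Token struct, and the three-word keyword list are all assumptions.

```cpp
#include <cctype>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical token categories; Clang's real token kinds are far more
// fine-grained.
enum class TokenKind { Keyword, Identifier, Number, Punctuation };

struct Token {
  TokenKind kind;
  std::string text;
};

// Splits the input into tokens, skipping whitespace.
std::vector<Token> lex(const std::string &source) {
  // Only the keywords needed for the max example.
  static const std::set<std::string> keywords = {"int", "if", "return"};
  std::vector<Token> tokens;
  std::size_t i = 0;
  while (i < source.size()) {
    unsigned char c = static_cast<unsigned char>(source[i]);
    if (std::isspace(c)) {
      ++i;
    } else if (std::isalpha(c) || c == '_') {
      // A word: keyword if it is in the list, identifier otherwise.
      std::size_t start = i;
      while (i < source.size() &&
             (std::isalnum(static_cast<unsigned char>(source[i])) ||
              source[i] == '_'))
        ++i;
      std::string word = source.substr(start, i - start);
      TokenKind kind =
          keywords.count(word) ? TokenKind::Keyword : TokenKind::Identifier;
      tokens.push_back({kind, word});
    } else if (std::isdigit(c)) {
      std::size_t start = i;
      while (i < source.size() &&
             std::isdigit(static_cast<unsigned char>(source[i])))
        ++i;
      tokens.push_back({TokenKind::Number, source.substr(start, i - start)});
    } else {
      // Every other character becomes a one-character punctuation token.
      tokens.push_back({TokenKind::Punctuation, std::string(1, source[i])});
      ++i;
    }
  }
  return tokens;
}
```

For example, lex("int max(int a, int b) { return a; }") produces 14 tokens, starting with the keyword int, the identifier max, and the punctuation token "(".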

The Parser:

The Parser then takes the tokens and creates the AST, which represents an abstract syntactic structure of the source code. Each node in the tree represents a construct in the source code (like a statement or an expression), while each edge between the nodes represents the relationships between those constructs.

The parser performs two activities: Syntax Analysis and Semantic Analysis. It will only produce an AST if neither of these checks fails.

Syntax Analysis is the parser checking whether the code is correct in terms of the grammar specified for the language. For example, the code below demonstrates wrong syntax:
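One possible illustration, assuming the max function used throughout this chapter: dropping a semicolon breaks the grammar of the return statement. The broken variant is my own example and is kept inside a comment so that the snippet as a whole still compiles.

```cpp
// Syntactically invalid variant (kept in a comment so this file compiles):
//
//   int max(int a, int b) {
//     if (a > b)
//       return a   // missing ';' breaks the grammar of a return statement
//     return b;
//   }
//
// Syntactically valid version of the same function.
int max(int a, int b) {
  if (a > b)
    return a;
  return b;
}
```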

A program can be syntactically correct and still make no sense! This is where Semantic Analysis comes into play. For example, the piece of code below contains an & (the address-of operator in C++) where it shouldn’t be.

The AST is mainly constructed as a result of Syntax Analysis, but for languages like C++, Semantic Analysis is crucial, particularly for template instantiation (which can be quite complex, as the compiler must perform tasks such as type checking, name resolution, and more).
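One possible illustration, again assuming the max function: returning &b is grammatically fine, but its type is int * while the function must return int. The broken variant is my own example and sits in a comment so the snippet compiles. The static_assert states the type rule that a semantic check relies on here, and addressOf is a hypothetical helper showing a valid use of the same operator.

```cpp
#include <type_traits>

// Semantically invalid variant (kept in a comment so this file compiles).
// The grammar accepts it, but '&b' has type 'int *' while the function
// must return 'int', and there is no implicit pointer-to-int conversion:
//
//   int max(int a, int b) {
//     if (a > b)
//       return a;
//     return &b;
//   }

// The type rule behind the rejection, checked at compile time.
static_assert(!std::is_convertible<int *, int>::value,
              "int* does not implicitly convert to int");

// The address-of operator is fine when the types agree.
int *addressOf(int &value) { return &value; }
```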

The Codegen:

The Codegen, or Code Generator, is the final component of the frontend. (Note: There is another Codegen component that is part of the backend, mainly used to generate target code.)

The Codegen takes the AST generated by the Parser and converts it into an Intermediate Representation (IR). The IR is a *Language-Independent* representation, allowing us to use the same middle-end (optimizations) with different frontends (e.g. C++ or Fortran). Another benefit is that if we have a new architecture, we can generate the target code specific to that architecture without having to change the steps up to the IR generation.

The Clang Driver:

The Clang driver is the command-line utility that manages the overall compilation process (called in the CLI by invoking ‘clang’), not to be confused with the Clang frontend. In other compilers (like GCC) the driver and the compiler can be different and separate executables, whereas Clang is a single executable that functions as both the driver and the compiler frontend. (Note: you can use Clang as the compiler frontend only by passing the ‘-cc1’ flag to it)

Here we invoked the Clang driver by giving the path to the clang executable. hello.cpp is the source file that we want to compile, and the -o flag sets the output path for the executable (/tmp/hello). -lstdc++ tells the linker to link against the C++ standard library (libstdc++), which provides the implementation behind headers such as <iostream>. It does not include any headers by itself.

Clang follows the typical compiler workflow shown before. You can pass in ‘-ccc-print-phases’ to show the phases of the workflow, which produces this output:

The Clang Frontend:
While the Clang compiler toolchain follows a similar workflow to other compilers, the frontend differs in certain aspects, mainly due to the complexity of the C++ language. For example, some features, such as macros, can alter the source code itself, while a typedef can influence the kind of token.

Clang can also generate output in a variety of formats, and based on the compiler options Clang can execute and produce the output for one action at a time, which I find quite pleasant. Here are a few of the compiler options:

As mentioned before, the Clang frontend differs from other frontends in a few aspects. For instance, as you can see below, the Lexer is referred to as the Preprocessor, because the lexer implementation is encapsulated in the (Preprocessor) class. This is because C++ code may require preprocessing before it can be tokenized, as in the case of macros.

While conventional compilers usually do the syntactic and semantic analysis within the parser, Clang splits them into two separate components. The ‘Parser’ component focuses solely on syntax analysis, while the ‘Sema’ component handles semantic analysis. (Note: The (CodeGenAction) class serves as the base class for the different output formats, such as ‘EmitBCAction’ and ‘EmitLLVMAction’ from the table above)

We will use the same ‘Max’ function code to examine the components of the Clang frontend. (Note: we use the -cc1 option, which allows us to invoke the Clang frontend directly, bypassing the Clang driver and giving us the ability to analyze the inner workings in greater detail)

Preprocessor:

We can dump the token stream by passing in the ‘-dump-tokens’ flag.
The output of the command is shown below:

There are many different types of tokens! We have keywords (int, return), identifiers (max, a, b), and special symbols (semicolon, and parentheses). These tokens are called normal tokens, which are returned by the lexer.

In addition to normal tokens, Clang has annotation tokens, which are special tokens that store additional semantic information. An annotation token can replace a sequence of normal tokens, which helps performance because the parser does not have to reparse those tokens when it backtracks. (I recommend reading more about this; as always, buying the book is a good way to do so)

The Preprocessor has two different helper classes to retrieve tokens: the (Lexer) class, used to convert a text buffer into a stream of tokens, and the (TokenLexer) class, used to retrieve tokens from macro expansions (Note: only one of these helpers can be active at a time).

A file pulled in by an #include directive may contain its own macros and further #include directives, so the (Preprocessor) class keeps a stack of lexers, one for each #include that is currently being processed, as shown here.
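The stack-of-lexers idea can be sketched with a toy preprocessor. Everything here is hypothetical and heavily simplified: files live in an in-memory FileMap, every line is treated as one token, a line of the form "#include name" enters another file, and there are no macros or include guards.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical in-memory "file system": file name -> lines.
using FileMap = std::map<std::string, std::vector<std::string>>;

class ToyPreprocessor {
public:
  ToyPreprocessor(const FileMap &files, const std::string &mainFile)
      : files_(files) {
    enterFile(mainFile);
  }

  // Stores the next token and returns true, or returns false once every
  // lexer on the stack is exhausted. Throws std::out_of_range if an
  // included file does not exist.
  bool lex(std::string &token) {
    static const std::string directive = "#include ";
    while (!stack_.empty()) {
      Cursor &top = stack_.back();
      const std::vector<std::string> &lines = files_.at(top.file);
      if (top.next == lines.size()) {
        stack_.pop_back();  // end of file: resume the including file
        continue;
      }
      const std::string &line = lines[top.next++];
      if (line.compare(0, directive.size(), directive) == 0) {
        // Push a lexer for the included file; 'top' is not used afterwards.
        enterFile(line.substr(directive.size()));
        continue;
      }
      token = line;
      return true;
    }
    return false;
  }

  // Number of files currently being lexed (the include depth).
  std::size_t depth() const { return stack_.size(); }

private:
  // One "lexer": a file name plus the position of the next line to read.
  struct Cursor {
    std::string file;
    std::size_t next;
  };

  void enterFile(const std::string &name) { stack_.push_back({name, 0}); }

  FileMap files_;
  std::vector<Cursor> stack_;
};
```

If main.cpp holds the lines "#include util.h", "int", "main" and util.h holds "int", "max", repeated calls to lex return int, max, int, main: the tokens of the included file come first, and the lexer for main.cpp resumes where it left off once util.h is exhausted.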

The Parser and Sema:

The Parser and Sema handle the syntax and semantic checks, producing an AST as the output. We can pass in the ‘-ast-dump’ flag to visualize the tree.

The output of the command is shown below:

Clang uses a hand-written recursive-descent parser, which is considered simple! Let’s explore how this parser works:

Parsing begins with a top-level declaration known as a (TranslationUnitDecl), representing a single translation unit. The C++ Standard defines a translation unit, roughly, as a source file together with all the headers and source files included via #include, minus any source lines skipped by conditional inclusion directives.

While the process of AST node creation varies across different C++ constructs, it usually follows this pattern:

The square boxes represent the corresponding classes, while the edges represent the function calls.

Here the Parser calls the Preprocessor::Lex method to retrieve a token from the lexer, say XXX. It then calls the method corresponding to that token, Parser::ParseXXX. This method calls Sema::ActOnXXX, which creates the corresponding object using XXX::Create. The process is then repeated with a new token.

As we can see here, the Preprocessor works with the Parser (the Parser and Sema components) in order to create the AST, which is essential not only for code generation, but also for code modification and analysis.
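The call pattern can be sketched with toy classes. The class and method names mirror the ones mentioned above, but the signatures, the string-based tokens, and the tiny grammar (a sequence of "return <integer> ;" statements) are assumptions made for this illustration and are much simpler than Clang's real interfaces.

```cpp
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Toy AST node with a Create factory, following the XXX::Create pattern.
struct ReturnStmt {
  int value;
  static std::unique_ptr<ReturnStmt> Create(int value) {
    return std::unique_ptr<ReturnStmt>(new ReturnStmt{value});
  }
};

// Toy token source: hands out pre-split tokens one at a time.
class Preprocessor {
public:
  explicit Preprocessor(std::vector<std::string> tokens)
      : tokens_(std::move(tokens)) {}

  // Returns the next token, or an empty string at end of input.
  std::string Lex() {
    return next_ < tokens_.size() ? tokens_[next_++] : std::string();
  }

private:
  std::vector<std::string> tokens_;
  std::size_t next_ = 0;
};

// Toy semantic analysis: validates the operand, then creates the node.
class Sema {
public:
  std::unique_ptr<ReturnStmt> ActOnReturnStmt(const std::string &literal) {
    if (literal.empty() ||
        literal.find_first_not_of("0123456789") != std::string::npos)
      throw std::runtime_error("expected an integer literal");
    return ReturnStmt::Create(std::stoi(literal));
  }
};

// Toy recursive-descent parser: one Parse method per construct.
class Parser {
public:
  Parser(Preprocessor &pp, Sema &sema) : pp_(pp), sema_(sema) {}

  // Lex a token, dispatch to the matching Parse method, repeat.
  std::vector<std::unique_ptr<ReturnStmt>> ParseTranslationUnit() {
    std::vector<std::unique_ptr<ReturnStmt>> ast;
    for (std::string tok = pp_.Lex(); !tok.empty(); tok = pp_.Lex()) {
      if (tok != "return")
        throw std::runtime_error("unexpected token: " + tok);
      ast.push_back(ParseReturnStatement());
    }
    return ast;
  }

private:
  // Syntax check for: 'return' <literal> ';' ('return' already consumed).
  std::unique_ptr<ReturnStmt> ParseReturnStatement() {
    std::string literal = pp_.Lex();
    std::unique_ptr<ReturnStmt> node = sema_.ActOnReturnStmt(literal);
    if (pp_.Lex() != ";")
      throw std::runtime_error("expected ';'");
    return node;
  }

  Preprocessor &pp_;
  Sema &sema_;
};
```

Feeding the tokens return, 1, ;, return, 42, ; through ParseTranslationUnit yields two ReturnStmt nodes with the values 1 and 42. A missing semicolon is reported by the Parser, while a non-numeric operand is reported by Sema, which mirrors the split between syntax and semantic checks.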

In the Next Chapter we will dive into the details of the AST. Stay Tuned!