Notes/Primer on Clang Compiler Frontend (3) : The Unique Basic Libraries, Data Structures, Tools, and LLVM Test Framework

Posted on Aug 5, 2024

Notes/Primer on Clang Compiler Frontend: The Unique Basic Libraries, Data Structures, Tools, and LLVM Test Framework

These are my notes on chapter 4 of the Clang Compiler Frontend by Ivan Murashko. (I’ve referened this book extensively, and a lot of the snippets here are from this book. I’d highly recommend buying it for a deeper dive: https://www.amazon.com/Clang-Compiler-Frontend-Understand-internals/dp/1837630984)

LLVM is written in C++, and while it actively utilizes functionality provided by the Standard Template Library (STL) it still contains a lot of internal implementations that are primarily aimed at optimizing performance. For example ‘llvm::SmallVector’ is similar to ‘std::vector’ but contains an internally optimized implementation.

Additionally, LLVM has introduced other development tools such as TableGen, a DSL (domain specific language) designed for structural data processing and LIT (LLVM integrated Tester). We will discuss those and more in this chapter.

So it’s important for our purposes to be familiar with these extensions, and we will explore some of them today!

LLVM Coding Style

LLVM adheres to specific code-style rules that aim to promote proficient C++ practices with a special focus on performance, LLVM prefers using methods from the STL, but also offers optimized versions that mirror those from the STL, like the ‘llvm:SmallVector’ and ‘std::vector’ that we mentioned above. Given a choice between an STL object/algorithm and its corresponding LLVM version, the LLVM coding standard advises favoring the LLVM version.

There are Additional rules that pertain to concerns regarding performance issues, for example, both run-time type information (RTTI) and C++ exceptions are disallowed, however there are situations where RTTI could prove beneficial, thus LLVM offers alternatives like ‘llvm::isa<>’ (which we will discuss later) and other similar helper functions. Instead of C++ exceptions, LLVM frequently employs C-style ‘assertS’.

Sometimes though, asserts are not sufficiently informative. LLVM recommends adding textual messages to them to simplify debugging, here is an example from Clang’s code:

In this code, we check if the second parameter (RD) is a union and raise an assert if it’s not.

Besides performance considerations, LLVM also introduces some additional requirements. One of these requirements is about Comments.

Code comments are very important, and LLVM and Clang have comprehensive documentation, and Doxygen (https://www.doxygen.nl/) is the de facto standard used for commenting in C/C++ programs.

Clang and LLVM are not monolithic pieces of code; instead, they are implemented as a set of libraries. This design provides advantages in terms of code and functionality reuse, so let’s examine some of these libraries in detail.

LLVM basic libraries

We are going to discuss RTTI replacement in LLVM code, basic containers and smart pointers, and some important classes used to represent token locations and how diagnostics are implemented in Clang.

RTTI replacement and cast operators.

LLVM has introduced several helper function that replace RTTI counterparts, the fundamental ones are:

llvm::isa<>: it returns ‘true’ or ‘false’ depending on whether the reference to the tested object belongs to the tested class or not.
llvm::cast<>: cast operator (use it only when you are certain that the object is of the specified derived type, if the cast fails, the program will abort)
llvm::dyn_cast<>: the most used casting operator in LLVM, we use it for safe downcasting when you anticipate the cast will usually succeed, but there’s some uncertainty. If the object isn’t of the specified derived type, it will return nullptr.

These cast operators don’t accept nullptr as input, however these two do (if the cast fails they return nullptr):

llvm::cast_if_present<>: llvm:cast<> variant that accepts nullptr values.
llvm::dyn_cast<>: llvm:dyn_cast<> variant that accepts nullptr values!

So now you might have a question, how can a dynamic_cast work without RTTI? This can be illustrated by the following example:

We will begin with a base class, ‘clangbook::Animal’ that has two descendants: ‘clangbook::Horse’ and ‘clangbook::Sheep’. Each horse can be categorized by its speed and each sheep by its wool mass:

This code should output: Animal is a Horse and the horse speed is: 10mph. We use llvm::dyn_cast<> to cast the base class to clangbook::Horse and call a method specific to that class.

Let’s look at the classes implementations to figure out how the RTTI alternative works, let’s start with the base class:

The Enum is the vital part in this code, as it specifies the different “kinds” of animals. One enum value is used for the horse (AK_Horse) and another for the sheep (AK_Sheep); so, the base class has some knowledge about its descendants. Now let’s look at the implementations for clangbook::Horse and clangbook::Sheep:

You see the class of static method? This method is crucial for the cast operators in LLVM. a typical implementation might look like this:

The same mechanism above can be applied to other cast operators.

More Powerful Containers

The LLVM ADT (Abstract Data Type) library offers a set of containers, some of which are unique to LLVM, others are replacements for containers from the STL. Let’s explore some of them here:

String Operations:

The primary class for working with strings in the STL is ‘std::string’, It has some performance related issues. A significant issue concerns the copy operation. Since copying strings is a common operation in compilers, LLVM introduced a specialized class, ‘llvm::StringRef’, that handles this operation efficiently without using extra memory. This class is comparable to ‘std::string_view’ and ‘std::span’.

The ‘llvm::StringRef’ class maintains a reference to data, which doesn’t need to be null-terminated like traditional C/C++ strings. It basically holds a pointer to a data block and that block’s size, making the object effective size 16 bytes. Because ‘llvm::StringRef’ retains a reference rather than the actual data, it must be constructed from an existing data source. This class can be instantiated from basic string objects such as ‘const char*’, ‘std::string’, and ‘std::string_view’. The default constructor creates an empty object. A typical usage example is this:

The output for the code above is:

Original StringRef: Hello, LLVM!

Substring: Hello

Another class used for string manipulation in LLVM is ‘llvm::Twine’, which is particularly useful when concatenating several objects into one. A typical example for this class is:

The output is going to be: Hello, Twine!

Another class that is widely used for string manipulations is ‘llvm::SmallString<>’. It represents a string that is stack-allocated up to a fixed size, but can grow beyond that size (so it heap-allocates memory). This is a blend between the space efficiency of stack allocation and the flexibility of heap allocation!

The advantage of using ‘llvm::SmallString’ is that for many scenarios, especially in compiler tasks, strings tend to be small and fit within the stack-allocated space, but in situations where a larger string is required, you can also accommodate by switching to heap memory. Here is an example:

No heap allocation happens here, but if you were to append a string and you go beyond the 20 characters allocated for the object, you can still go ahead with it because it will switch to heap memory.

Now let’s talk about sequential containers!

Sequential Containers

LLVM recommends some optimized replacements for arrays and vectors from the STL, the most notable are:

‘llvm::ArrayRef<>’: A helper class designed for interfaces that accept a sequential list of elements for read-only access (Like ‘llvm::StringRef<>’ it doesn’t not own the underlying data, in fact, classes with ‘Ref’ in the name are usually read-only).
‘llvm::SmallVector<>’: an optimized vector for cases with small size, it resembles ‘llvm::SmallString’ where the size of the array isn’t fixed, and if the number of elements stay below the template argument, then there is no need for additional memory allocation

Let’s examine the ‘llvm::SmallVector<>’ class:

The vector is initialized with size 10, we add 10 elements using the push_back method. But in line 5 when we add an 11th element we exceed the allocated memory, and thus we trigger additional memory allocation, and the good thing about this container’s design is that it minimizes memory allocation for small objects, while maintaining the flexibility to accommodate larger sizes when necessary

Map-like Containers

The STL provides several containers for storing (key-value) data. The most common ones are ‘std::map<>’ for general-purpose maps and ‘std::unordered_map<>’ for hash maps. Here are the LLVM alternatives:

‘llvm::StringMap<>’: A map that uses strings as keys, this is more performance optimized than it’s equivalent, ‘std::unordered_map<std::string, T>. It doesn’t store a copy of the string key, instead, it keeps a reference to the string data, so it’s important to ensure the string data outlives the map to prevent undefined behavior.
‘llvm::DenseMap<>’: A map designed to be more memory and time efficient than ‘std::unordered_map<>’.
‘llvm::DenseMap<>’: Think of it as DenseMap<> but for small map sizes, also switches to heap allocation when the map exceeds a predefined size.
‘llvm::MapVector<>’: This container retains the insertion order (Think Python’s OrderedDict). It is implemented as a blend of ‘std::vector’ and DenseMap or SmallDenseMap

(Note: These containers utilize a quadratically proved hash-table mechanism, which is effective for hash collision resolution because the cache isn’t recomputed during element lookups., this is crucial for performance-critical applications, like compilers!)

Smart Pointers

Different smart pointers are found in LLVM Code, the most popular ones are ‘std::unique_ptr<>’ and ‘std::shared_ptr<>’. LLVM provides supplementary classes to work with smart pointers, the most prominent one being ‘llvm::IntrusiveRefCntPtr<>’. This smart pointer is designed to work with objects that support intrusive reference counting (Where the number of references of objects is tracked within that object’s implementation). Unlike ‘std::shared_ptr’ which maintains its own control block to manage the reference count, IntrusiveRefCntPtr expects the object to maintain its own reference count. Here is a typical usage example:

As you can see, The smart pointer employs the CRTP! The CRTP is essential for the ‘Release’ operation when the reference count drops to 0 and the object must be deleted

We can perform the cast in line 6 because the type is known (and thus we can ensure that the pointer is cast to the derived type, and not the base class, before deleting.), and that is why we used CRTP!

Clang Basic Libraries

Clang is a compiler frontend, and its most important operations are related to diagnostics, which requires precise information about positions in the source code. So Let’s explore the basic classes that Clang provides for these operations.

SourceManager and SourceLocation

Consider an error message from a Program:

The Following information is required to display the message:

Filename
Line in the file
Column in the file

The data structure that stores this information should be as compact as possible because the compiler uses it frequently. Clang stores this information in the ‘clang::SourceLocation’ object.

We can check the size of the object using lldb. For instance, if we run clang under the debugger we can determine the size as follows:

The information is encoded using a single ‘unsigned long’ number. Huh? The number only identifies a position in the text (program) file, an additional class is required to correctly extract and represent this information, which is ‘clang::SourceManager’.

The ‘SourceManager’ Object contains all the details about a specific location, which can be challenging given all the macros, includes, and other preprocessing directives.

There are several ways to interpret a given source location, here are the primary ones:

Spilling Location: Refers to the location where something was actually Spelled out in the source. So if you have a source location pointing inside a macro body, the spelling location will give you the location in the source code where the contents of the macro are defined.
Expansion Location: in the example of the macros above, this will instead give you the location in the source code where the macro was used

Diagnostics support

Clang’s diagnostic subsystem is responsible for generating and reporting warnings, errors, and other messages. Here are the main classes involved:

‘DiagnosticsEngine’: Manages diagnostic IDs and options
‘DiagnosticConsumer’: Abstract base class for diagnostic consumers
‘DiagnosticIDs’: Handles the mapping between diagnostic flags and internal IDs
‘DiagnosticInfo’: Represents a single diagnostic

Here is an example of how you can emit a warning in Clang:

Here is an example using ‘clang::TextDiagnosticPrinter’, which formats and prints the processed diagnostic messages:

The code above will produce this output: warning: This is a custom warning. Here we set ‘DiagnosticEngine’ with ‘TextDiagnosticPrinter’ as its ‘DiagnosticConsumer’, We then use the ‘Report’ method of ‘DiagnosticEngine’ to emit a custom warning.

LLVM Supporting Tools

The LLVM project has its own tooling support, with the most important LLVM tools being ‘TableGen’ and ‘LIT’ (LLVM Integrated Tester). We will look at them with examples to further clarify their purpose and how they can be used.

TableGen

TableGen is a domain-specific language (DSL) and a tool used in LLVM for the purpose of describing and generating tables (especially those that describe a target architecture). This is useful for compiler infrastructure, where one needs to describe things such as instruction sets, registers, and various other target-specific attributes in a structured manner.

TableGen is employed in various parts of the Clang compiler. It’s often used where there’s a need to generate large amounts of similar code. For instance, it can be used for supporting cast operations that require extensive enum declarations in basic classes, or in the diagnostic subsystem where code generation is required to handle numerous similar diagnostic messages. We will examine how TableGen functions within the diagnostic system as an example.

We will begin with the Diagnostic.td file, which describes Clang’s diagnostics. This file can be found at ‘clang/include/clang/Basic/Diagnostic.td’. Let’s examine how diagnostic severity is defined

As you can see, we defined a class for severities, Where each severity is associated with a string, as shown below:

Here we define the different severities, later we use the severity (Warning for example) to define the ‘Diagnostic’ class.

Using the warning class definition, different instances of the class can be defined. For example, the following is an instance that defines an unused parameter warning located in ‘DiagnosticSemaKinds.td’:

The clang-tblgen tool will generate the corresponding ‘DiagnosticSemaKinds.inc’ file:

This file retains all the necessary information about the diagnostic. This information can be retrieved from the Clang source code using different definitions of the ‘DIAG’ macro.

For instance, the following code uses the TableGen-generated code to extract the diagnostic descriptions as found in ‘clang/lib/Basic/DiagnosticIDs.cpp’

The C++ preprocessor will then expand to the following:

This example demonstrates how TableGen can be used to generate code in Clang and how it can simplify Clang development. The diagnostic subsystem is not the only area where TableGen is utilized; For example, The macros used in different types of AST visitors also rely on the code generated by TableGen!

LLVM Test Framework

LLVM uses several testing frameworks depending on the type of testing. The primary ones are LIT and GTest (Google Test).

LIT is primarily used for testing the behavior of the Clang toolchain as a whole, with a focus on the code compilation and diagnostics.
GTest is utilized for unit tests, targeting specific components of the code base, primarily utility libraries and internal data structures.

(Note: For now, we will not delve into GTest as its commonly used outside LLVM and isn’t part of LLVM itself, but I might write a blogpost about it sometimes. Take a look at this: https://github.com/google/googletest)

We will focus on LIT. LIT is designed to be lightweight and is tailored for the needs of compiler testing. Its commonly used for running tests that are basically shell scripts, often with checks for specific patterns in the output. A typical LIT test may consist of a source code file along with a set of “RUN” commands that specify how to compile, link, or process the file, and what output to expect.

The RUN commands often use ‘FileCheck’, another utility in LLVM, to check the output against expected patterns. In Clang, LIT tests are often used to test frontend features such as parsing, semantic analysis, code generation, and diagnostics. These tests typically look like source code files with embedded comments to indicate how to run the test and what to expect (Which is pretty awesome I have to say!).

Consider this example from ‘clang/test/Sema/attr-unknown.c’:

The example is a typical C source file that can be processed by Clang. LIT’s behavior is controlled by comments within the source text. The first comment (on Line 1) specifies how the test should be executed. Clang should be started with some arguments (-fsyntax-only and -verify). There are also substitutions that begin with the ‘%’ symbol, notably ‘%s’, which is replaced by the source file’s name. LIT will also examine comments beginning with ‘expected-warning’ and ensure that the warnings produced by Clang’s output match the expected values.

We can run the test as follows:

(We run LIT from the build folder because the tool is not included in the installation procedure, and we will learn more about LIT setup and invocation once we move on to the next step: Creating our test clang plug in project, and configuring LIT tests for it. We are going to use almost everything we learned so far!).