Notes/Primer on Clang Compiler Frontend (2) : The Clang AST
Notes/Primer on Clang Compiler Frontend: The Clang AST
These are my notes on chapter 3 of the Clang Compiler Frontend by Ivan Murashko. (I’ve referened this book extensively, and a lot of the snippets here are from this book. I’d highly recommend buying it for a deeper dive: https://www.amazon.com/Clang-Compiler-Frontend-Understand-internals/dp/1837630984)
So far we discussed LLVM, Clang, and Clang’s Backend; now we are going to discuss Clang’s Abstract Syntax Tree. So Clang offers sophisticated tools for searching AST Nodes, these tools are implemented using Domain Specific Language (DSL for short).
We will cover the following topics:
- Basic Blocks used to construct the AST
- How the AST can be traversed.
- The Recursive Visitor
- AST matchers
- Clang-query
- Compilation Errors’ Impact on AST
Basic Blocks used to construct the AST
The AST is usually depicted as a tree (But it’s not implemented as a true tree! The presence of backward edges makes it more like a graph than a tree), with its leaf nodes corresponding to various objects ( such as function declarations and loop bodies). Typically the AST represents the result of syntax analysis i.e., parsing. Clang’s AST nodes were designed to be immutable. Since the design requires that the Clang AST Store results from semantic analysis, the Clang AST represents the outcomes of both syntax and semantic analyses.
Typical tree structure implemented in C++ has all nodes derived from a base class. Clang uses a different approach. It splits different C++ constructions into separate groups with basic classes for each of them:
Statements: ‘clang::Stmt’ is the basic class for all statements, like ‘if’ statements ‘clang::ifStmt class’ as well as expressions and other C++ constructions.
Declarations: ‘clang::Decl’ is the base class for declarations. This includes variables, typedef, function, struct, etc.. There is also a separate base class for declarations with context (Declarations that might contain other declarations). The class is called ‘clang::DeclContext’ and can be accessed using the ‘clang::DeclContext::decls’ method. Translation units ‘clang::TranslationUnitDecl class’ and namespaces ‘clang::NamespaceDecl class’ are examples of declarations with context.
Types:
C++ is a statically typed language, which means Variable Types have to be declared at compile time, Which makes Types an important part of semantic analysis. ‘clang::Type’ is the basic class for types in Clang.
Types might have qualifiers (Think ‘const’ and ‘volatile’) and Clang has a special class to support a type with a qualifier ‘clang::QualType’, which is a pair of a pointer to ‘clang::Type’ and a bit mask with information about the type qualifier. The class has a method to retrieve a pointer to the clang::Type and check different qualifiers. (Note: clang::QualType has ‘operator->()’ and ‘operator*()’ implemented, that is it can be considered as a smart pointer for the underlying ‘clang::Type’ class.
The Type can also have additional information that represents different memory address models. For example, there can be a pointer to an object or reference.
- clang::Type::isPointerType() for pointer type check
- clang::Type::isReferenceType() for reference type check.
Types in C/C++ can also use aliases, using the ‘typedef’ or ‘using’ keywords.
Original types (‘int’ for example) are called canonical. You can check whether a type is canonical or not using ‘clang::QualType::isCanonical()’, you can also retrieve the canonical type from an alias using ‘clang::QualType::getCanonicalType()’.
Now let’s talk about AST traversal!
AST traversal
The design of the AST should prioritize facilitating easy tree traversal. Clang takes a different approach to organizing the tree: The AST nodes don’t share a common ancestor class. So how is tree traversal organized in Clang? Three unique techniques:
- The Curiously Recurring Template Pattern (CRTP) for visitor class definition.
- Ad hoc methods tailored specifically for different nodes.
- Macros, which can be perceived as the connecting layer between the ad hoc methods and CRTP
We will use this program to explore AST traversal:
But first, let’s build our test tool. We are going to change ‘clang::DeclVisitor’, which aids in the creation of visitors for C/C++ declarations. We are going to be using the same CMake file from before, except we are adding the ‘clangAST’ library.
This is the Declvisitor tool, we can see that we use a custom frontend action specific to our project: ‘clangbook::declvisitor::FrontendAction’
This is the code for ‘FrontendAction.hpp’:
We have overridden the ‘CreateASTConsumer’ function from the ‘clang::ASTFrontendAction’ class to instantiate an object of our custom AST consumer class ‘consumer’.
This is the code for our consumer class, ‘Consumer.hpp’:
We can see that we created a sample visitor and invoked it using an overridden method ‘HandleTranslationUnit’ from the ‘clang::ASTConsumer’ class.
And here is the ‘Visitor.hpp’ file:
We can compile our program using the same commands from before:
Now we can use the program we mentioned before to explore the AST traversal:
Now let’s investigate the ‘Visitor’ class implementation in detail
Firstly, you will notice that we are deriving a class from a base class and using this derived class as the parameter for our base class… What?
This is a construct known as the Curiously Recurring Template Pattern, or CRTP for short**.** This allows Visitor to inherit functionality from DeclVisitor that requires DeclVisitor to know the derived class type (Allowing us to resolve which method to call at compile-time instead of run-time, which would yield better performance than the computationally expensive virtual functions)
The Visitor class has several callbacks that are triggered when a corresponding AST node is encountered, the first callback targets the AST node representing a function declaration:
The function name is printed, then we utilize the ‘parameters()’ method from the ‘clang::FunctionDecl’ class, which is an ad hoc approach for AST traversal. Since each AST node provides its own methods to access child nodes, we have an AST node of type Function declaration here, so we call the Visit method of base class ‘clang::DeclVisitor’.
This call is then converts into another callback, for the’ clang::ParmVarDecl’ AST node:
So how is this conversion achieved? A combination of CRTP and C/C++ macros. We can see this in the ‘Visit()’ method implementation of the clang::DeclVisitor<> class.
The code showcases the use of the CRTP in Clang, in this context, CRTP is employed to redirect back to our ‘Visitor’ class, which is referenced as ‘ImplClass’.
The Code was generated using C/C++ macros, as demonstrated here:
‘NAME’ is replaced with the node name ‘ParmVarDecl’.
‘DeclVisitor’ is used to traverse C++ declarations, ‘StmtVisitor’ for statements, and ‘TypeVisitor’ for types. However, these visitors come with several issues. They can only be used with specific groups of AST nodes, ‘DeclVisitor’ with ‘Decl’ class descendants for example. We are also required to implement recursion, which makes it possible to miss some parts of the recursion, for example, our code will not function correctly if the ‘max’ function declaration is specified inside a namespace, Which means we need to implement an additional visit method specifically for namespace declarations.
These issues are addressed by the recursive visitor.
Recursive AST Visitor
We will create the same program, but with a recursive visitor this time.
We will use the same CMake file but add the ‘clangAST’ library.
The main function:
Here is the custom FrontendAction:
We have overridden the ‘CreateASTConsumer’ function to instantiate an object of our custom AST consumer class ‘Consumer’.
Here is the implementation for that class:
And Here is the Visitor:
There are several changes compared to the code for ‘DeclVisitor’: The first is that recursion isn’t implemented, we’ve only implemented the callbacks that we need. So how is recursion controlled? Our callbacks now return a boolean result. It returns ‘false’ if the recursion should stop, and ‘true’ if the visitor should continue the traversal.
These techniques are not the only ways for AST Traversal, most of the tools that we will consider later will use a different approach based on..
AST Matchers
Ast Matchers provide another approach for locating specific AST nodes. They are particularly useful in linters when searching for improper pattern usage or in refactoring tools when identifying AST nodes for modification.
We will create a simple program to test AST matchers. The program will identify a function definition with the name ‘max’. We will use a slightly modified CMake file.
This is the main function for our tool:
As you can see, we employ the ‘MatchFinder’ class and define a custom callback that outlines the specific AST node we intend to match.
This is the implementation of the callback:
We can see that each matcher is identified by an ID ‘match-id’, the matcher is defined here:
The matcher seeks a function declaration that has a specific name, using ‘functionDecl()’ as seen in ‘matchesName()’. Here, we utilized a Domain-Specific-Language (DSL) to specific the matcher. The DSL is implemented using C++ macros.
The DSL for matchers is typically employed in custom Clang tools, such as clang-tidy, but it can be used as a standalone tool.
‘Clang-query’ can be used to search for specific AST nodes in analyzed C++ code.
You can build and install this tall using:
We are going to test it on this file, let’s call it ‘minmax’:
We can run the tool as follows:
This is the default output, referred to as ‘diag’, but an important output is ‘dump’, which will display the located AST node, which will be useful to investigate compilation errors.
Clang Error Compilation
Error Processing in clang encompasses error detection, displaying error messages and potential error recovery. Error Recovery is particularly interesting: it occurs when Clang doesn’t halt upon encountering a compilation error but continues to compile to detect additional issues!
This is beneficial for various reasons:
- It allows us to be informed about as many errors as possible in a single compilation run.
- IDEs offer navigation support coupled with an integrated compiler. We will explore ‘clangd’ as one such tool. Most errors are confined to specific sections of the code, and it might be suboptimal to cease navigation in such cases
Clang employs various techniques for error recovery. For the syntax stage of parsing it utilizes heuristics (For example, if a user forgets a semicolon, Clang may attempt to add it)
The recovery phase can be abbreviated as DIRT:
- D for Delete a character (an extra semicolon for example)
- I for Insert a character (see above)
- R for Replace (which replaces a character to match a particular token)
- T for Transpose (rearranging two characters to match a token).
Clang performs full recovery if possible, but when full recovery isn’t possible, Clang implements unique techniques to manage recovery while AST is created.
Consider the program below which is syntactically correct but has a semantic error:
Let’s examine the AST result produced by Clang using ‘clang-query’:
Lets setup the output to produce the AST and search for return statements that might be helpful:
We identify two return statements:
As we can see the second match has a compilation error, Clang has inserted a special type of AST node: ‘RecoveryExpr’. (It’s worth noting that in certain situations, Clang might produce an incomplete AST. which can cause issues with Clang tools, such as clang-tidy lint, which we will explore later).
In the next chapter we will discuss the basic libraries used in Clang/LLVM development.