Know_Your_Compilers
Compiler Stages with Clang
If you have just started your journey in compiler development, you probably have a basic understanding of how a compiler works and know it isn't a mysterious black box. But have you ever wondered what each stage in the compiler pipeline actually produces?
In this guide, we’ll break down each stage of the compiler, using Clang as our tool. You’ll see how source code is transformed, step by step, from human-readable code to machine code. This post will walk you through the commands you can use to visualize the output at each stage.
Table of Contents
1. Lexical Analysis (Tokenization)
2. Syntax Analysis (Parsing)
3. Semantic Analysis
4. Intermediate Representation (IR) Generation
5. Optimization
6. Machine Code Generation
7. Object Code Generation
8. Linking
9. Full Example Workflow
Compilation Stages with Clang
1. Lexical Analysis (Tokenization)
The first stage of compilation is lexical analysis, where the source code is broken down into tokens. Tokens are the smallest meaningful units of the language, such as keywords, identifiers, literals, and operators.
To see what your code looks like at this stage, use the following command with Clang:
clang -fsyntax-only -Xclang -dump-tokens <source_file.c>
Example:
Let’s take a simple example.c file:
int x = 5 + 2;
Run the following command:
clang -fsyntax-only -Xclang -dump-tokens example.c
Output:
int 'int' [StartOfLine] Loc=<example.c:1:1>
identifier 'x' [LeadingSpace] Loc=<example.c:1:5>
equal '=' [LeadingSpace] Loc=<example.c:1:7>
numeric_constant '5' [LeadingSpace] Loc=<example.c:1:9>
plus '+' [LeadingSpace] Loc=<example.c:1:11>
numeric_constant '2' [LeadingSpace] Loc=<example.c:1:13>
semi ';' Loc=<example.c:1:14>
eof '' Loc=<example.c:2:1>
Here, the code has been split into tokens such as int, identifier, and numeric_constant.
2. Syntax Analysis (Parsing)
Next is syntax analysis, where the parser checks that the tokens follow the language's grammatical rules and builds an Abstract Syntax Tree (AST) from them. The AST represents the hierarchical structure of the program.
To generate and view the AST in Clang, run:
clang -Xclang -ast-dump -fsyntax-only <source_file.c>
Example:
For the same example.c file:
clang -Xclang -ast-dump -fsyntax-only example.c
Output (simplified):
TranslationUnitDecl 0x13e838808 <<invalid sloc>> <invalid sloc>
|-TypedefDecl 0x13e839338 <<invalid sloc>> <invalid sloc> implicit __NSConstantString 'struct __NSConstantString_tag'
| `-RecordType 0x13e839110 'struct __NSConstantString_tag'
|   `-Record 0x13e839088 '__NSConstantString_tag'
|-TypedefDecl 0x13e8393a8 <<invalid sloc>> <invalid sloc> implicit __builtin_va_list 'void *'
| `-PointerType 0x13e839020 'void *'
|   `-BuiltinType 0x13e838860 'void'
`-VarDecl 0x13e839418 <example.c:1:1, col:13> col:5 x 'int' cinit
  `-BinaryOperator 0x13e839508 <col:9, col:13> 'int' '+'
    |-IntegerLiteral 0x13e8394c8 <col:9> 'int' 5
    `-IntegerLiteral 0x13e8394e8 <col:13> 'int' 2
In this output, VarDecl represents the variable declaration (int x), and BinaryOperator shows the addition operation (5 + 2).
3. Semantic Analysis
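During semantic analysis, the compiler checks for meaning-related errors, such as type mismatches or invalid operations. In Clang this work is interleaved with parsing, which is why the AST dump above already carries type information such as 'int'. Clang performs these checks on every compile and emits errors or warnings as it finds them.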
To check for semantic issues, use:
clang -fsyntax-only -Wall <source_file.c>
Example:
clang -fsyntax-only -Wall example.c
Clang will print any semantic errors or warnings it detects during this stage.
4. Intermediate Representation (IR) Generation
In this stage, the compiler generates an Intermediate Representation (IR), a lower-level form of your code that is largely independent of the target architecture.
To view the LLVM IR generated by Clang, use:
clang -S -emit-llvm <source_file.c> -o <output_file.ll>
Example:
clang -S -emit-llvm example.c -o example.ll
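Output (example.ll, simplified):
@x = global i32 7, align 4
This output represents the code in LLVM's intermediate form, which is largely target-independent and suitable for further optimization and transformation. Because example.c contains only a global variable, the IR holds a single global definition and no functions, and the constant expression 5 + 2 has already been folded to 7. The real file also contains module metadata such as the target triple and data layout, and on some platforms the global carries an extra marker such as dso_local.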
5. Optimization
Once IR is generated, the compiler applies various optimizations to make the code more efficient. Clang allows you to apply different optimization levels, like -O1, -O2, or -O3.
To generate optimized LLVM IR, run:
clang -S -emit-llvm -O2 <source_file.c> -o <output_file_opt.ll>
Example:
clang -S -emit-llvm -O2 example.c -o example_opt.ll
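Compare the optimized IR with the unoptimized IR to see the transformations that the compiler applied. For a file as small as example.c the two will be nearly identical; the differences become visible once the file contains functions.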
6. Machine Code Generation
At this stage, the compiler lowers the IR to assembly code, the textual form of machine code, which is specific to the target architecture.
To generate the assembly code, use:
clang -S <source_file.c> -o <output_file.s>
Example:
clang -S example.c -o example.s
Output (example.s):
.section __TEXT,__text,regular,pure_instructions
.macosx_version_min 10, 19
.syntax unified
.section __DATA,__data
.globl _x @ @x
.p2align 2
_x:
.long 7 @ 0x7
.subsections_via_symbols
Explanation
a> .section __TEXT,__text,regular,pure_instructions
Declares the text section of the binary, where the executable instructions (code) reside. The section name __TEXT,__text is specific to the Mach-O format used by macOS. The attributes regular and pure_instructions describe the section: regular means it has no special section type, and pure_instructions means it contains only code, with no data mixed in.
b> .macosx_version_min 10, 19
Specifies the minimum macOS version required to run the binary, written here as 10.19. The assembler embeds this value in the object file. It comes from the deployment target in effect when the file was compiled, so your own output will likely show a different version.
c> .syntax unified
Selects the unified assembly syntax, a directive used on ARM targets so that one syntax covers both ARM and Thumb instructions. Together with the @ comment marker in this listing, it suggests the output was generated for a 32-bit ARM target; x86 output does not contain this directive.
d> .section __DATA,__data
Declares the data section where global or static variables are stored. In macOS, __DATA,__data is used for variables or initialized data in the Mach-O binary format.
e> .globl _x
Declares _x as a global symbol, which makes it visible to other object files at link time. The leading underscore is the prefix Mach-O toolchains add to C symbol names, so _x is the C variable x.
f> .p2align 2
Aligns the next data (in this case, _x) to a 4-byte boundary. The argument 2 means alignment to 2^2 = 4 bytes. Alignment ensures better memory access performance for variables or instructions.
g> _x:
This is a label representing the location of the global variable x. The value for x will be stored in memory at this location.
h> .long 7
This directive emits the value 7 as a 4-byte (32-bit) integer at the location labeled _x. In hexadecimal, 7 is represented as 0x7. It corresponds to int x = 5 + 2; in example.c, with the sum already computed at compile time.
i> .subsections_via_symbols
A Mach-O-specific directive telling the linker that the sections of this object file may be split into blocks at symbol boundaries. This lets the linker discard blocks that nothing references.
Summary
This assembly code declares a global variable x in the data section, initialized to 7. It also sets up necessary alignment and version information for the binary, ensuring compatibility with macOS. The .subsections_via_symbols allows fine-grained linking optimizations in macOS binaries.
Assembly is the human-readable, textual form of the machine code that the next stage encodes into binary.
7. Object Code Generation
Next, the compiler generates object code, which is binary machine code that is not yet linked into an executable.
To generate the object file, run:
clang -c <source_file.c> -o <output_file.o>
Example:
clang -c example.c -o example.o
The output is a binary object file (example.o), which contains the machine code.
8. Linking
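The final stage is linking, where the object files and libraries are combined to produce an executable. Linking needs an entry point, so the program must define a main function; the one-line example.c used so far defines only the global x, and linking it on its own fails because the symbol main is undefined.
To create the executable file, run the following command, which performs every earlier stage and then links: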
clang <source_file.c> -o <executable_name>
Example:
clang example.c -o example
This will create an executable file named example.
Full Example Workflow
Here’s a quick recap of all the commands to walk through the full process of compiling a simple C program:
Tokenization:
clang -fsyntax-only -Xclang -dump-tokens example.c
AST Generation:
clang -Xclang -ast-dump -fsyntax-only example.c
LLVM IR Generation:
clang -S -emit-llvm example.c -o example.ll
Optimized LLVM IR:
clang -S -emit-llvm -O2 example.c -o example_opt.ll
Assembly Generation:
clang -S example.c -o example.s
Object Code Generation:
clang -c example.c -o example.o
Executable Generation:
clang example.c -o example
Conclusion
This tutorial has shown how each stage of the compiler pipeline transforms your source code, from tokenization to executable generation. By using Clang's tools and commands, you can peek under the hood of the compiler and understand how your code evolves from a high-level language into machine code. Whether you are a beginner or someone looking to gain a deeper understanding of compilers, these insights can help you debug and optimize your code more effectively.