Now that we have a basic tokenizer, the next step in the compilation process is parsing into the Abstract Syntax Tree! This is one of the harder areas of the compilation, especially if you’ve never written a parser before. I would strongly recommend reading about recursive descent parsing. Before we talk about the parsing process however, let’s look at the AST.

The Abstract Syntax Tree

The first thing to realize about the abstract syntax tree is that it is simply a cluster of Nodes. Each Node may or may not point to other nodes. It’s fairly easy to represent the AST as a collection of Polymorphic nodes that inherit from the base ASTNode.

struct ASTNode {
        virtual ~ASTNode() = default;
        static auto parse(TokenIterator &it) -> std::unique_ptr<ASTNode>;
        virtual auto codegen() -> Value * = 0;
};

This structure defines our ASTNodes as a base class with two methods, parse, and codegen(). The parse() method will take in an iterator over the Tokens and return a unique pointer to a Node. The codegen() method will output some functional value (LLVM’s Intermediate Representation Values) that we can use to piece together our code.

For example, a Function is composed of two nodes, a Prototype (which needs to generate its own IR) and a Body, which is a block of ASTNodes.

struct FunctionASTNode : public ASTNode {
        std::unique_ptr<PrototypeASTNode> proto;
        std::unique_ptr<BlockASTNode> body;
        FunctionASTNode(std::unique_ptr<PrototypeASTNode> proto,
                        std::unique_ptr<BlockASTNode> body)
                : proto(std::move(proto))
                , body(std::move(body))
        {
        }

        static auto parse(TokenIterator &it) -> std::unique_ptr<ASTNode>;
        auto codegen() -> Value * override;
};

Parsing

By implementing this format, we make the creation of a parser pretty much trivial from the standpoint of actually running “parse”. We just need to define a list of ASTNodes (e.g. a module / program) and parse until we find an EOF token.

auto parseCodeModule(std::vector<Token> &tokens) -> std::unique_ptr<CodeModule> {
        auto module = std::make_unique<CodeModule>();
        auto it = tokens.begin();
        while (it != tokens.end()) {
                module->nodes.push_back(ASTNode::parse(it));
        }

        return module;
}

Now each ASTNode runs through a parse tree starting at the base ASTNode.

auto ASTNode::parse(TokenIterator &it) -> std::unique_ptr<ASTNode>
{
        switch (it->type) {
        case TokenType::Extern:
                return ForeignFunctionASTNode::parse(it);

        case TokenType::Fn:
                return FunctionASTNode::parse(it);

        case TokenType::Eof:
                it++;
                return nullptr;

        default:
                throw std::runtime_error("Unexpected token at " +
                                         std::to_string(it->line) + ":" +
                                         std::to_string(it->column));
        }
}

Each individual ASTNode has a more detailed parse function depending on what was found. For example foreign functions defined by extern fn name(arg: type) rtype; will call a helper method called expect that throws an error and position if a syntactic violation has been committed.

auto expect(TokenIterator &it, TokenType type) -> void
{
        if (it->type != type) {
                throw std::runtime_error("Expected '" + getTokenName(type) +
                                         "' at " + std::to_string(it->line) +
                                         ":" + std::to_string(it->column));
        }
}

auto ForeignFunctionASTNode::parse(elyra::TokenIterator &it) -> std::unique_ptr<ASTNode>
{
        expect(it, TokenType::Extern); // EXTERN
        it++;

        expect(it, TokenType::Fn); // FN
        it++;

        //...
}

This greatly simplifies the parsing process and this is the basis of how recursive descent parsers tend to work. One of the best parts is that you can make very specific errors to be helpful to the user such as Error: Expected '{' at 7:23. These errors will get a bit more difficult in semantic analysis, but at our current stage let’s not worry about that.

Most of the parsing is a bit straight-forward — you’re just trying to fill structures with relevant data and parsing the syntax of the language. I think the most complex area is the code handling binary expressions. Right now we’re only implementing Add and Sub so this doesn’t handle order of operations just yet — but ExprASTNode was extremely difficult to get just right. I can only imagine adding order of operations, explicit parentheses, and more to become even more difficult.

auto ExprASTNode::parse(TokenIterator &it) -> std::unique_ptr<ASTNode>
{
        std::unique_ptr<ASTNode> lhs;

        if(it->type == TokenType::Number) {
                lhs = std::make_unique<NumberASTNode>(std::stoi(it->value));
                it++;
        } else if(it->type == TokenType::Ident) {
                if((it + 1)->type == TokenType::LParen) {
                        lhs = CallASTNode::parse(it);
                } else {
                        lhs = std::make_unique<VariableASTNode>(it->value);
                        it++;
                }
        } else {
                throw std::runtime_error("Unexpected token at " +
                                         std::to_string(it->line) + ":" +
                                         std::to_string(it->column));
        }

        while(it->type == TokenType::Add || it->type == TokenType::Sub) {
                TokenType op = it->type;
                it++;

                std::unique_ptr<ASTNode> rhs;

                if(it->type == TokenType::Number) {
                        rhs = std::make_unique<NumberASTNode>(std::stoi(it->value));
                        it++;
                } else if(it->type == TokenType::Ident) {
                        if((it + 1)->type == TokenType::LParen) {
                                rhs = CallASTNode::parse(it);
                        } else {
                                rhs = std::make_unique<VariableASTNode>(it->value);
                                it++;
                        }
                } else {
                        throw std::runtime_error("Unexpected token at " +
                                                 std::to_string(it->line) + ":" +
                                                 std::to_string(it->column));
                }

                lhs = std::make_unique<BinaryExprASTNode>(op, std::move(lhs), std::move(rhs));
        }

        return lhs;
}

This code is a bit hard to digest and could probably use some degree of abstraction — however the concept is finding a number, variable, or function call and checking the next element to see if it’s a binary operator. If it is, the next section is parsed as the right hand side of the equation. The equation is then combined into a BinaryExprASTNode and then reassigned as the left hand side and checked again to see if there’s any chaining going on. This then allows us to infinitely chain expressions (until you blow up the stack, but that seems unlikely). And at the very end it returns our expression.

Codegen

For now we’re going to export the LLVM context stuff to global space. We’ll create a method for initialization and define the constants.

llvm::LLVMContext context;
llvm::IRBuilder<> builder(context);
std::unique_ptr<llvm::Module> module;
std::unique_ptr<llvm::legacy::FunctionPassManager> fpm;
std::unordered_map<std::string, llvm::Value *> namedValues;

namespace elyra
{

void initContext()
{
        module = std::make_unique<llvm::Module>("elyra", context);
        fpm = std::make_unique<llvm::legacy::FunctionPassManager>(module.get());
        fpm->add(llvm::createInstructionCombiningPass());
        fpm->add(llvm::createReassociatePass());
        fpm->add(llvm::createGVNPass());
        fpm->add(llvm::createCFGSimplificationPass());
        fpm->doInitialization();
}
}

Let’s overview the context. First we need the global LLVM context. The IRBuilder needs to know the context. We then need to create a module and pass manager, alongside of a list of currently named values. This will probably need to be replaced with a Symbol Table in the future to understand and analyze code.

Each individual method then needs to call into the IRBuilder in order to build the LLVM IR.

For example here’s how PrototypeASTNode generates a function signature

auto PrototypeASTNode::codegen() -> llvm::Value *
{
        std::vector<llvm::Type *> argTypes(args.size());
        for (unsigned i = 0; i < args.size(); ++i)
                argTypes[i] = getType(args[i].second);

        llvm::FunctionType *FT = llvm::FunctionType::get(getType(returnType), argTypes, false);
        llvm::Function *F = llvm::Function::Create(
                FT, llvm::Function::ExternalLinkage, name, module.get());

        unsigned Idx = 0;
        for (auto &Arg : F->args())
                Arg.setName(args[Idx++].first);

        return F;
}

We first create an array of types and get the type information for each individual input type in the prototype arguments. After this stage we grab the return type and pass in the argument types to llvm::FunctionType::get() The last argument is if the function is variadic. The next is the creation of the Function Type — which in this case is Externally Linked. This is opposed to InternalLinkage which can allow some optimizations and inlining. Externally Linked code follows the C ABI, which allows us to link our toy language into C programs. The next part then goes through the function arguments and sets the names of variables. It’s not required but nice to have.

After we have a prototype, we can then generate the actual definitions. This consists of grabbing the prototype and then creating a Basic Block in the function with a label.

auto FunctionASTNode::codegen() -> llvm::Value *
{
        llvm::Function *function = module->getFunction(proto->name);

        if (!function) {
                function = static_cast<llvm::Function *>(proto->codegen());
        }

        if (!function) {
                return nullptr;
        }

        llvm::BasicBlock *BB =
                llvm::BasicBlock::Create(context, "entry", function);
        builder.SetInsertPoint(BB);
        namedValues.clear();

        for (auto &Arg : function->args()) {
                namedValues[Arg.getName().data()] = &Arg;
        }

        for (auto &node : body->body) {
                if (auto *expr = node.get()) {
                        expr->codegen();
                }
        }

        llvm::verifyFunction(*function);
        return function;
}

This method creates a function, inserts the named values and then calls codegen inside the Block. Then we call llvm::verifyFunction to find any errors in the IR code.

Once all code generation is done we can export the object file and compile to an output file using the code below.

auto outputObjectFile() -> void
{
        llvm::InitializeAllTargetInfos();
        llvm::InitializeAllTargets();
        llvm::InitializeAllTargetMCs();
        llvm::InitializeAllAsmParsers();
        llvm::InitializeAllAsmPrinters();

        auto targetTriple = llvm::sys::getDefaultTargetTriple();
        module->setTargetTriple(targetTriple);

        std::string Error;
        auto Target = llvm::TargetRegistry::lookupTarget(targetTriple, Error);

        if (!Target) {
                llvm::errs() << Error;
                return;
        }

        auto CPU = "generic";
        auto Features = "";

        llvm::TargetOptions opt;
        auto RM = std::optional<llvm::Reloc::Model>();
        auto TheTargetMachine = Target->createTargetMachine(targetTriple, CPU,
                                                            Features, opt, RM);

        module->setDataLayout(TheTargetMachine->createDataLayout());

        auto Filename = "output.o";
        std::error_code EC;
        llvm::raw_fd_ostream dest(Filename, EC, llvm::sys::fs::OF_None);

        if (EC) {
                llvm::errs() << "Could not open file: " << EC.message();
                return;
        }

        llvm::legacy::PassManager pass;
        auto FileType = llvm::CGFT_ObjectFile;

        if (TheTargetMachine->addPassesToEmitFile(pass, dest, nullptr,
                                                  FileType)) {
                llvm::errs()
                        << "TheTargetMachine can't emit a file of this type";
                return;
        }

        pass.run(*module);
        dest.flush();

        auto ret = system("clang output.o -o output");
        if(ret != 0) {
                llvm::errs() << "Failed to link object file\n";
                return;
        }

        delete TheTargetMachine;
}

Once this is done, we can make a program, compile it, and run it.
Here’s the test program:

extern fn putchar(i: i32) void;

fn main() void {
        putchar(72);
        putchar(105);
        putchar(33);
        putchar(10);
        return 0;
}

And there we go!