From Zero to Brainfuck in Zig (Part 1)

16 Jul, 2024

This article is the first part of a full code breakdown of my first ever Zig project.

I will be explaining exactly what is happening and how this project uses Zig's features.

Motivation
Structure
Tokenisation
Part 2

Motivation

I wanted to experiment with a different Brainfuck instruction representation and its effect on performance. It looks like so:

const Inst = struct {
    // Each BF cell is a wrapping u8 so don't need to store any high values
    add: i8 = 0,
    shift: i32 = 0,
    input: u8 = 0,
    output: u8 = 0,
    jez: ?*Inst = null,
    jnz: ?*Inst = null,
};

The theory behind it is that we fuse as many instructions as possible into this single struct. The order they will be executed is from the first field down, sequentially.

The benefits I had in mind:

No need to switch (kind), everything is in the same struct
Features run-length encoding so we can apply operations in bulk
No need to keep a stack for loops during runtime, we transform loops into jez and jnz, loops can be implemented with a branchless cmov.

Structure

The structure of a project determines how much you learn.

The main things I wanted to learn were enums & errors, new control flow constructs, and testing.

I figured the perfect way to go through all of them was to structure the project like so:

Lazy tokeniser
- Facilitates single-pass parsing
Basic state machine parser
- Won't handle jumps
- Has a single responsibility to fuse tokens
Meta-parser, wraps the above parser
- Handles jumps
- Again, single responbility
Only expose parse and interpret APIs

This structure also tailors itself well to test-driven development (TDD). The less responsibilities a function does the less complicated tests we need :)

Tokenisation

First, we define the Brainfuck tokens that our tokeniser will return:

const Token = enum {
    Incr,
    Decr,
    Right,
    Left,
    BegLoop,
    EndLoop,
    Input,
    Output,
};

Our tokeniser is lazy, which means that on each function call it returns the next token it encounters. Just one. And none if there aren't any left. From this description we can figure out that we want to return a ?Token.

We also need to think about the inputs to this function. It clearly needs state because it needs to know where it left off - maybe we can do something like this?

const Tokeniser = struct {
    some_state_var: ...,

    fn init(...) Tokeniser { ... }
    fn next(self: *Tokeniser) ?Token { ... }
};

No! There is no need! All we need is a single function that takes in []const u8 and return an update to that slice along with ?Token.

However, to simplify this even more, we can just return ?Token and instead modify the slice passed in - so to modify it, we need to make it a *[]const u8. We will call this a cursor.

fn consumeToken(cursor: *[]const u8) ?Token { ... }

This is the most minimal form (for non-sentinel ending). For example, if you want to read the file contents from a reader you can make a wrapper and pass in different slices.

The state will be encoded in the slice by advancing the beginning for everything we loop over.

Testing First

The best thing about writing the tests before defining function bodies is that we can do a sanity/taste test by looking at how our usage code looks.

test "consumeToken" {
    const expect = std.testing.expectEqual;

    // For each token, test skipping different amounts of "comments",
    // including no token following the last comment
    const program =
        \\+ + a+
        \\- - b-
        \\> > c>
        \\< < d<
        \\[ [ e[
        \\] ] f]
        \\, , g,
        \\. . h.
        \\whatever else
    ;
    var cursor: []const u8 = program;

    const lines: [8]Token = .{ .Incr, .Decr, .Right, .Left, .BegLoop, .EndLoop, .Input, .Output };
    inline for (lines) |expected| {
        inline for (0..3) |_| {
            try expect(expected, consumeToken(&cursor));
        }
    }

    // Just to be sure
    inline for (0..69) |_| {
        try expect(null, consumeToken(&cursor));
    }
}

\\ at the beginning of the lines mean it's a multi-line string.

inline for means the for loop gets unrolled at comptime. It basically copy-pastes the body of every iteration. This is technically unnecessary here and can bloat the binary and compile times, but for this small program it's negligable.

Now, when we implement consumeToken, we can run zig build test and see if our implementation is correct.

As for the API usage, it's extremely clean and minimal. Passes the taste test.

consumeToken

fn consumeToken(cursor: *[]const u8) ?Token {
    var copy = cursor.*;
    defer cursor.* = copy;

    while (copy.len > 0) {
        defer copy = copy[1..];

        switch (copy[0]) {
            '+' => return .Incr,
            '-' => return .Decr,
            '>' => return .Right,
            '<' => return .Left,
            '[' => return .BegLoop,
            ']' => return .EndLoop,
            ',' => return .Input,
            '.' => return .Output,
            else => {},
        }
    }

    return null;
}

Let's break it down:

var copy = cursor.*;
defer cursor.* = copy;

We want to make sure that we update the cursor for the user when we "consume", so why not just access & modify it directly?

It's because it would be an unnecessary double pointer indirection within a loop. A slice (e.g. []const u8) contains two fields: ptr and len. Now put that behind a pointer and you'd have to do slice.*.ptr.* to access a const u8.

So, conveniently, we can remove a layer of indirection by making a copy. And, even more conveniently, Zig allows us to guarantee that cursor will be updated before the function returns via the defer directive. All we need to do now is to update copy and the rest will be taken care of for us.

while (copy.len > 0) {
    defer copy = copy[1..];

    switch (copy[0]) {
        '+' => return .Incr,
        '-' => return .Decr,
        '>' => return .Right,
        '<' => return .Left,
        '[' => return .BegLoop,
        ']' => return .EndLoop,
        ',' => return .Input,
        '.' => return .Output,
        else => {},
    }
}

Here, we loop until we run out of bytes.

Continuation expressions wouldn't get run after a break or a return. We want to consume before we return, but we don't want code duplication, so we use defer.

It's worth mentioning that defers in Zig aren't like defers in Go. They don't queue up at runtime and run at function return, they are statically appended to the end of the scope they were invoked. However, like in Go, they execute reversed from the order they were invoked.

These facts guarantee in two ways that copy will be updated before it's copied back to cursor.

return null;

If we got here, it means our copy is now an empty slice, it is impossible for more tokens to be returned. We can return null with zero usage worries as optional types are first-class in Zig :)

When we run zig build test, it passes.

Part 2

Part 2 is available here.

Thanks for reading!

- @cachecrab

cache crab