cache crab

From Zero to Brainfuck in Zig (Part 2)

This is Part 2, following on from Part 1.

Table of Contents

  1. Fusing
  2. Joining Blocks Together
  3. Next Part

Fusing

We have the tokeniser, time to actually do something with the tokens. We need to fuse as many tokens as possible into a single instruction while preserving the order of operations.

Here's the instruction format from the last part as a reminder:

const Inst = struct {
    add: i8 = 0,
    shift: i32 = 0,
    input: u8 = 0,
    output: u8 = 0,
    jez: ?*Inst = null,
    jnz: ?*Inst = null,
};

consumeBlock's API

This parser will need to fill in the add, shift, input and output fields. jez and jnz will be filled in by the meta-parser.

As we did for consumeToken, we now have to think about the API for consumeBlock. We will need to take in a cursor: *[]const u8, and we will have to return an ?Inst.

However, when we return an ?Inst it's because we encountered a token that can't be fused. We know what that token is, and hence it would have advanced the cursor forward - we'd need to backtrack.

Backtracking would be fine, especially in this language where tokens are just 1 char wide, however if this was a language with varying width tokens (read: pretty much all languages) we'd need additional logic for the amount we have to backtrack. Plus lower perf from duplicate tokenisation.

So, let's not backtrack. Let's just return the token that was consumed from cursor but not consumed by our function.

const Block = struct {
    inst: Inst,
    next_tok: ?Token,
};

As we return next_tok, we should also offer a way for the caller to give it back to us for use before consuming through cursor. We'll do this through a prepend: ?Token parameter.

fn consumeBlock(cursor: *[]const u8, prepend: ?Token) ?Block { ... }

Testing consumeBlock

To write a test for consumeBlock we need to create a program that covers our fusion logic.

++- would fuse into .add = 1, this covers consecutive duplicates & inverses.

> would add a shift into that same Inst (.shift = 1).

A further --- would be in a different Inst as it gets executed top to bottom and we've already shifted; .add = -3.

<<<> -> .shift = -2,

... -> .output = 3,

, -> in a different Inst, .input = 1 - again, because .input is before .output and that was set above.

Here's how that test looks like:

test "consumeBlock" {
    const expect = std.testing.expectEqual;

    const program =
        \\++- > -
        \\--<<<>...,
    ;

    const expected = .{
        Inst{
            .add = 1,
            .shift = 1,
        },
        Inst{
            .add = -3,
            .shift = -2,
            .output = 3,
        },
        Inst{
            .input = 1,
        },
    };

    var cursor: []const u8 = program;
    var prepend: ?Token = null;

    inline for (expected) |inst| {
        const res = consumeBlock(&cursor, prepend).?;
        try expect(inst, res.inst);
        prepend = res.next_tok;
    }

    inline for (0..69) |_| {
        try expect(null, consumeBlock(&cursor, prepend));
    }
}

Note we always do prepend = res.next_tok;, that is necessary to not skip tokens.

consumeBlock

Here's the full function:

fn consumeBlock(cursor: *[]const u8, prepend: ?Token) ?Block {
    if (cursor.len == 0 and prepend == null) {
        return null;
    }

    var prepend_once = prepend;

    const ParseState = enum { Add, Shift, Input, Output };

    var state = ParseState.Add;

    var inst = Inst{};
    var token: ?Token = null;

    while (true) {
        token = prepend_once orelse consumeToken(cursor);
        prepend_once = null;

        switch (token orelse break) {
            .Incr => switch (state) {
                .Add => inst.add = fallibleAdd(inst.add, 1) orelse break,
                else => break,
            },
            .Decr => switch (state) {
                .Add => inst.add = fallibleSub(inst.add, 1) orelse break,
                else => break,
            },
            .Right => switch (state) {
                .Add => {
                    state = ParseState.Shift;
                    inst.shift = fallibleAdd(inst.shift, 1) orelse break;
                },
                .Shift => inst.shift = fallibleAdd(inst.shift, 1) orelse break,
                else => break,
            },
            .Left => switch (state) {
                .Add => {
                    state = ParseState.Shift;
                    inst.shift = fallibleSub(inst.shift, 1) orelse break;
                },
                .Shift => inst.shift = fallibleSub(inst.shift, 1) orelse break,
                else => break,
            },
            .Input => switch (state) {
                .Add, .Shift => {
                    state = ParseState.Input;
                    inst.input = fallibleAdd(inst.input, 1) orelse break;
                },
                .Input => inst.input = fallibleAdd(inst.input, 1) orelse break,
                else => break,
            },
            .Output => switch (state) {
                .Add, .Shift, .Input => {
                    state = ParseState.Output;
                    inst.output = fallibleAdd(inst.output, 1) orelse break;
                },
                .Output => inst.output = fallibleAdd(inst.output, 1) orelse break,
            },
            .BegLoop => break,
            .EndLoop => break,
        }
    }

    return Block{ .inst = inst, .next_tok = token };
}

Let's break it down:

if (cursor.len == 0 and prepend == null) {
    return null;
}

This isn't necessary, but I figured it's easier for the programmer to understand the function does nothing if you input nothing.

var prepend_once = prepend;

In Zig, parameters are always const. As prepend is an optional, via type inference prepend_once is also an optional. We will use this as a "once container" by setting it to null after usage.

const ParseState = enum { Add, Shift, Input, Output };

var state = ParseState.Add;

We define a ParseState type for us to track what the last field we set was. E.g. a < or > would set state = ParseState.Shift and then we'd know that we have to return if we encounter a +.

var inst = Inst{};
var token: ?Token = null;

while (true) {
    token = prepend_once orelse consumeToken(cursor);
    prepend_once = null;

If prepend_once is present, use it; otherwise, consume a token from cursor.

switch (token orelse break) {

If token is null, we break the loop, otherwise we switch on it. Note that we're switching on the token instead of our parse state.

.Incr => switch (state) {
    // If currently adding, increment the `add` field.
    .Add => inst.add = fallibleAdd(inst.add, 1) orelse break,
    else => break,
},

When we encounter a +, we can only add if we're in the .Add state to preserve instruction order.

I defined helper functions fallibleAdd and fallibleSub that allows us to use orelse if addition overflows:

inline fn fallibleAdd(a: anytype, b: anytype) ?@TypeOf(a, b) {
    const res, const overflow = @addWithOverflow(a, b);
    return if (overflow == 1) null else res;
}

inline fn fallibleSub(a: anytype, b: anytype) ?@TypeOf(a, b) {
    const res, const overflow = @subWithOverflow(a, b);
    return if (overflow == 1) null else res;
}

The rest of the big switch just advances the state corresponding to the latest token, breaking instead of going to a previous state.

Finally, we return what we parsed and the token that made us break.

return Block{ .inst = inst, .next_tok = token };

Joining Blocks Together

parse's API

We're finally implementing a user facing API: parse.

Couple things to keep in mind:

With that laid out, this is the API:

fn parse(program: []const u8, allocator: *ArenaAllocator, temp_alloc: Allocator) ParseError!?*Inst { ... }

Testing parse

The only new logic this function adds is setting Inst's .jez and .jnz.

Here's a test that confirms loops and Blocks chained together:

test "parse" {
    const testing = std.testing;

    const expect = testing.expectEqual;
    const assert = testing.expect;

    var inst_alloc = ArenaAllocator.init(testing.allocator);
    defer inst_alloc.deinit();

    const temp_alloc = testing.allocator;

    const res = try parse("++[->++<]--.", &inst_alloc, temp_alloc);
    const inst = res.?;

    try expect(2, inst.add);
    try assert(null != inst.jez);
    try assert(null != inst.jnz);

    const inner = inst.jnz.?;
    const end = inst.jez.?;

    try expect(-1, inner.add);
    try expect(1, inner.shift);
    try assert(null != inner.jnz);
    try assert(inner.jnz.? == inner.jez.?);

    const inner2 = inner.jnz.?;
    try expect(2, inner2.add);
    try expect(-1, inner2.shift);
    try expect(inner, inner2.jnz.?);
    try expect(end, inner2.jez);

    try expect(-2, end.add);
    try expect(1, end.output);
}

parse

We want our parser to keep consuming Blocks and keep checking what token made consumeBlock stop consuming tokens.

If next_tok is one of .Incr, .Decr, .Right, .Left, .Input, .Output, we simply chain the previous Inst to the next Inst. Pass in next_tok when parsing next Inst.

If next_tok is a .BegLoop, push the current Inst to a stack and set .jnz to the next Inst. However make sure to not pass in next_tok, by passing null in the prepend arg we effectively "consume" it, which is what we want as we're the ones handling this logic.

If next_tok is a .EndLoop, pop the Inst from the stack and set the current Insts .jnz to the popped Insts .jnz. Then set both the current Inst and the popped Insts .jez to the next Inst. Once again consume next_tok.

That's the algorithm, you have the tests set up, probably a good exercise for the reader. The solution is on the Github.

Next Part

We're almost done with the walkthrough and only one part left! When it's released I will post on X and also edit this section to link to it.

Thanks for reading!

- @cachecrab