From Zero to Brainfuck in Zig (Part 2)
This is Part 2, following on from Part 1.
Table of Contents
Fusing
We have the tokeniser, time to actually do something with the tokens. We need to fuse as many tokens as possible into a single instruction while preserving the order of operations.
Here's the instruction format from the last part as a reminder:
const Inst = struct {
add: i8 = 0,
shift: i32 = 0,
input: u8 = 0,
output: u8 = 0,
jez: ?*Inst = null,
jnz: ?*Inst = null,
};
consumeBlock's API
This parser will need to fill in the add, shift, input and output
fields. jez and jnz will be filled in by the meta-parser.
As we did for consumeToken, we now have to think about the API for consumeBlock.
We will need to take in a cursor: *[]const u8, and we will have to return an ?Inst.
However, when we return an ?Inst it's because we encountered a token that
can't be fused. We know what that token is, and hence it would have advanced
the cursor forward - we'd need to backtrack.
Backtracking would be fine, especially in this language where tokens are just 1 char wide, however if this was a language with varying width tokens (read: pretty much all languages) we'd need additional logic for the amount we have to backtrack. Plus lower perf from duplicate tokenisation.
So, let's not backtrack. Let's just return the token that was consumed from
cursor but not consumed by our function.
const Block = struct {
inst: Inst,
next_tok: ?Token,
};
As we return next_tok, we should also offer a way for the caller to give it
back to us for use before consuming through cursor. We'll do this through a
prepend: ?Token parameter.
fn consumeBlock(cursor: *[]const u8, prepend: ?Token) ?Block { ... }
Testing consumeBlock
To write a test for consumeBlock we need to create a program that covers
our fusion logic.
++- would fuse into .add = 1, this covers consecutive duplicates &
inverses.
> would add a shift into that same Inst (.shift = 1).
A further --- would be in a different Inst as it gets executed top to
bottom and we've already shifted; .add = -3.
<<<> -> .shift = -2,
... -> .output = 3,
, -> in a different Inst, .input = 1 - again, because .input is before
.output and that was set above.
Here's how that test looks like:
test "consumeBlock" {
const expect = std.testing.expectEqual;
const program =
\\++- > -
\\--<<<>...,
;
const expected = .{
Inst{
.add = 1,
.shift = 1,
},
Inst{
.add = -3,
.shift = -2,
.output = 3,
},
Inst{
.input = 1,
},
};
var cursor: []const u8 = program;
var prepend: ?Token = null;
inline for (expected) |inst| {
const res = consumeBlock(&cursor, prepend).?;
try expect(inst, res.inst);
prepend = res.next_tok;
}
inline for (0..69) |_| {
try expect(null, consumeBlock(&cursor, prepend));
}
}
Note we always do prepend = res.next_tok;, that is necessary to not skip
tokens.
consumeBlock
Here's the full function:
fn consumeBlock(cursor: *[]const u8, prepend: ?Token) ?Block {
if (cursor.len == 0 and prepend == null) {
return null;
}
var prepend_once = prepend;
const ParseState = enum { Add, Shift, Input, Output };
var state = ParseState.Add;
var inst = Inst{};
var token: ?Token = null;
while (true) {
token = prepend_once orelse consumeToken(cursor);
prepend_once = null;
switch (token orelse break) {
.Incr => switch (state) {
.Add => inst.add = fallibleAdd(inst.add, 1) orelse break,
else => break,
},
.Decr => switch (state) {
.Add => inst.add = fallibleSub(inst.add, 1) orelse break,
else => break,
},
.Right => switch (state) {
.Add => {
state = ParseState.Shift;
inst.shift = fallibleAdd(inst.shift, 1) orelse break;
},
.Shift => inst.shift = fallibleAdd(inst.shift, 1) orelse break,
else => break,
},
.Left => switch (state) {
.Add => {
state = ParseState.Shift;
inst.shift = fallibleSub(inst.shift, 1) orelse break;
},
.Shift => inst.shift = fallibleSub(inst.shift, 1) orelse break,
else => break,
},
.Input => switch (state) {
.Add, .Shift => {
state = ParseState.Input;
inst.input = fallibleAdd(inst.input, 1) orelse break;
},
.Input => inst.input = fallibleAdd(inst.input, 1) orelse break,
else => break,
},
.Output => switch (state) {
.Add, .Shift, .Input => {
state = ParseState.Output;
inst.output = fallibleAdd(inst.output, 1) orelse break;
},
.Output => inst.output = fallibleAdd(inst.output, 1) orelse break,
},
.BegLoop => break,
.EndLoop => break,
}
}
return Block{ .inst = inst, .next_tok = token };
}
Let's break it down:
if (cursor.len == 0 and prepend == null) {
return null;
}
This isn't necessary, but I figured it's easier for the programmer to understand the function does nothing if you input nothing.
var prepend_once = prepend;
In Zig, parameters are always const. As prepend is an optional, via
type inference prepend_once is also an optional. We will use this
as a "once container" by setting it to null after usage.
const ParseState = enum { Add, Shift, Input, Output };
var state = ParseState.Add;
We define a ParseState type for us to track what the last field we set was.
E.g. a < or > would set state = ParseState.Shift and then we'd know that
we have to return if we encounter a +.
var inst = Inst{};
var token: ?Token = null;
while (true) {
token = prepend_once orelse consumeToken(cursor);
prepend_once = null;
If prepend_once is present, use it; otherwise, consume a token from cursor.
switch (token orelse break) {
If token is null, we break the loop, otherwise we switch on it. Note that
we're switching on the token instead of our parse state.
.Incr => switch (state) {
// If currently adding, increment the `add` field.
.Add => inst.add = fallibleAdd(inst.add, 1) orelse break,
else => break,
},
When we encounter a +, we can only add if we're in the .Add state to
preserve instruction order.
I defined helper functions fallibleAdd and fallibleSub that allows us to
use orelse if addition overflows:
inline fn fallibleAdd(a: anytype, b: anytype) ?@TypeOf(a, b) {
const res, const overflow = @addWithOverflow(a, b);
return if (overflow == 1) null else res;
}
inline fn fallibleSub(a: anytype, b: anytype) ?@TypeOf(a, b) {
const res, const overflow = @subWithOverflow(a, b);
return if (overflow == 1) null else res;
}
The rest of the big switch just advances the state corresponding to the latest token, breaking instead of going to a previous state.
Finally, we return what we parsed and the token that made us break.
return Block{ .inst = inst, .next_tok = token };
Joining Blocks Together
parse's API
We're finally implementing a user facing API: parse.
Couple things to keep in mind:
- Our representation uses pointers, we have to allocate here, so we need
an
allocator - To ensure cache-locality our instructions should be allocated in an Arena,
so our
allocatorshould be a*std.heap.ArenaAllocator - We're going to need to store a stack to keep track of loops, but this is
temporary. As it is a different lifetime and we don't want to trash our
arena allocator, we'll need a
temp_alloc: std.mem.Allocator.
With that laid out, this is the API:
fn parse(program: []const u8, allocator: *ArenaAllocator, temp_alloc: Allocator) ParseError!?*Inst { ... }
Testing parse
The only new logic this function adds is setting Inst's .jez and .jnz.
Here's a test that confirms loops and Blocks chained together:
test "parse" {
const testing = std.testing;
const expect = testing.expectEqual;
const assert = testing.expect;
var inst_alloc = ArenaAllocator.init(testing.allocator);
defer inst_alloc.deinit();
const temp_alloc = testing.allocator;
const res = try parse("++[->++<]--.", &inst_alloc, temp_alloc);
const inst = res.?;
try expect(2, inst.add);
try assert(null != inst.jez);
try assert(null != inst.jnz);
const inner = inst.jnz.?;
const end = inst.jez.?;
try expect(-1, inner.add);
try expect(1, inner.shift);
try assert(null != inner.jnz);
try assert(inner.jnz.? == inner.jez.?);
const inner2 = inner.jnz.?;
try expect(2, inner2.add);
try expect(-1, inner2.shift);
try expect(inner, inner2.jnz.?);
try expect(end, inner2.jez);
try expect(-2, end.add);
try expect(1, end.output);
}
parse
We want our parser to keep consuming Blocks and keep checking
what token made consumeBlock stop consuming tokens.
If next_tok is one of .Incr, .Decr, .Right, .Left, .Input, .Output,
we simply chain the previous Inst to the next Inst. Pass in next_tok
when parsing next Inst.
If next_tok is a .BegLoop, push the current Inst to a stack and
set .jnz to the next Inst. However make sure to not pass in next_tok,
by passing null in the prepend arg we effectively "consume" it, which
is what we want as we're the ones handling this logic.
If next_tok is a .EndLoop, pop the Inst from the stack and set
the current Insts .jnz to the popped Insts .jnz.
Then set both the current Inst and the popped Insts .jez to the
next Inst. Once again consume next_tok.
That's the algorithm, you have the tests set up, probably a good exercise for the reader. The solution is on the Github.
Next Part
We're almost done with the walkthrough and only one part left! When it's released I will post on X and also edit this section to link to it.
Thanks for reading!
- @cachecrab