# nand2tetris, Part 2.1

2022-08-25

Last week I completed nand2tetris part 1 and wrote [a blogpost about
it](../nand2tetris_1). It was great fun; I like how I built a computer
from NAND gates. Without missing a beat I went on to work on the sequel:
nand2tetris, Part 2.

So we have a computer. Now what? Hang it on the wall for the lolz? Of
course not — let's force the emulation of silicon to think.

Again, this course expects you to finish 6 projects (actually 7 but
"module 0" is actually a replica of project 04 from part 1). As of the
time of writing, I've finished two of them. Over time I discovered that
these two projects alone are already complicated enough. If I were to wait
until I finish all six, I won't have the motivation to cover them in one
single monolithic blogpost. I find it more sensible to mark a checkpoint,
write a blogpost, then carry on doing project 09.

__SPOILER WARNING: this blogpost contains my solutions to many challenges.
If you haven't taken the course but wish to, please do it before reading
this.__

## Stack machine

Let's pull out an inventory of things we made in project 06:

- A computer built on the HACK architecture
- HACK assembly language
- An assembler from HACK assembly to machine code

What do we do now? Write Tetris in assembly? Of course not. We need
a high-level language and a compiler. However, the authors find the curve
too steep from assembly to a high-level language. Therefore, we are
introduced an abstraction: the __stack machine__, also called virtual
machine (VM). Our ultimate task in project 07 and 08 is to write a VM
translator that translates VM commands into one single assembly file,
where we emulate the stack machine on HACK hardware.

What's a stack? Sure, it's a kind of data structure, but right now we are
talking about a bunch of contiguous words in RAM and a pointer to the top.
The stack starts at RAM[256] and goes on until 2047. RAM[0], which in our
assembler is also called SP, contains the pointer to the top of our stack
(although technically, it is the address of the topmost word plus one;
more on that later). Initially, since the stack is empty, SP = 256.

Of course, our machine is much more than that, even in our first two
projects. Our focus in this blogpost is to understand how the stack
machine works, how it helps our computer be a computer, and how we
implement it using HACK assembly.

Our complete RAM is charted as follows. It'll prove useful sooner or
later.

![](img/nand2tetris_2.1/ram_areas.png)

Nomenclature: Except in assembly code, the symbols SP/LCL/etc. refer to
the value stored in RAM[0]/RAM[1]/etc., and an asterisk followed by
a symbol (with or without an offset) dereferences it in C-style, e.g.
\*(LCL+1) refers to RAM[LCL+1], i.e. RAM[RAM[1] + 1].

## Memory segments

You can do many things to a stack, but the most elementary operations are
push and pop. Let's say you want to push a number, say 420, onto the
stack. All you need to do is:

- read SP
- go to RAM[SP]
- write 420
- increment SP by one

The VM command for this is `push constant 420`. You can repeat this as
many times as you please, and SP will keep updated.

![SP=263, stack: 420 (7 times)](img/nand2tetris_2.1/stack_push.png)

What if we need to push some non-constant number, like RAM[420]? In this
case, the VM implementation imposes some restrictions. You cannot just ask
for any memory in RAM; you have to specify one of eight "memory segments".
Four of them have something to do with the four companions of SP: LCL,
ARG, THIS, and THAT. In project 06, their purposes were not clearly
stated, but they're actually pointers to useful chunks of memory at
a certain point in the program. To push them onto the stack, you follow
these pointers.

Suppose LCL (RAM[1]) is 420, and I run this VM command: `push local 3`.
The result is:

- SP is incremented by one
- \*(SP-1) is equal to RAM[420+3]

LCL is not an assemble-time constant, so the 420+3 is computed in
runtime like this:

```
@LCL
D=M     // D is 420
@3
A=D+A   // A is 423
D=M     // now D is RAM[423]
```

The other memory segments are dead simple so I will not elaborate.

pop is the exact opposite of push and share a similar syntax. It's worth
mentioning that:

- You decrement SP _before_ copying top of stack to memory segment
- There's no need to erase what you just popped from the stack
- You can't pop to the constant segment

<details markdown="1">
<summary markdown="1">__Watch me struggle to implement `pop local 3`__</summary>

All I have to do is copy \*(--SP) into \*(LCL+3) in assembly. Easy, right?
Here goes:

```
// find out what's on top of stack
@SP
AM=M-1
D=M     // D = RAM[SP]
// go to destination
@LCL
D=M
@3
A=D+A
M=D     // RAM[LCL+3] = D
```

Do you spot the problem? When we did `D=M` the second time, we overwrote
our copy of \*SP with LCL. And that's bad, because we will need \*SP on
the last line.

The solution is a bit convoluted. You need to:

- add 3 to LCL
- decrement SP by one
- copy \*(SP) into D
- go to \*LCL
- write D into \*LCL
- subtract 3 from LCL

The main difference is that we no longer touch D as we relay the data. The
assembly is 14 instructions long, compared to 10 when we do `push local
3`, as a result of this limitation I call "not enough registers".

I would like to thank John Downey for [his
implementation](https://github.com/jtdowney/nand2tetris/blob/master/07/commands/pop_command.rb)
which I referred to when I was stuck.
</details>

OK, so we're done with pushing and popping. It doesn't sound very
exciting, does it? After all it's just copying and pasting stuff into more
stuff. Indeed, we need to discover what we can do with our stack.

## Stack arithmetics

When we program, we will need to do math at one point (gasp, who could've
thought that). Thus, we will be implementing `add`, `sub` and `neg`. These
commands do not take "parameters" like push and pop do because they always
operate on the top of stack. Suppose we have two numbers, a and b, with
b being the topmost word on our stack, i.e. \*(SP-1).

![SP=259, Stack: 420, a, b](img/nand2tetris_2.1/stack_before_add.png)

After we do `add`, the stack becomes this:

![SP=258, Stack: 420, a+b, b](img/nand2tetris_2.1/stack_after_add.png)

Similar to `pop`, stack arithmetic commands do not erase anything at or
after the SP marker because the stack machine only cares about what is
_in_ the stack, i.e. from 256 to SP.

Now it should be clear what `sub` does; instead of a+b, it pushes a-b.
`neg` only takes one value, so the effect is the top of stack gets
negated, i.e. b becomes -b.

## Logical commands

I classify the so-called logical commands into two categories:

- Bitwise operators: `and`, `or`, and `not`
- Comparison: `lt`, `gt`, and `eq`

### Bitwise operators

According to Shocken, logical commands return booleans. There are two
problems with this statement:

1. Booleans are indistinguishable from integers on the hardware layer;
2. _How much is a boolean?_

Without thinking I assumed false is 0 and true is 1, as in `stdbool.h`.
However, upon testing the CPU emulator told me that, in fact, true is -1,
or 65535. Not a single word was said about this specification in the
lectures. I suppose the authors had this written in their book somewhere
but forgot about it in the videos.

<details markdown="1">
<summary markdown="1">__In hindsight, this is obvious__</summary>

Suppose false is 0 and true is 1:

```
false = 0000 0000 0000 0000
true  = 0000 0000 0000 0001
```

Now we do some logical commands:

```
false & true = 0000 0000 0000 0000 = false
false | true = 0000 0000 0000 0001 = true
```

So far so good, but problems arise as soon as we do a bitwise NOT, which
is the only native instruction related to inverting a boolean:

```
~false = 1111 1111 1111 1111 = true, probably?
~true  = 1111 1111 1111 1110 = definitely not false
```

Thus we have proven that, if true is 1, a bitwise NOT does not invert it
to false. For this to happen, true has to be 16 ones, i.e. -1.
</details>

### Comparison

We proceed to implement the comparison commands: `lt`, `gt`, and `eq`.
Apparently they are extremely similar in logic:

- Pop topmost word on stack into D (SP is decremented whenever I say
  "pop")
- Pop and subtract the new topmost word from D
- Compare D to zero
- Push true to stack if [condition], false otherwise

There are two things to figure out at this moment:

- What is [condition] for each of the three commands
- How do we make this "if … otherwise …" work

To be honest, I was reluctant to use branching in assembly because if so
I would have to generate a lot of labels. After quite a while messing with
the D register trying to find an "elegant" solution, I quit. Branches are
not that bad.

<details markdown="1">
<summary markdown="1">__Sample assembly for `lt`__</summary>

Suppose the stack is once again in this state:

![SP=259, Stack: 420, a, b](img/nand2tetris_2.1/stack_before_add.png)

```
@SP
AM=M-1      // SP--
D=M         // D = b
A=A-1
D=M-D       // D = a-b
M=0         // RAM[257] = false
@END_LT_0
D;JGE       // if (D >= 0) goto END_LT_0
@SP
A=M-1
M=-1        // RAM[257] = true
(END_LT_0)
```
</details>

## Function commands

No language is practical without functions. Although I cannot be more
familiar with the notion of functions, the implementation remains
a mystery. The fascinating thing about functions it that it is not
a simple goto; the program has to return where it came from. In other
words, every function call involves a context switch, which will be
reversed once it returns.

Suppose we are in function f(x), and we're calling g(y). We are going to
have two states: one used by f just before g is called, and one used by g.

![two blobs labeled "f's state" and "g's state"](img/nand2tetris_2.1/function_states_blobs.png)

What's in a function's state? As it turns out, it's four pointers and one stack:

![f/g's LCL/ARG/THIS/THAT](img/nand2tetris_2.1/function_states.png)

The neat thing about our VM is that, there's only one global stack, and
the stack of each function just starts somewhere in the middle, instead of
256.

Suppose I, function f, am about to call function g at a point in time when
SP=301. I must first push the arguments it needs, let's say 60 and 9:

![SP=303, LCL=299, ARG=292, THIS=1024, THAT=2048, stack (from 301): 60, 9](img/nand2tetris_2.1/stack_before_call.png)

Now, I execute `call g 2`, where the 2 is the number of arguments that
g takes. But before I do anything risky, I need to back up my state. One
of them is the "return address", which is the address of the instruction
I would be executing next if I didn't call g.

<details markdown="1">
<summary markdown="1">__Wait, you're telling me we're predicting the future?__</summary>

It sounds like wizardry but it's not. This is done with the assembler's
symbol substitution: I insert a label (for example, `(Main$ret.0)`) just
before the first instruction of the next VM command. This is where I'll
end up if `call g 2` wasn't there. Knowing the assembler will take care of
this, I just write:

```
@Main$ret.0
D=A
@SP
A=M
M=D     // *SP is the return address
@SP
M=M+1   // SP++
// more code
@Main.g
0;JMP
(Main$ret.0)
// assembly translated from next VM command
```

If `0;JMP` is instruction 419, then the return address pushed is 420.
</details>

Then we just push LCL, ARG, THIS and THAT:

![SP=308, LCL=299, ARG=292, THIS=1024, THAT=2048, stack (from 301): 60, 9,
420, 299, 292, 1024, 2048](img/nand2tetris_2.1/stack_call_backup.png)

The section of memory from 303 to 307 (inclusive) is called the "frame" of
f, because it contains all the information needed by f to recover its
state.

Now that the clones of LCL et al. are safely contained within the frame,
we can commit all sorts of atrocities to them. ARG will become the address
of the first argument pushed onto f's stack, namely 301; LCL will become
the first address after the frame ends. We will not be touching THIS and
THAT just yet.

![SP=308, LCL=308, ARG=301, THIS=1024, THAT=2048, stack (from 301): 60, 9,
420, 299, 292, 1024, 2048](img/nand2tetris_2.1/stack_after_call.png)

At this point f is ready to jump to g. Suppose function g begins with this
VM command: `function g 3`. The 3 here is not the number of arguments.
Instead, it's the number of local variables it needs. Therefore, this VM
command pushes this number of zeroes onto the stack:

![SP=311, LCL=308, ARG=301, THIS=1024, THAT=2048, stack (from 301): 60, 9,
420, 299, 292, 1024, 2048, 0, 0, 0](img/nand2tetris_2.1/stack_after_function.png)

From now on, RAM[311-2047] is the stack that g has access to. It can also
access `argument 0-1` and `local 0-2`.

A dozen commands later, g decides to return. First it copies its topmost
word on stack — the return value — to somewhere convenient for f to use.
Then it exits and hands the control back to f. But in fact, g has no idea
it came from f; it only knows where to find and recover the pre-saved
context.

Suppose g pushed a few things and touched some local variables, and
decides to return.

![SP=313, LCL=308, ARG=301, THIS=1024, THAT=2048, stack (from 301): 60, 9,
420, 299, 292, 1024, 2048, 69, 51, 0, 2, 3519](img/nand2tetris_2.1/stack_before_return.png)

The return value is 3519, and we know that before g was called SP was 301
— always equal to our current ARG. So we copy \*(SP-1) to \*ARG.

When the number of arguments is zero, \*ARG happens to be where we store
the return address, namely 420. We could overwrite it and lose context!
That's why we need to back up the return address somewhere safe and
recover from there; I selected R13 in my implementation and I've had no
problems yet.

After this is done, everything after LCL is useless. So we move SP to LCL:

![SP=308, LCL=308, ARG=301, THIS=1024, THAT=2048, stack (from 301): 3519, 9,
420, 299, 292, 1024, 2048, 69, 51, 0, 2, 3519](img/nand2tetris_2.1/stack_before_restore.png)

Then, we restore LCL et al. by popping them one by one, and move SP right
after the return value.

![SP=302, LCL=299, ARG=292, THIS=1024, THAT=2048, stack (from 301): 3519, 9,
420, 299, 292, 1024, 2048, 69, 51, 0, 2, 3519](img/nand2tetris_2.1/stack_after_return.png)

Finally, we jump to the address indicated by R13, which is 420. And thus,
this grand waltz of function call and return is complete.

## Writing the VM translator

As a person with limited wits and patience, I decided to write the VM
translator in Python 3.10 to leverage the power of pattern matching, aka
`match case`, as defined in [PEP 636](https://peps.python.org/pep-0636/).
This means I can match VM commands like

```
match cmd.split():
    case []:
        continue
    case [("push" | "pop") as action, segment, index]:
        # handle memory command
    case [("add" | "sub" | "neg" | "and" | "or" | "not") as command]:
        # handle arithmetic / logical command
    # more cases here
    case _:
        # invalid command, raise error
```

This level of syntactic elegance is, in my opinion, unmatched among all
languages (pun intended). One other characteristic of the VM translator is
that it's context-free. In contrast, the assembler I wrote in project 06
needs to read the entire file to decide whether a given symbol is
a reference to a label or a variable. However it does depend on _some_
global state, for example to generate unique labels for each `lt` command.

Over the course of implementation, I realized that the hardest part is
that you are programming in two languages at once. You are using one
language to generate another. Of course, assembly is way harder (I forgot
`A=M` _multiple times_, and instead of debugging for hours in the CPU
emulator you should probably recheck if your assembly matches your
pseudocode in terms of pointer dereferencing). But the caveats in Python
showed up in unexpected places. For example, my optimized implementation
of `function <name> <varc>` consists of three parts: snippet A, snippet
B repeated (varc - 1) times, and snippet C. As you can imagine, things
took a dramatic turn as varc approaches zero. It turns out Python does not
raise an exception when I do `snippet_b * (-1)`; it silently returns an
empty string, which means at least one variable is initialized no matter
what. The bug is fixed, but I learned one more reason why Python is weird.

### Code statistics

- 408 lines of code
- About half of it is multiline strings of assembly

## Is the stack machine any good?

The stack machine, or any kind of virtual machine used by industrial
languages, is an intermediate step from high-level source code to machine
code. Not every languages needs it, though; C is a counterexample. Virtual
machines offer you one more level of abstraction, at the cost of
efficiency and size.

<details markdown="1">
<summary markdown="1">__One example how VM increases executable size__</summary>

Let's say we need to do:

```
push constant 1337
pop local 69
```

Our VM translator would output these lines:

```
// push constant 1337
@x
D=A
@SP
A=M
M=D
@SP
M=M+1
// pop local 69
@69
D=A
@LCL
M=M+D
@SP
AM=M-1
D=M
@LCL
A=M
M=D
@69
D=A
@LCL
M=M-D
```

This is 21 instructions, but in reality, some instructions are totally
unnecessary. Like when we incremented, then decremented SP. If we treated
the two commands as one atomic operation and rewrite the assembly from
scratch:

```
@69
D=A
@LCL
M=M+D
@1337
D=M
@LCL
A=M
M=D
@69
D=A
@LCL
M=M-D
```

Down to 13 instructions now.

</details>

The merit of a half-human, half-machine abstraction is increased control
and security. Suppose a function declares itself as `function f 4`, but
asks us to `pop local 9`. The VM translator could, in theory, refuse to
assemble the code, which could be malicious (if it even runs, that is).
But my VM translator does not, because I want to keep my VM translator as
simple as possible. Furthermore, the VM acts as a better troubleshooting
tool than assembly for your compiler.

## Conclusion

In project 07 and 08 of nand2tetris part 2, we built a VM translator that
translates an imaginary architecture to a concrete one. This added layer
of abstraction bridges the gap between high-level source code and machine
language. In this course, we discussed the benefits and costs in the
Perspectives unit; however, specifics are out of scope.

Philosophically, the lesson I learned is that in modern computer
engineering, universality often beats efficiency. The stack machine is
clever in that it is a complete abstraction of assembly. In other words,
everything you expect a computer to do, if you don't care about speed, can
be done on a stack machine, without the need to write a line of assembly.
Incidentally, this also describes high-level languages. Efficiency is
lost, yes, but there exist two justifications:

1. Most applications are not efficiency-critical, and if yours is
   extremely so, you probably shouldn't be using a VM;
2. If you _really_ need efficiency on VM, you can revise your translator
   to generate more efficient assembly.

I would like to say the following about this course:

- Shocken spoke in _every_ single lecture. Impressive.
- I'm not sure if I'm the only one, but I find the video lectures a lot
  more repetitive than those in part 1.
- At least one technical specification (booleans) was not communicated
  well.
- The project is arguably more challenging (intellectually, not
  syntactically) than writing the assembler.

That's all for project 07 and 08; see you in Part 2.2!