# nand2tetris, Part 2.2

2022-09-14

Welcome back! In Part 2.2 we are going to start writing a compiler. It
compiles a high-level language called Jack down to the VM commands we saw
in the [last chapter](nand2tetris_2.1.md). For ease of organization
I decided, once again, to write it in Python.

The compiler works in three steps. In this blog post we are writing the
first one: a tokenizer. What does a tokenizer do?

## Tokenizer

Say we have a file called "Main.jack", containing only this line:

```
class Main {}
```

The tokenizer is going to read it, then extract meaningful elements or
"tokens" one by one until the file is exhausted. For this file, the tokens
will be `class`, `Main`, `{` and `}`.

But how? I've never taken a compiler course before, so I was at a complete
(compilete?) loss.

I was tempted to say, "oh, simple. I can just split the file and classify
each element by what it looks like." But split by what? Whitespace? No,
because there is no whitespace between the curly braces. And while you
could traverse the file char by char, the efficiency of that is up for
debate.

<details markdown="1">
<summary markdown="1">__I considered this a while back. Here's how I would have done it__</summary>

Suppose I'm inside the function that reads the file char by char. I'm
going to define a list of tokens, let's say instances of the class
`Token`, like this:

```
tokens = []
```

Then I would implement a method `Token.append_char(char: str)` that takes
one character, and:

- appends it to the token if that is allowed (e.g. `e` after `someVariabl`)
- does nothing if the character doesn't matter (e.g. a space after a paren)
- signals its own termination if the character ends this token and starts
  another (e.g. `+` after a variable)
- raises an error if the character is not allowed at all (e.g. `a` directly
  after `123`)

This way, when we read a `char`, we try `tokens[-1].append_char(char)` to
see whether the last token we saw accepts it. If not, we end the previous
token and push a new one onto the list of tokens.
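
To make this concrete, here is a toy sketch of the idea (not code from the
actual compiler; it only knows identifiers/keywords and single-character
symbols):

```
class Token:
    SYMBOLS = set("{}()[].,;+-*/&|<>=~")

    def __init__(self, char: str):
        self.text = char
        self.closed = char in self.SYMBOLS  # symbols are complete immediately

    def append_char(self, char: str) -> bool:
        """Try to absorb char; return False if it belongs to the next token."""
        if self.closed:
            return False
        if char.isspace():
            self.closed = True  # whitespace ends the token but is not part of it
            return True
        if char.isalnum() or char == "_":
            self.text += char  # e.g. "e" after "someVariabl"
            return True
        self.closed = True  # e.g. "+" after a variable starts a new token
        return False


def tokenize(source: str) -> list:
    tokens = []
    for char in source:
        if tokens and tokens[-1].append_char(char):
            continue  # absorbed (or swallowed as whitespace) by the previous token
        if char.isspace():
            continue  # whitespace between tokens
        tokens.append(Token(char))  # this char starts a new token
    return [t.text for t in tokens]


print(tokenize("class Main {}"))  # ['class', 'Main', '{', '}']
```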

You can see how this is an efficiency disaster. So many function calls
would be unnecessary if we could just look ahead more than one character
at a time.

</details>

Because of this very concern, I rejected the char-by-char idea in favor of
going line by line. The solution: I feed the line into a function which
extracts and returns the first complete token. Then I ask the returned
token how many characters long it is. If it answers `n`, I remove the
first `n` characters (and any leading whitespace) from the line, then feed
what is left back into the same function. Repeat until the line has
nothing left, and I end up with a list of tokens. (There is a small code
sketch of this a little further down.)

I took advantage of certain rules of the Jack language to simplify things.
For instance, all symbols are one character long. No `==`, not even `<=`;
only these:

```
{}()[].,;+-*/&|<>=~
```

This means if the first character is in the list of symbols, that's gotta
be it; we can stop looking.

If the first character is a digit, then the token is no doubt an integer
literal. If it is a letter or an underscore, we grab the whole run of
letters, digits and underscores, then check whether it is a keyword; if
not, it must be an identifier: a class name, a variable, and so on.
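
Putting the pieces so far together, here is a stripped-down sketch (my own
simplification for this post, not the code in the repo; it skips string
literals and comments, which come up next):

```
import re

SYMBOLS = set("{}()[].,;+-*/&|<>=~")
KEYWORDS = {
    "class", "constructor", "function", "method", "field", "static", "var",
    "int", "char", "boolean", "void", "true", "false", "null", "this",
    "let", "do", "if", "else", "while", "return",
}
INT_RE = re.compile(r"\d+")
IDENT_RE = re.compile(r"[A-Za-z_]\w*")


def first_token(line: str) -> tuple:
    """Extract the first complete token from a line with no leading whitespace."""
    char = line[0]
    if char in SYMBOLS:
        return ("symbol", char)  # symbols are always one character
    if char.isdigit():
        return ("integerConstant", INT_RE.match(line).group())
    if char.isalpha() or char == "_":
        word = IDENT_RE.match(line).group()
        return ("keyword" if word in KEYWORDS else "identifier", word)
    raise ValueError(f"Invalid token at: {line!r}")


def tokenize_line(line: str) -> list:
    """Strip tokens off the front of the line until nothing is left."""
    tokens = []
    line = line.strip()
    while line:
        kind, text = first_token(line)
        tokens.append((kind, text))
        line = line[len(text):].lstrip()  # drop the token and the whitespace after it
    return tokens


print(tokenize_line("class Main {}"))
# [('keyword', 'class'), ('identifier', 'Main'), ('symbol', '{'), ('symbol', '}')]
```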

And there are strings. The Jack standard specifies a string literal to be
"zero or more non-double-quote characters surrounded by double quotes"
(paraphrased). Easy, regex go brrr: `(".*?")`

The `*?` signals a non-greedy search. For example, suppose we have this
string:

```
"string 1".append("string 2")
```

and match it against these two regexes:

```
(".*")  -> "string 1.append("string 2"
(".*?") -> "string 1"
```

The second one is correct.
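
In Python (with the standard `re` module) the difference looks like this:

```
import re

line = '"string 1".append("string 2")'

print(re.search(r'(".*")', line).group(1))   # "string 1".append("string 2"
print(re.search(r'(".*?")', line).group(1))  # "string 1"
```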

<details markdown="1">
<summary markdown="1">__What about escape sequences?__</summary>

Canonically, Jack does not support them. A string ends whenever a double
quote is met. I don't like it. I want escape sequences. I _need_ escape
sequences.

And thus I defined a "language extension", which is opted into with the
argument `-E escape`. The implementation is as simple as changing the
regex, but it still took me a while to figure out.

The regex is `("(.|\\")+?[^\\]")`. Let's take it apart:

```
(                )  group 1
 "                      literal "
  (     )+?             group 2, one or more, non-greedy
   .                        any character
    |                       or
     \\"                    \"
           [^\\]        any character except \
                "       literal "
```

It will keep reading the line until it meets an unescaped double quote.
Problem solved.
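
A quick sanity check of that regex with Python's `re` (just a demo, not
the actual implementation):

```
import re

STRING_RE = re.compile(r'("(.|\\")+?[^\\]")')

line = r'let msg = "she said \"hi\" to me";'
print(STRING_RE.search(line).group(1))  # "she said \"hi\" to me"
```
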
</details>

When a string is done, we come back to the main tokenizer function and
look for the next token. And this is how I tokenize the source code line
by line.

A small caveat. There are multiline comments in Jack that could mess
things up if I read line by line naïvely. I have to keep a flag in the
scope of this file so I know if I'm inside one or not.
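
Roughly, the flag works like this (a simplified sketch; it ignores the
corner case of comment markers appearing inside string literals):

```
in_comment = False  # are we currently inside a /* ... */ block?


def strip_comments(line: str) -> str:
    """Remove // and /* ... */ comments from one line, remembering open blocks."""
    global in_comment
    out = ""
    i = 0
    while i < len(line):
        if in_comment:
            end = line.find("*/", i)
            if end == -1:
                return out  # the comment continues on the next line
            in_comment = False
            i = end + 2
        elif line.startswith("//", i):
            return out  # the rest of the line is a comment
        elif line.startswith("/*", i):
            in_comment = True
            i += 2
        else:
            out += line[i]
            i += 1
    return out
```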

## Error handling

A challenge I set for myself was to write sensible error messages,
something to alert the Jack programmer (who, too, is me). Therefore,
whenever I instantiate a token, I pass in the current line number and the
token's position on that line. Whenever the tokenizer fails to recognize
something, it puts a caret under the offending character:

```
$ cat bad.jack
class Main {
    hehe I don't give a shit
}

$ python -m hackc bad.jack
bad.jack:2
    hehe I don't give a shit
              ^ Invalid token
```

Caveat: it only works for indentation with spaces for now because the
position in line is counted with respect to bytes, not columns. It _is_
possible to change this behavior.
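
The reporting part could be as small as this (a sketch, not the code from
the repo; `pos` is whatever position the token recorded):

```
def report_invalid_token(filename: str, line_no: int, line: str, pos: int):
    print(f"{filename}:{line_no}")
    print(line)
    print(" " * pos + "^ Invalid token")  # aligns only if the indentation is spaces
```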

## Conclusion

The tokenizer is the first step on the way from source code to an
executable program. It bridges the gap between human and machine, and thus
involves the most "stupid" intricacies. But once we get it over with, the
rest is constructing a syntax tree from the list of tokens according to
a few rules.

You can check out my code here:
[hackc](https://git.fkfd.me/nand2tetris.git/tree/projects/hackc)

See you in part 2.3!