From 58b42b8c50b4ab684f856a73dd81d725bf42df88 Mon Sep 17 00:00:00 2001
From: Frederick Yin
Date: Wed, 14 Sep 2022 13:10:55 +0800
Subject: New post: projects/nand2tetris_2.2

---
 docs/projects/index.md           |   4 +
 docs/projects/nand2tetris_2.2.md | 178 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 182 insertions(+)
 create mode 100644 docs/projects/nand2tetris_2.2.md

diff --git a/docs/projects/index.md b/docs/projects/index.md
index 8ca35b4..e20b422 100644
--- a/docs/projects/index.md
+++ b/docs/projects/index.md
@@ -26,6 +26,10 @@ A few days after Part 1 was finished, I entered Part 2.
There were so many things ahead of me that I decided to split it into
multiple blogposts. In Part 2.1 I learned about the stack machine and
wrote a VM translator.

+### [Part 2.2](nand2tetris_2.2/)
+
+In nand2tetris 2.2 I built a tokenizer for a simple language, Jack.
+
## [SIRTET](sirtet)

![Screenshot of SIRTET mid-game](img/sirtet/sirtet.png)
diff --git a/docs/projects/nand2tetris_2.2.md b/docs/projects/nand2tetris_2.2.md
new file mode 100644
index 0000000..e118461
--- /dev/null
+++ b/docs/projects/nand2tetris_2.2.md
@@ -0,0 +1,178 @@
# nand2tetris, Part 2.2

2022-09-14

Welcome back! In Part 2.2 we are going to start writing a compiler. It
compiles a high-level language called Jack into the VM commands we saw in
the [last chapter](../nand2tetris_2.1/). For ease of organization I
decided, once again, to write it in Python.

The compiler is a three-step process. In this blogpost we are writing the
first step: a tokenizer. What does a tokenizer do?

## Tokenizer

Say we have a file called "Main.jack", containing only this line:

```
class Main {}
```

The tokenizer is going to read it, then extract meaningful elements or
"tokens" one by one until the file is exhausted. For this file, the tokens
will be `class`, `Main`, `{`, and `}`.

But how? I've never taken a compiler course before, so I was at a complete
(compilete?) loss.

I was tempted to say, "oh, simple. I can just split the file and tokenize
each element based on what it looks like." But split by what? Whitespace?
No, because you see no whitespace between the curly braces.
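
A quick check in the Python REPL makes the problem obvious:

```
>>> "class Main {}".split()
['class', 'Main', '{}']
```

The braces come out glued together as `{}`, which is not a token at all.
And while you could traverse the file char by char, efficiency is up for
debate.

<details>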
__I considered this a while back. Here's how I would have done it.__

Suppose I'm inside the function that reads the file char by char. I'm
going to define a list of tokens, let's say instances of the class
`Token`, like this:

```
tokens = []
```

Then I will implement a method `Token.append_char(char: str)` that takes
one character, and:

- appends it to the token if it is allowed (e.g. `e` after `someVariabl`)
- does nothing if the character doesn't matter (e.g. a space after a paren)
- signals its own termination if the char ends the token and starts another
  (e.g. `+` after a variable)
- raises an error if the token is not allowed at all (e.g. `a` directly
  after `123`)

This way, when we read a `char`, we try `tokens[-1].append_char(char)` to
see if the last token we saw accepts it. If not, we end the previous
token and push the new one onto the list of tokens.

You can see how this is an efficiency disaster. So many function calls
would be unnecessary if we could just look ahead more than one character
at a time.
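
For the record, here's roughly what I had in mind (hypothetical code,
never written into hackc; it only knows identifiers, integers, and
symbols, and ignores strings and comments):

```
SYMBOLS = "{}()[].,;+-*/&|<>=~"

def is_word_char(c):
    return c.isalnum() or c == "_"

class Token:
    def __init__(self, char: str):
        self.text = char
        self.is_int = char.isdigit()
        # symbols are one character long, so they are complete at birth
        self.closed = char in SYMBOLS

    def append_char(self, char: str) -> bool:
        """Try to extend this token. Return True if `char` was consumed,
        False if it begins a new token."""
        if char.isspace():
            self.closed = True  # a space doesn't matter by itself,
            return True         # but it does seal the current token
        if not self.closed and is_word_char(char):
            if self.is_int and not char.isdigit():
                # e.g. `a` directly after `123`
                raise SyntaxError("invalid token: " + self.text + char)
            self.text += char   # e.g. `e` after `someVariabl`
            return True
        return False            # e.g. `+` after a variable: new token

tokens = []
for char in "class Main {}":
    if tokens and tokens[-1].append_char(char):
        continue
    if not char.isspace():  # skip whitespace between tokens
        tokens.append(Token(char))

print([t.text for t in tokens])  # ['class', 'Main', '{', '}']
```

</details>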

Because of this very concern, I rejected the char-by-char idea in favor of
a line-by-line one. The solution is: I feed the line into a function which
extracts and returns the first complete token. Then I ask the returned
token how many characters long it is. If it answers `n`, I remove the
first `n` characters (and any leading whitespace) from the line, then feed
what's left back into the same function. Repeat this until the line has
nothing left, and I will get a list of tokens. (A sketch of this loop
appears at the end of this section.)

I took advantage of certain rules in the Jack language to simplify things.
For instance, all symbols are one character long. No `==`, not even `<=`,
only these:

```
{}()[].,;+-*/&|<>=~
```

This means if the first character is in the list of symbols, that's gotta
be it; we can stop looking.

If the first character is a digit, then the token is no doubt an integer
literal. If it is a letter or underscore, we first check whether it's a
keyword; if not, it must be an identifier: a class name, a variable, etc.

And then there are strings. The Jack standard specifies a string literal
to be "zero or more non-double-quote characters surrounded by double
quotes" (paraphrased). Easy, regex go brrr: `(".*?")`

The `*?` signals a non-greedy match. For example, suppose we have this
string:

```
"string 1".append("string 2")
```

and match it against these two regexes:

```
(".*")  -> "string 1".append("string 2"
(".*?") -> "string 1"
```

The second one is correct.
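
Putting these rules together, here's a rough sketch of the loop described
above (simplified and hypothetical, not hackc's actual source; the names
are made up and error handling is left out):

```
import re

SYMBOLS = "{}()[].,;+-*/&|<>=~"
KEYWORDS = {
    "class", "constructor", "function", "method", "field", "static",
    "var", "int", "char", "boolean", "void", "true", "false", "null",
    "this", "let", "do", "if", "else", "while", "return",
}

INT_RE = re.compile(r"\d+")
WORD_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
STR_RE = re.compile(r'(".*?")')

def first_token(line):
    """Extract the first complete token of `line` as a (type, text) pair."""
    if line[0] in SYMBOLS:  # symbols are one char long: stop looking
        return ("symbol", line[0])
    if line[0].isdigit():   # a digit can only start an integer literal
        return ("integer", INT_RE.match(line).group())
    if line[0] == '"':      # a quote can only start a string literal
        return ("string", STR_RE.match(line).group())
    word = WORD_RE.match(line).group()
    return ("keyword" if word in KEYWORDS else "identifier", word)

def tokenize_line(line):
    tokens = []
    line = line.strip()
    while line:
        token = first_token(line)
        tokens.append(token)
        n = len(token[1])         # ask the token how long it is
        line = line[n:].lstrip()  # drop it, plus any leading whitespace
    return tokens

print(tokenize_line("class Main {}"))
# [('keyword', 'class'), ('identifier', 'Main'), ('symbol', '{'), ('symbol', '}')]
```

<details>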
__What about escape sequences?__

Canonically, Jack does not support them. A string ends whenever a double
quote is met. I don't like it. I want escape sequences. I _need_ escape
sequences.

And thus I defined a "language extension" which is opt-in through the
argument `-E escape`. The implementation is as simple as changing the
regex, but it took me a while to figure out.

The regex is `("(.|\\")+?[^\\]")`. Let's take it apart:

```
(            group 1
  "            literal "
  (    )+?     group 2, one or more, non-greedy
    .            any character
    |            or
    \\"          \" (an escaped quote)
  [^\\]        any character except \
  "            literal "
)
```

It will keep reading the line until it meets an unescaped double quote.
Problem solved.
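
A quick sanity check of both regexes in the Python REPL (the Jack line
here is made up for the demo):

```
>>> import re
>>> line = r'do Output.printString("he said \"hi\"");'
>>> re.search(r'(".*?")', line).group()             # without -E escape
'"he said \\"'
>>> re.search(r'("(.|\\")+?[^\\]")', line).group()  # with -E escape
'"he said \\"hi\\""'
```

The REPL shows the matches as Python string literals, hence the doubled
backslashes; the second match is the whole Jack string, escapes and all.

</details>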

When a string is done, we come back to the main tokenizer function and
look for the next token. And this is how I tokenize the source code line
by line.

A small caveat: there are multiline comments in Jack that could mess
things up if I read line by line naïvely. I have to keep a flag at file
scope so I know whether I'm inside one or not.

## Error handling

A challenge I set for myself was to write sensible error messages,
something to alert the Jack programmer (who is also me). Therefore,
whenever I instantiate a token, I pass it the current line number and the
position of the token on that line. Whenever the tokenizer fails to
recognize something, it will put a caret under it:

```
$ cat bad.jack
class Main {
  hehe I don't give a shit
}

$ python -m hackc bad.jack
bad.jack:2
  hehe I don't give a shit
            ^ Invalid token
```

Caveat: it only works for indentation with spaces for now, because the
position in the line is counted in bytes, not display columns. It _is_
possible to change this behavior.

## Conclusion

The tokenizer is the first step from source code to an executable program.
It bridges the gap between human and machine, and thus involves the most
"stupid" intricacies. However, as soon as we get this over with, the rest
is constructing a syntax tree from the list of tokens according to a few
rules.

You can check out my code here:
[hackc](https://git.fkfd.me/nand2tetris.git/tree/projects/hackc)

See you in Part 2.3!
--
cgit v1.2.3