From 58b42b8c50b4ab684f856a73dd81d725bf42df88 Mon Sep 17 00:00:00 2001
From: Frederick Yin
Date: Wed, 14 Sep 2022 13:10:55 +0800
Subject: New post: projects/nand2tetris_2.2

---
 docs/projects/index.md           |   4 +
 docs/projects/nand2tetris_2.2.md | 178 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 182 insertions(+)
 create mode 100644 docs/projects/nand2tetris_2.2.md

diff --git a/docs/projects/index.md b/docs/projects/index.md
index 8ca35b4..e20b422 100644
--- a/docs/projects/index.md
+++ b/docs/projects/index.md
@@ -26,6 +26,10 @@ A few days after Part 1 was finished, I entered Part 2.
There were so many things ahead of me that I decided to split it into
multiple blogposts. In Part 2.1 I learned about the stack machine and
wrote a VM translator.

+### [Part 2.2](nand2tetris_2.2/)
+
+In nand2tetris 2.2 I built a tokenizer for a simple language, Jack.
+
## [SIRTET](sirtet)

![Screenshot of SIRTET mid-game](img/sirtet/sirtet.png)
diff --git a/docs/projects/nand2tetris_2.2.md b/docs/projects/nand2tetris_2.2.md
new file mode 100644
index 0000000..e118461
--- /dev/null
+++ b/docs/projects/nand2tetris_2.2.md
@@ -0,0 +1,178 @@
# nand2tetris, Part 2.2

2022-09-14

Welcome back! In Part 2.2 we are going to start writing a compiler. It
compiles a high-level language called Jack into the VM commands we saw in
the [last chapter](../nand2tetris_2.1/). For ease of organization I
decided, once again, to write it in Python.

The compiler is a three-step process. In this blogpost we are writing the
first step: a tokenizer. What does a tokenizer do?

## Tokenizer

Say we have a file called "Main.jack", containing only this line:

```
class Main {}
```

The tokenizer is going to read it, then extract meaningful elements or
"tokens" one by one until the file is exhausted. For this file, the tokens
will be `class`, `Main`, `{`, and `}`.

But how? I've never taken a compiler course before, so I was at a complete
(compilete?) loss.

I was tempted to say, "oh, simple. I can just split the file and tokenize
each element based on what it looks like." But split by what? Whitespace?
No, because you see no whitespace between the curly braces.
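
A quick check in the Python REPL makes the problem obvious:

```
>>> "class Main {}".split()
['class', 'Main', '{}']
```

The braces come out glued together as `{}`, which is not a token at all.
And while you could traverse the file char by char, efficiency is up for
debate.

<details>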
__I considered this a while back. Here's how I would have done it.__

Suppose I'm inside the function that reads the file char by char. I'm
going to define a list of tokens, let's say instances of the class
`Token`, like this:

```
tokens = []
```

Then I will implement a method `Token.append_char(char: str)` that takes
one character, and:

- appends it to the token if it is allowed (e.g. `e` after `someVariabl`)
- does nothing if the character doesn't matter (e.g. a space after a paren)
- signals its own termination if the char ends the token and starts another
  (e.g. `+` after a variable)
- raises an error if the token is not allowed at all (e.g. `a` directly
  after `123`)

This way, when we read a `char`, we try `tokens[-1].append_char(char)` to
see if the last token we saw accepts it. If not, we end the previous
token and push the new one onto the list of tokens.

You can see how this is an efficiency disaster. So many function calls
would be unnecessary if we could just look ahead more than one character
at a time.
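
For the record, here's roughly what I had in mind (hypothetical code,
never written into hackc; it only knows identifiers, integers, and
symbols, and ignores strings and comments):

```
SYMBOLS = "{}()[].,;+-*/&|<>=~"

def is_word_char(c):
    return c.isalnum() or c == "_"

class Token:
    def __init__(self, char: str):
        self.text = char
        self.is_int = char.isdigit()
        # symbols are one character long, so they are complete at birth
        self.closed = char in SYMBOLS

    def append_char(self, char: str) -> bool:
        """Try to extend this token. Return True if `char` was consumed,
        False if it begins a new token."""
        if char.isspace():
            self.closed = True  # a space doesn't matter by itself,
            return True         # but it does seal the current token
        if not self.closed and is_word_char(char):
            if self.is_int and not char.isdigit():
                # e.g. `a` directly after `123`
                raise SyntaxError("invalid token: " + self.text + char)
            self.text += char   # e.g. `e` after `someVariabl`
            return True
        return False            # e.g. `+` after a variable: new token

tokens = []
for char in "class Main {}":
    if tokens and tokens[-1].append_char(char):
        continue
    if not char.isspace():  # skip whitespace between tokens
        tokens.append(Token(char))

print([t.text for t in tokens])  # ['class', 'Main', '{', '}']
```

</details>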

Because of this very concern, I rejected the char-by-char idea in favor of
a line-by-line one. The solution is: I feed the line into a function which
extracts and returns the first complete token. Then I ask the returned
token how many characters long it is. If it answers `n`, I remove the
first `n` characters (and any leading whitespace) from the line, then feed
what's left back into the same function. Repeat this until the line has
nothing left, and I will get a list of tokens. (A sketch of this loop
appears at the end of this section.)

I took advantage of certain rules in the Jack language to simplify things.
For instance, all symbols are one character long. No `==`, not even `<=`,
only these:

```
{}()[].,;+-*/&|<>=~
```

This means if the first character is in the list of symbols, that's gotta
be it; we can stop looking.

If the first character is a digit, then the token is no doubt an integer
literal. If it is a letter or underscore, we first check whether it's a
keyword; if not, it must be an identifier: a class name, a variable, etc.

And then there are strings. The Jack standard specifies a string literal
to be "zero or more non-double-quote characters surrounded by double
quotes" (paraphrased). Easy, regex go brrr: `(".*?")`

The `*?` signals a non-greedy match. For example, suppose we have this
string:

```
"string 1".append("string 2")
```

and match it against these two regexes:

```
(".*")  -> "string 1".append("string 2"
(".*?") -> "string 1"
```

The second one is correct.
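
Putting these rules together, here's a rough sketch of the loop described
above (simplified and hypothetical, not hackc's actual source; the names
are made up and error handling is left out):

```
import re

SYMBOLS = "{}()[].,;+-*/&|<>=~"
KEYWORDS = {
    "class", "constructor", "function", "method", "field", "static",
    "var", "int", "char", "boolean", "void", "true", "false", "null",
    "this", "let", "do", "if", "else", "while", "return",
}

INT_RE = re.compile(r"\d+")
WORD_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
STR_RE = re.compile(r'(".*?")')

def first_token(line):
    """Extract the first complete token of `line` as a (type, text) pair."""
    if line[0] in SYMBOLS:  # symbols are one char long: stop looking
        return ("symbol", line[0])
    if line[0].isdigit():   # a digit can only start an integer literal
        return ("integer", INT_RE.match(line).group())
    if line[0] == '"':      # a quote can only start a string literal
        return ("string", STR_RE.match(line).group())
    word = WORD_RE.match(line).group()
    return ("keyword" if word in KEYWORDS else "identifier", word)

def tokenize_line(line):
    tokens = []
    line = line.strip()
    while line:
        token = first_token(line)
        tokens.append(token)
        n = len(token[1])         # ask the token how long it is
        line = line[n:].lstrip()  # drop it, plus any leading whitespace
    return tokens

print(tokenize_line("class Main {}"))
# [('keyword', 'class'), ('identifier', 'Main'), ('symbol', '{'), ('symbol', '}')]
```

<details>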
__What about escape sequences?__

Canonically, Jack does not support them. A string ends whenever a double
quote is met. I don't like it. I want escape sequences. I _need_ escape
sequences.

And thus I defined a "language extension" which is opt-in through the
argument `-E escape`. The implementation is as simple as changing the
regex, but it took me a while to figure out.

The regex is `("(.|\\")+?[^\\]")`. Let's take it apart:

```
(            group 1
  "            literal "
  (    )+?     group 2, one or more, non-greedy
    .            any character
    |            or
    \\"          \" (an escaped quote)
  [^\\]        any character except \
  "            literal "
)
```

It will keep reading the line until it meets an unescaped double quote.
Problem solved.
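
A quick sanity check of both regexes in the Python REPL (the Jack line
here is made up for the demo):

```
>>> import re
>>> line = r'do Output.printString("he said \"hi\"");'
>>> re.search(r'(".*?")', line).group()             # without -E escape
'"he said \\"'
>>> re.search(r'("(.|\\")+?[^\\]")', line).group()  # with -E escape
'"he said \\"hi\\""'
```

The REPL shows the matches as Python string literals, hence the doubled
backslashes; the second match is the whole Jack string, escapes and all.

</details>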

When a string is done, we come back to the main tokenizer function and
look for the next token. And this is how I tokenize the source code line
by line.

A small caveat: there are multiline comments in Jack that could mess
things up if I read line by line naïvely. I have to keep a flag at file
scope so I know whether I'm inside one or not.

## Error handling

A challenge I set for myself was to write sensible error messages,
something to alert the Jack programmer (who is also me). Therefore,
whenever I instantiate a token, I pass it the current line number and the
position of the token on that line. Whenever the tokenizer fails to
recognize something, it will put a caret under it:

```
$ cat bad.jack
class Main {
  hehe I don't give a shit
}

$ python -m hackc bad.jack
bad.jack:2
  hehe I don't give a shit
            ^ Invalid token
```

Caveat: it only works for indentation with spaces for now, because the
position in the line is counted in bytes, not display columns. It _is_
possible to change this behavior.

## Conclusion

The tokenizer is the first step from source code to an executable program.
It bridges the gap between human and machine, and thus involves the most
"stupid" intricacies. However, as soon as we get this over with, the rest
is constructing a syntax tree from the list of tokens according to a few
rules.

You can check out my code here:
[hackc](https://git.fkfd.me/nand2tetris.git/tree/projects/hackc)

See you in Part 2.3!
--
cgit v1.2.3