Write instructions for self about patching dataset

author: Frederick Yin <fkfd@fkfd.me> 2022-07-11 15:23:25 +0800
committer: Frederick Yin <fkfd@fkfd.me> 2022-07-11 15:23:25 +0800
commit: 3785b2430a675c52bcb5f0cf6aac9d1ef7cca3c6 (patch)
tree: 5174048b010b5d1db57dc2807b1f0575cbbc325f
parent: 763b99b4b1386bca0309b1f91f82795c7bb5e916 (diff)
1 files changed, 21 insertions, 0 deletions
diff --git a/README.md b/README.md
index d6ae9e6..84df4dc 100644
--- a/README.md
+++ b/README.md
@@ -109,3 +109,24 @@ I don't have any lawyer friends but what I know is no one can own
 non-trademarked words in the English language. On this ground, all words
 in the datasets are in the public domain, but the lyrics in the form of
 full lines are owned by TØP and/or FBR.
+
+## Patching the dataset
+
+The dataset I'm using right now is 70% machine-generated and 30% manual
+labor. Re-generating it and then redoing all the manual work is way
+beyond practical. Too many times I have had to manually fix the dataset
+because of a mistake I made, only to forget to update the metadata.
+Therefore, I decided to put this checklist here for future me.
+
+### Procedure for deleting a word
+
+- Remove word from `words.js`
+- Remove word from `words.html`
+- Decrement rowspan of `<td>` element for track title
+- Search for word in most frequent list, remove row if present
+- Decrement word count in `<h2>` element(s) of `words.html`
+- Decrement word count in "How many words are there?" section of `index.html`
+- Decrement word count in `README.md`
+- `grep` for word in `data/`, remove all occurrences in `tracks_words`,
+  `words`, `words.json`, and `most_frequent`
+- `scp *.html words.js www@fkfd.me:www/toys/one_top_song/`
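
The deletion checklist above is easy to leave half-done, so a small read-only check may help before the final `scp`. What follows is a minimal sketch, not part of this commit: `find_leftovers` and `checklist_paths` are hypothetical names, the file list is copied from the checklist, and scanning every file under `data/` is an assumption about how that directory is laid out. It only reports where a word still appears and changes nothing.

```python
import re
from pathlib import Path


def checklist_paths(root="."):
    """Collect the files named in the checklist (assumed repository layout)."""
    root = Path(root)
    paths = [root / name for name in
             ("words.js", "words.html", "index.html", "README.md")]
    data_dir = root / "data"
    if data_dir.is_dir():
        # Assumption: everything under data/ is a plain-text file worth scanning.
        paths.extend(sorted(p for p in data_dir.rglob("*") if p.is_file()))
    return paths


def find_leftovers(word, paths):
    """Return (path, line_number, line) for each whole-word match of `word`."""
    # \b keeps "sun" from matching "sunday"; matching is case-insensitive.
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    hits = []
    for path in paths:
        path = Path(path)
        if not path.is_file():
            continue  # skip files that are missing from this checkout
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if pattern.search(line):
                hits.append((str(path), number, line.strip()))
    return hits
```

Calling `find_leftovers(word, checklist_paths())` from the repository root should return an empty list once every step has been carried out for `word`. Hits in `README.md` or `index.html` may be ordinary prose rather than dataset entries, so they still need a human look. The rowspan and word-count steps are not covered, since they are numeric edits rather than occurrences of the word.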