author Frederick Yin <fkfd@fkfd.me> 2022-07-11 15:23:25 +0800
committer Frederick Yin <fkfd@fkfd.me> 2022-07-11 15:23:25 +0800
commit 3785b2430a675c52bcb5f0cf6aac9d1ef7cca3c6
tree 5174048b010b5d1db57dc2807b1f0575cbbc325f
parent 763b99b4b1386bca0309b1f91f82795c7bb5e916
Write instructions for self about patching dataset
-rw-r--r-- README.md 21
1 files changed, 21 insertions, 0 deletions
diff --git a/README.md b/README.md
index d6ae9e6..84df4dc 100644
--- a/README.md
+++ b/README.md
@@ -109,3 +109,24 @@ I don't have any lawyer friends but what I know is no one can own
non-trademarked words in the English language. On this ground, all words
in the datasets are in the public domain, but the lyrics in the form of
full lines are owned by TØP and/or FBR.
+
+## Patching the dataset
+
+The dataset I'm using right now is 70% machine-generated and 30% manual
+labor. Regenerating it and redoing all the manual work is impractical.
+Many times I have fixed a mistake in the dataset by hand but forgotten
+to update the metadata that depends on it.
+Therefore, I decided to put this checklist here for future me.
+
+### Procedure for deleting a word
+
+- Remove word from `words.js`
+- Remove word from `words.html`
+- Decrement rowspan of `<td>` element for track title
+- Search for word in most frequent list, remove row if present
+- Decrement word count in `<h2>` element(s) of `words.html`
+- Decrement word count in "How many words are there?" section of `index.html`
+- Decrement word count in `README.md`
+- `grep` for word in `data/`, remove all occurrences in `tracks_words`,
+ `words`, `words.json`, and `most_frequent`
+- `scp *.html words.js www@fkfd.me:www/toys/one_top_song/`
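The removal steps in the checklist above could be double-checked before the final `scp` with a small script. What follows is only a minimal sketch and not part of this commit: the function name `find_leftovers` is hypothetical, the file formats are unknown so every file is treated as plain text, and the only assumption about layout is that `words.js`, `words.html`, and `data/` sit under one root directory as the checklist suggests.

```python
import re
from pathlib import Path

# Places the checklist says the word must be removed from.
# Assumption: they all live under a single root directory.
REMOVAL_TARGETS = ["words.js", "words.html", "data"]


def find_leftovers(word, root=".", targets=REMOVAL_TARGETS):
    """Return (file, line number, line) for every whole-word match of `word`.

    Hypothetical helper: files are scanned as plain text, so a match only
    means "look here", not "this is definitely a dataset entry".
    """
    pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
    hits = []
    for name in targets:
        target = Path(root) / name
        if target.is_dir():
            files = sorted(p for p in target.rglob("*") if p.is_file())
        elif target.is_file():
            files = [target]
        else:
            files = []  # missing targets are skipped silently
        for path in files:
            text = path.read_text(encoding="utf-8", errors="replace")
            for lineno, line in enumerate(text.splitlines(), start=1):
                if pattern.search(line):
                    hits.append((str(path), lineno, line.strip()))
    return hits
```

Calling `find_leftovers("someword")` from the repository root after working through the list returns an empty list when no whole-word occurrence remains. It does not verify the rowspan or word-count decrements, which still need a manual check.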