From 3785b2430a675c52bcb5f0cf6aac9d1ef7cca3c6 Mon Sep 17 00:00:00 2001
From: Frederick Yin
Date: Mon, 11 Jul 2022 15:23:25 +0800
Subject: Write instructions for self about patching dataset

---
 README.md | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/README.md b/README.md
index d6ae9e6..84df4dc 100644
--- a/README.md
+++ b/README.md
@@ -109,3 +109,24 @@ I don't have any lawyer friends but what I know is no one can own
 non-trademarked words in the English language. On this ground, all words
 in the datasets are in the public domain, but the lyrics in the form of
 full lines are owned by TØP and/or FBR.
+
+## Patching the dataset
+
+The dataset I'm using right now is 70% machine-generated and 30% manual
+labor. Re-generating it then doing all the work again is way beyond
+practicality. It has happened so many times I had to manually fix the
+dataset because of a mistake I made, but forgot to modify metadata.
+Therefore, I decided to put this checklist here for future me.
+
+### Procedure for deleting a word
+
+- Remove word from `words.js`
+- Remove word from `words.html`
+- Decrement rowspan of `` element for track title
+- Search for word in most frequent list, remove row if present
+- Decrement word count in `` element(s) of `word.html`
+- Decrement word count in "How many words are there?" section of `index.html`
+- Decrement word count in `README.md`
+- `grep` for word in `data/`, remove all occurrences in `tracks_words`,
+  `words`, `words.json`, and `most_frequent`
+- `scp *.html words.js www@fkfd.me:www/toys/one_top_song/`
-- 
cgit v1.2.3
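
The `grep` step in the checklist above could be sketched as a small shell function that lists every file still mentioning a word. This is only an illustration and not something the repository contains: the name `leftovers` is made up, and it assumes a `grep` that supports `-r` and `-w` (GNU and BSD grep both do).

```shell
#!/bin/sh
# Hypothetical helper, not part of the repository: list every file under a
# directory that still contains the given word, to double-check a deletion.
# Usage: leftovers WORD [DIR]
leftovers() {
    word=$1
    dir=${2:-.}
    # -r: recurse into DIR          -l: print matching file names only
    # -w: match whole words only    -F: treat WORD as a literal string
    grep -rlwF -- "$word" "$dir"
}
```

After a deletion, `leftovers someword data/` should print nothing. Any file name it does print still contains the word and needs another look.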