author | Frederick Yin <fkfd@fkfd.me> | 2022-05-08 12:58:06 +0800 |
---|---|---|
committer | Frederick Yin <fkfd@fkfd.me> | 2022-05-08 12:58:06 +0800 |
commit | d60e18cf51fb42fdb1176f5e5e7813821beb2add (patch) | |
tree | d9041d10776c47ea013dddd5a6c10ff1aaab887a /docs | |
parent | 102843977f3d58c4b8fb9c15996da5cfbbbbee22 (diff) |
New post: projects/one_top_song
Diffstat (limited to 'docs')
-rw-r--r-- | docs/projects/img/one_top_song/demo.mp4 | bin | 0 -> 2059865 bytes | |||
-rw-r--r-- | docs/projects/img/one_top_song/difficult_beast.mp3 | bin | 0 -> 102967 bytes | |||
-rw-r--r-- | docs/projects/img/one_top_song/top_logo.png | bin | 0 -> 25935 bytes | |||
-rw-r--r-- | docs/projects/img/one_top_song/ui_desktop.png | bin | 0 -> 70797 bytes | |||
-rw-r--r-- | docs/projects/img/one_top_song/ui_mobile.png | bin | 0 -> 61731 bytes | |||
-rw-r--r-- | docs/projects/img/one_top_song/words_all.png | bin | 0 -> 84651 bytes | |||
-rw-r--r-- | docs/projects/img/one_top_song/words_frequent.png | bin | 0 -> 51920 bytes | |||
-rw-r--r-- | docs/projects/index.md | 11 | ||||
-rw-r--r-- | docs/projects/one_top_song.md | 527 |
9 files changed, 538 insertions, 0 deletions
diff --git a/docs/projects/img/one_top_song/demo.mp4 b/docs/projects/img/one_top_song/demo.mp4
Binary files differ
new file mode 100644
index 0000000..367daee
--- /dev/null
+++ b/docs/projects/img/one_top_song/demo.mp4
diff --git a/docs/projects/img/one_top_song/difficult_beast.mp3 b/docs/projects/img/one_top_song/difficult_beast.mp3
Binary files differ
new file mode 100644
index 0000000..963e093
--- /dev/null
+++ b/docs/projects/img/one_top_song/difficult_beast.mp3
diff --git a/docs/projects/img/one_top_song/top_logo.png b/docs/projects/img/one_top_song/top_logo.png
Binary files differ
new file mode 100644
index 0000000..1cdd290
--- /dev/null
+++ b/docs/projects/img/one_top_song/top_logo.png
diff --git a/docs/projects/img/one_top_song/ui_desktop.png b/docs/projects/img/one_top_song/ui_desktop.png
Binary files differ
new file mode 100644
index 0000000..5f44ca2
--- /dev/null
+++ b/docs/projects/img/one_top_song/ui_desktop.png
diff --git a/docs/projects/img/one_top_song/ui_mobile.png b/docs/projects/img/one_top_song/ui_mobile.png
Binary files differ
new file mode 100644
index 0000000..0aadb87
--- /dev/null
+++ b/docs/projects/img/one_top_song/ui_mobile.png
diff --git a/docs/projects/img/one_top_song/words_all.png b/docs/projects/img/one_top_song/words_all.png
Binary files differ
new file mode 100644
index 0000000..bf377b9
--- /dev/null
+++ b/docs/projects/img/one_top_song/words_all.png
diff --git a/docs/projects/img/one_top_song/words_frequent.png b/docs/projects/img/one_top_song/words_frequent.png
Binary files differ
new file mode 100644
index 0000000..a56ee5b
--- /dev/null
+++ b/docs/projects/img/one_top_song/words_frequent.png
diff --git a/docs/projects/index.md b/docs/projects/index.md
index 529f11d..12b21ed 100644
--- a/docs/projects/index.md
+++ b/docs/projects/index.md
@@ -8,6 +8,17 @@ MkDocs). But the few that do, are here.

 Projects below are sorted reverse chronologically (most recent first).

+## [One tøp song](one_top_song)
+
+![Screenshot of desktop UI](img/one_top_song/ui_desktop.png)
+
+On April 19, 2022, I released a web game made out of words that only
+appear in one twenty øne piløts song. It involves automation using curl,
+Python, and Unix utilities, but on top of it there's a lot of manual work.
+Here are the steps I took over the course of this project, from
+downloading the lyrics, to generating a dataset, and finally making
+a game.
+
 ## [Kanvas](kanvas)

 ![Screenshot of Kanvas 0.1.0](img/kanvas/screenshot_0.1.0.png)
diff --git a/docs/projects/one_top_song.md b/docs/projects/one_top_song.md
new file mode 100644
index 0000000..761b769
--- /dev/null
+++ b/docs/projects/one_top_song.md
@@ -0,0 +1,527 @@
+# One tøp song
+
+2022-05-08
+
+I'm a die-hard fan of twenty øne piløts (did you know they're a two piece
+band?) You can see this from the fact that I take the trouble to stylize
+the band name with ø's, even in its acronym, tøp. Therefore, you wouldn't
+expect neutrality from this blogpost.
+
+The band and its members, Tyler Joseph and Josh Dun, are known for
+a Grammy and two all-gold records on RIAA, but to me they're irrelevant
+(the awards, not the members). I like the vibe of their songs and
+especially the lyrics. For example, take a look at the insightful final
+lines from _Pet Cheetah_ (Trench) that build up to a pumping crescendo:
+
+> Pet cheetah, cheetah
+> Pet cheetah, cheetah
+> Pet cheetah, cheetah
+> Pet cheetah, cheetah
+> Pet cheetah, cheetah
+> Pet cheetah, cheetah
+> Pet cheetah, cheetah
+> Pet cheetah, cheetah
+> Pet cheetah, cheetah
+> Pet cheetah, cheetah
+> Pet cheetah, cheetah
+> Pet cheetah, cheetah
+> Pet cheetah, cheetah
+> Pet cheetah, cheetah
+> Pet cheetah, cheetah
+> Pet cheetah, cheetah
+
+Whatever you say, I think it's a one-of-a-kind song that discusses making
+music for a fanbase (yes they make a lot of meta songs like this). The
+lines above are simple, but unique as well.
Among all tøp songs, this is
+the only one that features the word "pet", and also "cheetah", just like
+how _Nico And The Niners_ is the only one with "Nico" and "Niners". Wait,
+that's not right, because "Nico" appears earlier in the album, in the
+second verse of _Morph_.
+
+This got me thinking: How many words are there that appear in
+only one twenty øne piløts song? And to pay off my efforts, can I turn
+this into a fun game for other tøp fans to play?
+
+For the impatient, you may skip all the procedures and technicalities. Go
+ahead and check out the [results](#results). Everyone else, please take
+your time on your ride.
+
+## Step 1: Download the lyrics
+
+This isn't as easy as it seemed, nor is it too hard. The lyric provider is
+azlyrics.com, because it works without JavaScript and serves
+machine-readable HTML. So I went ahead and curl'd a random page.
+
+```
+$ curl https://www.azlyrics.com/lyrics/twentyonepilots/truce.html
+<html>
+<head><title>302 Found</title></head>
+<body>
+<center><h1>302 Found</h1></center>
+<hr><center>nginx</center>
+</body>
+</html>
+```
+
+OK, time to `man curl` for the option to follow redirects. It's `-L`,
+btw. (HTML prettified)
+
+```
+$ curl -L https://www.azlyrics.com/lyrics/twentyonepilots/truce.html
+<!DOCTYPE html>
+<html lang="en">
+  <head>
+    <!-- some meta tags -->
+    <title>AZLyrics - request for access</title>
+    <!-- some stylesheets -->
+    <!-- some <IE9 compat scripts -->
+    <!-- jquery and the like -->
+    <!-- recaptcha script -->
+  </head>
+
+  <body>
+    <nav>...</nav>
+    <!-- a commented out banner -->
+
+    <!-- a few nested divs -->
+    Access denied.
+    <!-- end nested divs -->
+
+    <!-- a commented out block with the note "bot ban" -->
+
+    <!-- footer -->
+  </body>
+</html>
+```
+
+Damn, that's pretty… nasty, but it's exactly how I expected it to go. Now,
+I've done a lot of web scraping, so I know it's possible to fake a few
+HTTP headers to give curl some human skin.
The most common headers are:
+
+- Referer
+- Cookie
+- User-Agent
+
+So I tried them one by one. User-Agent worked.
+
+```
+$ curl -L -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:100.0) Gecko/20100101 Firefox/100.0' \
+    https://www.azlyrics.com/lyrics/twentyonepilots/truce.html
+```
+
+What is extra funny though, is that the server accepts even an empty UA
+string.
+
+```
+$ curl -L -H 'User-Agent: ' https://www.azlyrics.com/lyrics/twentyonepilots/truce.html
+```
+
+I couldn't resist:
+
+```
+$ curl -L -H 'User-Agent: definitely not curl' https://www.azlyrics.com/lyrics/twentyonepilots/truce.html
+```
+
+Above the lyrics an HTML comment reads:
+
+> Usage of azlyrics.com content by any third-party lyrics provider is
+> prohibited by our licensing agreement. Sorry about that.
+
+This won't stop me because I can't read.
+
+So long story short, I curl'd the twenty øne piløts index page and
+BeautifulSoup'd all the title-URL pairs, which were in turn curl'd and
+BeautifulSoup'd.
+
+Soon I had a directory full of [song title].txt, but not all of them were
+useful. A few songs are not technically part of tøp's canon discography
+(some fans are gonna disagree on this one but I don't care), like the
+Elvis cover _Can't Help Falling In Love_, which is just a [YouTube
+video](https://www.youtube.com/watch?v=6ThQkrXHdh4) of Tyler singing in
+the street; another one, [_Coconut Sharks In The
+Water_](https://youtu.be/jFwsnrkK9sU), although well-known among fans, was
+only performed once for comical effect in 2011. In the end, I included
+their six studio albums and five singles, totaling 79 songs.
+
+On to step 2!
+
+## Step 2: Look for every word
+
+This is the core part of the project. I knew it was impossible by hand, so
+I sat down to write an algorithm in Python.
It goes like this in
+pseudocode:
+
+```
+lyrics = dict()
+for song in all_songs:
+    lyrics[song] = read(song + ".txt").split_words()
+
+for song in all_songs:
+    other_songs = list(s in all_songs such that s != song)
+    for word in lyrics[song]:
+        found = False
+        for other_song in other_songs:
+            if lyrics[other_song].includes(word):
+                found = True
+
+        if not found:
+            append("results.txt", song + "\t" + word)
+```
+
+The latter block had three nested for loops. To optimize it a bit, I read
+all files beforehand, split each one up into individual words, then threw
+them into a set to remove the duplicates. As for the third for loop,
+I _could_ call `break` right after `found = True`, but instead resorted to
+the magic of list comprehension (variable names and structure taken from
+pseudocode above):
+
+```
+        if not any([(word in lyrics[o]) for o in other_songs]):
+            append("results.txt", song + "\t" + word)
+```
+
+I like to imagine Python optimized this one for me, but I'm not sure.
+Anyway, even if it didn't, this shouldn't be too bad. Plus, I like
+one-liners.
+
+When the words are split, they are converted to lowercase. Punctuation marks
+and suffixes like 's and 'd are removed, but I forgot to remove 've.
+Fortunately there weren't many of them, so I removed them by hand.
+
+You can read the real source code here:
+[`data/one_song_words.py`](https://git.sr.ht/~fkfd/one_top_song/tree/main/item/data/one_song_words.py)
+
+## Step 3: Dedupe
+
+The previous step brought about a problem. The script I wrote treated
+inflections as separate words, e.g. "vibe" (_Chlorine_), "vibes" and
+"vibing" (_The Outside_). So I wrote a script to find most of them.
+
+The script reports occurrences of the following inflections of `word`:
+
+```
+word + "s",
+word + "es",
+word + "d",
+word + "ed",
+word + "ing",
+```
+
+and also in reverse, if `word` is already inflected:
+
+```
+re.sub("s$", "", word),
+re.sub("es$", "", word),
+re.sub("d$", "", word),
+re.sub("ed$", "", word),
+re.sub("ing$", "", word),
+```
+
+When I ran it, it caught most of the offenders, like
+"vibe" vs. "vibes", but not more subtle ones like "vibing". I ended up
+removing them again by hand, but it's possible I missed some.
+
+Why didn't I just tell the script to remove the inflections automatically?
+Because there were false positives. For example, "sing" (_Bandito_ and
+many others) and "singed" (_Leave The City_) are not the same thing. Other
+examples include "to" and "toes", "she" and "shed", "not" and "notes",
+"even" and "evening", etc. Also, although some pairs are of the same
+origin, they're pretty different semantically, like "weathered"
+(_Chlorine_) and "weather" (_Good Day_ and _Migraine_). Leaving these
+alone, I axed everything else from my list.
+
+Source code: [`data/dedupe.py`](https://git.sr.ht/~fkfd/one_top_song/tree/main/item/data/dedupe.py)
+
+## Step 4: Manual inspection
+
+It was at this moment that I realized that I had forgotten about stuff like
+"[x10]" (_Holding On To You_) that marks a repeated line. There were some
+onomatopoeic words like "mm-mm" (_Choker_), too, and don't get me started
+on hyphenated words: there were "treehouse" (_Forest_) and "tree-house"
+(_Stressed Out_). Words like "migraine", which comes from a song titled
+_Migraine_, are too easy for a game, so they are not included either.
+I also capitalized proper nouns like "Monday", and removed trailing
+periods and commas from every line I could find. In retrospect it could
+have been easier if I had sanitized the lyric files from the beginning. At
+this moment there are 1,002 words left, but I don't know if there's more
+to knock out.
I doubt anyone will notice.
+
+Here's a fun story: after I deployed the app (yes there'll be a web app at
+the end) on r/twentyonepilots, one player reported an incorrect lyric from
+_Migraine_:
+
+> A difficult to be, stop feasting lumber-down trees
+
+At first glance this lyric seemed unfamiliar to me, and it definitely
+isn't grammatically correct. I checked multiple sources: on azlyrics of
+course it's this one, but on
+[Genius](https://genius.com/Twenty-one-pilots-migraine-lyrics) it says
+otherwise:
+
+> A difficult beast feasting on burnt down trees
+
+Oops, better go check out the description from the [official
+audio](https://www.youtube.com/watch?v=Bs92ejAGLdw) on Fueled By Ramen's
+(tøp's label, FBR for short) YouTube channel:
+
+> a difficult to be, stop feasting lumber down trees
+
+And [this video at 14:40](https://youtu.be/HutQvZWJ_60?t=880) on Warner
+Music Japan's channel with Japanese and English subtitles:
+
+> 燒け落ちた木々貪り食う、気難しい野獸
+> A difficult beast feasting on burnt down trees
+
+Well, I tried.
+
+So to settle this the only thing I could do was find out by myself.
+I grabbed [WrightP's Official Acapella
+version](https://www.youtube.com/watch?v=qGLEH_VeCpE) and extracted that
+bit with Audacity. I slowed it down 50%, and it sounds like this:
+
+<audio controls src="../img/one_top_song/difficult_beast.mp3"></audio>
+
+Let me explain what I heard:
+
+> A difficult-a beast-a feasting-on bur- down trees
+
+The "ng" sound between "feasting" and "on" is audible. There is no "l"
+sound as in "lumber-down", and there is no /ɒ/ or /ɑ/ sound following
+"st", which rules out "stop".
+
+That settles it: Genius and WMG Japan are right, azlyrics and FBR are
+wrong. I suspect that azlyrics got its lyrics from FBR in the first place.
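
Going back to the cleanup described at the start of this step: the post notes that sanitizing the lyric files up front would have saved manual work. Below is a minimal Python sketch of what such a pass might look like. It is not taken from the project's scripts; the function name `sanitize_line` and the exact marker pattern are assumptions, and the sample strings are made up rather than real lyrics.

```python
import re

# Assumed marker format: "[x10]", "[x2]", etc., plus any whitespace before it.
REPEAT_MARKER = re.compile(r"\s*\[x\d+\]")


def sanitize_line(line: str) -> str:
    """Drop repeat markers, then strip whitespace and trailing periods/commas."""
    without_marker = REPEAT_MARKER.sub("", line)
    return without_marker.strip().rstrip(".,")


if __name__ == "__main__":
    # Made-up sample lines, not actual lyrics.
    for raw in ["some repeated line [x10]", "a line ending with a comma, "]:
        print(sanitize_line(raw))
```

One caveat with this sketch: `rstrip(".,")` also eats a trailing ellipsis, so lines that rely on one would need special handling.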
+
+Track-word pairs:
+[`data/tracks_words`](https://git.sr.ht/~fkfd/one_top_song/tree/main/item/data/tracks_words)
+
+## Step 5: Generating a dataset
+
+Now that I have a 1000-something-line-long file of tab-separated track titles
+and unique words, it's time to generate a dataset for the game. Since I'll be
+producing a web game, the language is gonna be JavaScript, so the dataset
+will be in JSON. The first challenge is we need to know the line
+each word came from. This way if the player fails to recall it,
+we'll show them the line and they will go "hmm, yeah, Tyler really *did*
+sing this". But you see, my step 2 script completely scrambled the lyrics.
+So I wrote another Python script to "grep" them from the giant heap of txt
+files. It was pretty easy, and moments later I had a JSON file
+structured like this:
+
+```
+[
+    {
+        "track": "Redecorate",
+        "word": "blankets",
+        "lines": [
+            "Then one night she got cold with no blankets on her bed",
+            "Blankets over mirrors, she tends to like it"
+        ]
+    },
+    {...},{...},...
+]
+```
+
+I should try to shrink the 135kB (kilo, not kibi) dataset. First, the
+prettyprint was unnecessary, so let's do away with it. It instantly went down
+to 99kB. However, having everything on one line makes batch editing in vim
+a huge pain, and every launch took seconds. So as a compromise I inserted
+a linebreak after every word object, so for x words there would be (x+2)
+lines including the brackets. 1kB well spent. The JSON file is now a neat
+100kB, which is a 26% optimization compared to the initial 135kB.
+
+However, as I was coding JavaScript I realized that, since we're using the
+dataset as a JavaScript object, we don't have to play by JSON's rules.
+This means no more double quotes around keys! Each word object has
+6 double quotes, 6 times 1000 is… 6kB! That's right, we just shrank the
+dataset to 94kB. Now that's a 30% optimization. All by frugal management
+of whitespace.
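
The one-object-per-line layout and the percentages quoted above can be sketched in Python as follows. This is an illustration only, not the code in `data/mkjson.py`: the helper names `dump_one_per_line` and `percent_saved` are hypothetical, and the compact `separators` setting is an assumption, so byte counts from this sketch need not match the figures in the post.

```python
import json


def dump_one_per_line(entries: list) -> str:
    """Serialize entries as a JSON array with one compact object per line.

    For x entries (x >= 1) the result has x + 2 lines, brackets included.
    """
    rows = [
        json.dumps(entry, ensure_ascii=False, separators=(",", ":"))
        for entry in entries
    ]
    return "[\n" + ",\n".join(rows) + "\n]\n"


def percent_saved(before_kb: float, after_kb: float) -> int:
    """Size reduction relative to the original size, rounded to a whole percent."""
    return round((before_kb - after_kb) / before_kb * 100)


if __name__ == "__main__":
    # Placeholder entries using the same keys as the dataset shown above.
    sample = [
        {"track": "Track A", "word": "alpha", "lines": ["first line"]},
        {"track": "Track B", "word": "beta", "lines": ["second line", "third line"]},
    ]
    print(dump_one_per_line(sample), end="")
    print(percent_saved(135, 100), percent_saved(135, 94))
```

The output is still valid JSON, so it can be parsed back with `json.loads`; dropping the quotes around keys, as the post does next, is a JavaScript-only step and is not covered here.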
+
+Later I found it would be better if I tagged the _album_ to each word, but
+it would be super redundant. So instead, I placed lists of tracks in each
+album inside another JS file that is loaded alongside the words.
+
+JSON generator script: [`data/mkjson.py`](https://git.sr.ht/~fkfd/one_top_song/tree/main/item/data/mkjson.py)
+
+Datasets: [`data/words.json`](https://git.sr.ht/~fkfd/one_top_song/tree/main/item/data/words.json),
+[`words.js`](https://git.sr.ht/~fkfd/one_top_song/tree/main/item/words.js),
+and [`albums.js`](https://git.sr.ht/~fkfd/one_top_song/tree/main/item/albums.js)
+
+## Step 6: Design the game UI
+
+I thought I despised the "mobile first" approach, but it turns out what I hated
+was the "mobile only" garbage. [Mobile Wikipedia] actually works remarkably
+well on desktop. What I'm doing is so much simpler than Wikipedia. The page
+contains the following fundamental elements:
+
+- the word
+- textbox for user input
+- candidate list
+- controls
+
+I swear, the desktop version works just as smoothly as on mobile (although
+I failed to center a few elements).
+
+![Desktop UI](img/one_top_song/ui_desktop.png)
+
+![Mobile UI](img/one_top_song/ui_mobile.png)
+
+▲ Notice that "twenty øne piløts" are joined with non-breaking spaces
+
+And my absolute favorite thing here is the candidate list. I wouldn't expect
+anyone to type "House Of Gold" in its entirety, would I? Of course there should
+be some sort of search suggestion. The candidate list I implemented tries to
+match user input against the beginning of each song title, as well as acronyms.
+For example "hot" gives you _Holding On To You_. A hack was written for
+_Heavydirtysoul_ so that "hds" would match it.
+
+Oh, I almost forgot: the three buttons are twenty øne piløts-themed.
+
+![The classic |-/ logo: blue vertical bar, black dash, and red slash](img/one_top_song/top_logo.png)
+
+▲ Former tøp logo from the Regional at Best era
+
+## Step 7: Game logic
+
+From this point there are no more repetitive chores, and I can finally focus on
+making a game. The concept is simple: the player tries to guess the song
+that a word came from.
+
+Let me enumerate the steps in which the player would interact with my game:
+
+1. Game shows random word taken from dataset
+2. Player types track title into textbar, confirms
+3. Game indicates correct answer, shows album and line
+4. Player clicks Next, go to 1
+
+The player might not always be right. In that case the flow would be:
+
+1. Game shows random word taken from dataset
+2. Player types track title into textbar, confirms
+3. Game indicates wrong answer
+4. Player tries again, go to 2; or clicks Next, go to 1
+
+We need some hint mechanism so a clueless player has a chance of recalling
+something.
+
+1. Game shows random word taken from dataset
+2. Player does nothing, or makes incorrect guesses
+3. Player clicks Hint
+4. Game reveals some information about correct answer unless hints are
+   depleted. Go to 2
+
+I wanted this game to be as pressure-free as possible. Therefore, players
+can skip words or show the answer at any time, and there are no scorekeeping
+counters or timers. After every 50 guesses, the game reminds the player
+to take a rest.
+
+Source code: [`index.js`](https://git.sr.ht/~fkfd/one_top_song/tree/main/item/index.js)
+
+## Step 8: Debugging
+
+The game was designed to run offline. The server, if any, is there just to
+send you the HTML, stylesheet, and JavaScript for datasets and the game
+itself. This means it is possible to do everything in a `file://` browser
+tab.
+
+Because the web game is designed "mobile first" (but in a good way),
+I tested the UI extensively with and without DevTools mobile emulator, and
+on my phone.
This way I figured out what interactions worked best on both
+keyboard and touchscreen.
+
+As to the JavaScript, I did not exactly enjoy writing it, but it wasn't
+hellish suffering either. I no longer "hate" JavaScript; I just want to
+stay away from it from now on. I would describe my code as *pretty*
+type-safe… until it isn't.
+
+## Step 9: Visualizing and having fun with the dataset
+
+No, it's not about fancy charts or scatter plots. I just thought it would be
+helpful if we could display all the words in a table, so I made a webpage
+for that. Fun fact: I gave up indentation for all the `<tr>` tags.
+Otherwise there would be 28\*1002 = 28kB of wasted data.
+
+![Table of a few tracks, words that only appear in each one, and respective
+lines](img/one_top_song/words_all.png)
+
+Then I thought, "hey, what if I pulled up a list of most frequently used
+English words and compared that to those I found?" So I downloaded a list from
+Wiktionary titled [Frequency
+lists/TV/2006/1-1000](https://en.wiktionary.org/wiki/Wiktionary:Frequency_lists/TV/2006/1-1000)
+which is the top 1000 words used in "a collection of TV and movie
+scripts/transcripts" as of 2006. This time though, I made more use of Unix
+tools. It worked like this (the 1000-word list was saved in file `1000`):
+
+```
+$ cut -f2 tracks_words | sort > /tmp/top # extract word from "track<tab>word"
+$ sort 1000 > /tmp/freq
+$ comm -12 /tmp/top /tmp/freq # find common words between the two files
+ahead
+anybody
+anyway
+...
+```
+
+And here we have the most frequent 88 words:
+
+![Table of a few words, and the track they are in](img/one_top_song/words_frequent.png)
+
+I ran some more stupid analysis on the dataset and found that the only
+song that had absolutely no unique word is _Truce_ (a bad day to the _Truce_
+fans out there, eh?), and songs closest to zero are _Before You Start Your
+Day_ and _Trees_, contributing 2 each.
The figures go all the way up to 51:
+_Neon Gravestones_, which is basically a rapped-out essay, has the most
+expansive vocabulary among all tøp songs. I wrote all my interesting findings
+in the trivia section for players to discover.
+
+The scripts I used to generate HTML:
+[`data/mkhtml_all.py`](https://git.sr.ht/~fkfd/one_top_song/tree/main/item/data/mkhtml_all.py),
+and [`data/mkhtml_freq.py`](https://git.sr.ht/~fkfd/one_top_song/tree/main/item/data/mkhtml_freq.py)
+
+The HTML: [`words.html`](https://git.sr.ht/~fkfd/one_top_song/tree/main/item/words.html)
+
+## Step 10: Deployment
+
+The only thing it takes to deploy a static website is `scp`. `rsync` if
+you have lots of data. Let's calculate how much data we have to transfer.
+
+File        | Size (kB)
+------------|------------
+index.html  | 9.6
+words.html  | 88
+index.css   | 1.7
+index.js    | 6.5
+words.js    | 94
+albums.js   | 2.3
+img/\*.jpg  | 115.4
+__Total__   | __317.5__
+
+Incidentally, this is how much my game will consume from a player's data
+plan. *I* think it's small enough for anyone.
+
+## Results
+
+On April 19, 2022,
+I [published](https://www.reddit.com/r/twentyonepilots/comments/u68pzy/this_word_only_appears_in_one_twenty_%C3%B8ne_pil%C3%B8ts/)
+a version I thought was stable enough to r/twentyonepilots. It got
+reasonably popular. You can play it here:
+[One tøp song](https://fkfd.me/toys/one_top_song/)
+
+Here's a demo video (2.0 MiB):
+
+<video controls> <source src="../img/one_top_song/demo.mp4" /> </video>
+
+The source code (MIT) is [here](https://git.sr.ht/~fkfd/one_top_song). If
+you want, you can download lyrics to your favorite artists' songs and
+generate your own dataset to play with. A redditor considered Taylor
+Swift, and I'm looking forward to their progress.
+
+In conclusion, I think I did a pretty good job at extracting,
+representing, and toying with data, but the process left a lot to improve.
+NLP connoisseurs are gonna be mad at me for not using this and that
+library, and some Unix guru might be capable of rewriting my Python
+scripts with sed, awk, and jq. I do not care. The final product is one of
+my better interactive web designs, made with no framework and minimal
+assets. The game is not designed to be addictive, unlike
+$insertGameNameHere. It is, after all, just for fun; in the disclaimer
+I wrote that the game is "not a tool for gatekeeping." That's how things
+are supposed to work.