# One tøp song
2022-05-08
I'm a die-hard fan of twenty øne piløts (did you know they're a two-piece
band?). You can see this from the fact that I take the trouble to stylize
the band name with ø's, even in its acronym, tøp. Therefore, you wouldn't
expect neutrality from this blogpost.
The band and its members, Tyler Joseph and Josh Dun, are known for
a Grammy and two albums with every track certified gold by the RIAA, but
to me they're irrelevant (the awards, not the members). I like the vibe of their songs and
especially the lyrics. For example, take a look at the insightful final
lines from _Pet Cheetah_ (Trench) that build up to a pumping crescendo:
> Pet cheetah, cheetah
> Pet cheetah, cheetah
> Pet cheetah, cheetah
> Pet cheetah, cheetah
> Pet cheetah, cheetah
> Pet cheetah, cheetah
> Pet cheetah, cheetah
> Pet cheetah, cheetah
> Pet cheetah, cheetah
> Pet cheetah, cheetah
> Pet cheetah, cheetah
> Pet cheetah, cheetah
> Pet cheetah, cheetah
> Pet cheetah, cheetah
> Pet cheetah, cheetah
> Pet cheetah, cheetah
Whatever you say, I think it's a one-of-a-kind song that discusses making
music for a fanbase (yes they make a lot of meta songs like this). The
lines above are simple, but unique as well. Among all tøp songs, this is
the only one that features the word "pet", and also "cheetah", just like
how _Nico And The Niners_ is the only one with "Nico" and "Niners". Wait,
that's not right, because "Nico" appears earlier in the album, in the
second verse of _Morph_.
This got me thinking: how many words appear in only one
twenty øne piløts song? And to make the effort pay off, can I turn
this into a fun game for other tøp fans to play?
If you're impatient, you may skip the procedure and technicalities and go
straight to the [results](#results). Everyone else, please take
your time on the ride.
## Step 1: Download the lyrics
This wasn't as easy as it seemed, but it wasn't too hard either. The lyric
provider is azlyrics.com, because it works without JavaScript and serves
machine-readable HTML. So I went ahead and curl'd a random page.
```
$ curl https://www.azlyrics.com/lyrics/twentyonepilots/truce.html
302 Found
302 Found
nginx
```
OK, time to `man curl` for the option to follow redirections. It's `-L`,
btw. (HTML prettified)
```
$ curl -L https://www.azlyrics.com/lyrics/twentyonepilots/truce.html
AZLyrics - request for access
Access denied.
```
Damn, that's pretty… nasty, but it's exactly how I expected it to go. Now,
I've done a lot of web scraping, so I know it's possible to fake a few
HTTP headers to give curl some human skin. The most common headers are:
- Referer
- Cookie
- User-Agent
So I tried them one by one. User-Agent worked.
```
$ curl -L -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:100.0) Gecko/20100101 Firefox/100.0' \
https://www.azlyrics.com/lyrics/twentyonepilots/truce.html
```
What is extra funny though, is that the server accepts even an empty UA
string.
```
$ curl -L -H 'User-Agent: ' https://www.azlyrics.com/lyrics/twentyonepilots/truce.html
```
I couldn't resist:
```
$ curl -L -H 'User-Agent: definitely not curl' https://www.azlyrics.com/lyrics/twentyonepilots/truce.html
```
Above the lyrics an HTML comment reads:
> Usage of azlyrics.com content by any third-party lyrics provider is
> prohibited by our licensing agreement. Sorry about that.
This won't stop me because I can't read.
So long story short, I curl'd the twenty øne piløts index page and
BeautifulSoup'd all the title-URL pairs, which are once again curl'd and
BeautifulSoup'd.
Soon I had a directory full of [song title].txt, but not all of them are
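Here's a minimal sketch of that fetch-and-parse loop, not the script actually used. To stay dependency-free it swaps BeautifulSoup for the standard library's `HTMLParser`. The names `fetch`, `SongLinkParser`, and `song_links` are hypothetical, and the `/lyrics/twentyonepilots/` marker is only an assumption based on the URL shown above.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen

# Same browser-like User-Agent as in the curl example above.
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64; rv:100.0) Gecko/20100101 Firefox/100.0"


def fetch(url: str) -> str:
    """Download a page while sending a custom User-Agent header."""
    request = Request(url, headers={"User-Agent": USER_AGENT})
    with urlopen(request) as response:  # urlopen follows redirects by default
        return response.read().decode("utf-8", errors="replace")


class SongLinkParser(HTMLParser):
    """Collect (title, href) pairs from <a> tags whose href contains a marker."""

    def __init__(self, marker: str):
        super().__init__()
        self.marker = marker
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href and self.marker in href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


def song_links(index_html: str, marker: str = "/lyrics/twentyonepilots/"):
    """Return every (title, href) pair found in the index page HTML."""
    parser = SongLinkParser(marker)
    parser.feed(index_html)
    return parser.links
```

Feeding `fetch(index_url)` into `song_links` would give the title-URL pairs; each URL then goes through `fetch` again to get the lyric page.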
useful. A few songs are not technically part of tøp's canon discography
(some fans are gonna disagree on this one but I don't care), like the
Elvis cover _Can't Help Falling In Love_, which is just a [YouTube
video](https://www.youtube.com/watch?v=6ThQkrXHdh4) of Tyler singing in
the street; another one, [_Coconut Sharks In The
Water_](https://youtu.be/jFwsnrkK9sU), although well-known among fans, was
only performed once for comical effect in 2011. In the end, I included
their six studio albums and five singles, totaling 79 songs.
On to step 2!
## Step 2: Look for every word
This is the core part of the project. I knew it would be impossible by
hand, so I sat down to write an algorithm in Python. It goes like this in
pseudocode:
```
lyrics = dict()
for song in all_songs:
    lyrics[song] = read(song + ".txt").split_words()

for song in all_songs:
    other_songs = list(s in all_songs such that s != song)
    for word in lyrics[song]:
        found = False
        for other_song in other_songs:
            if lyrics[other_song].includes(word):
                found = True
        if not found:
            append("results.txt", song + "\t" + word)
```
The latter block had three nested for loops. To optimize it a bit, I read
all files beforehand, split each one up into individual words, then threw
them into a set to remove the duplicates. As for the third for loop,
I _could_ call `break` right after `found = True`, but instead resorted to
the magic of list comprehension (variable names and structure taken from
pseudocode above):
```
if not any([(word in lyrics[o]) for o in other_songs]):
    append("results.txt", song + "\t" + word)
```
I like to imagine Python optimized this one for me, but I'm not sure.
Anyway, even if it doesn't, this shouldn't be too bad. Plus, I like
one-liners.
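On the optimization question: `any()` stops at the first true value only when it is fed a generator expression. With a list comprehension, the whole list is built before `any()` sees it. Below is a minimal sketch of a different approach, not the linked script: count how many songs each word appears in, then keep the words whose count is 1. The function name `one_song_words` is hypothetical and the data in the usage note is a toy example.

```python
from collections import Counter


def one_song_words(lyrics: dict[str, set[str]]) -> list[tuple[str, str]]:
    """Return sorted (song, word) pairs for words that occur in exactly one song."""
    # Every song contributes a set, so each word is counted once per song.
    song_count = Counter(word for words in lyrics.values() for word in words)
    return sorted(
        (song, word)
        for song, words in lyrics.items()
        for word in words
        if song_count[word] == 1
    )
```

With a toy input like `{"A": {"pet", "nico"}, "B": {"nico", "niners"}}` this keeps `pet` for A and `niners` for B, and it touches each word a constant number of times instead of once per other song.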
When splitting words, they are converted to lowercase. Punctuation marks
and suffixes like 's and 'd are removed, but I forgot to remove 've.
Fortunately there weren't many of them, so I removed them by hand.
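The normalization described above could look roughly like this. It is a sketch under assumptions, not the real `split_words`: it splits on whitespace, strips ASCII punctuation from both ends of each token, and drops the suffixes 's, 'd, and 've (including the one that was forgotten).

```python
import string

# Suffixes to drop; 've is included here even though the original run missed it.
SUFFIXES = ("'s", "'d", "'ve")


def split_words(text: str) -> set[str]:
    """Lowercase the text and return its set of normalized words."""
    words = set()
    for token in text.lower().split():
        token = token.replace("\u2019", "'")      # curly apostrophe to straight
        token = token.strip(string.punctuation)   # trim leading/trailing marks
        for suffix in SUFFIXES:
            if token.endswith(suffix):
                token = token[: -len(suffix)]
                break
        if token:
            words.add(token)
    return words
```

Because the result is a set, duplicates within a song vanish for free, which is what the membership tests in step 2 rely on.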
You can read the real source code here:
[`data/one_song_words.py`](https://git.sr.ht/~fkfd/one_top_song/tree/main/item/data/one_song_words.py)
## Step 3: Dedupe
The previous step brought about a problem. The script I wrote treated
inflections as separate words, e.g. "vibe" (_Chlorine_), "vibes" and
"vibing" (_The Outside_). So I wrote a script to find most of them.
The script reports occurrences of the following inflections of `word`:
```
word + "s",
word + "es",
word + "d",
word + "ed",
word + "ing",
```
and also in reverse, if `word` is already inflected:
```
re.sub("s$", "", word),
re.sub("es$", "", word),
re.sub("d$", "", word),
re.sub("ed$", "", word),
re.sub("ing$", "", word),
```
When I ran it, it caught most of the offenders, like "vibe" vs. "vibes",
but not subtler ones like "vibing" (dropping the "e" before "-ing" defeats
plain suffix matching). I ended up
removing them again by hand, but it's possible I missed some.
Why didn't I just tell the script to remove the inflections automatically?
Because there were false positives. For example, "sing" (_Bandito_ and
many others) and "singed" (_Leave The City_) are not the same thing. Other
examples include "to" and "toes", "she" and "shed", "not" and "notes",
"even" and "evening", etc. Also, although some pairs are of the same
origin, they're pretty different semantically, like "weathered"
(_Chlorine_) and "weather" (_Good Day_ and _Migraine_). Leaving these
alone, I axed everything else from my list.
Source code: [`data/dedupe.py`](https://git.sr.ht/~fkfd/one_top_song/tree/main/item/data/dedupe.py)
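For illustration, here is a minimal sketch of the reporting idea, separate from the linked script. It builds the same ten candidate forms for each word and reports pairs that both exist in the vocabulary, leaving the keep-or-drop decision to a human. The names `inflection_candidates` and `suspicious_pairs` are hypothetical.

```python
import re

SUFFIXES = ("s", "es", "d", "ed", "ing")


def inflection_candidates(word: str) -> set[str]:
    """Forms made by adding or stripping one of the listed suffixes."""
    forms = {word + suffix for suffix in SUFFIXES}
    forms |= {re.sub(suffix + "$", "", word) for suffix in SUFFIXES}
    forms.discard(word)  # stripping a suffix that isn't there changes nothing
    forms.discard("")
    return forms


def suspicious_pairs(words) -> list[tuple[str, str]]:
    """Report each pair of vocabulary words that look like inflections of each other."""
    vocab = set(words)
    return sorted(
        (word, other)
        for word in vocab
        for other in inflection_candidates(word) & vocab
        if word < other  # report each pair once
    )
```

Run on `["vibe", "vibes", "vibing", "sing", "singed", "to", "toes"]`, it flags vibe/vibes, sing/singed, and to/toes, and misses "vibing", which matches both the false positives and the blind spot described above.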
## Step 4: Manual inspection
It was at this moment that I realized that I had forgotten about stuff like
"[x10]" (_Holding On To You_) that marks a repeated line. There were some
onomatopoeic words like "mm-mm" (_Choker_), too, and don't get me started
on hyphenated words: there were "treehouse" (_Forest_) and "tree-house"
(_Stressed Out_). Words like "migraine", which comes from a song titled
_Migraine_, are too easy for a game, so they are not included either.
I also capitalized proper nouns like "Monday", and removed trailing
periods and commas from every line I could find. In retrospect it could
have been easier if I sanitized the lyric files from the beginning. At
this moment there are 1,002 words left, but I don't know if there's more
to knock out. I doubt anyone will notice.
Here's a fun story: after I deployed the app (yes there'll be a web app at
the end) on r/twentyonepilots, one player reported an incorrect lyric from
_Migraine_:
> A difficult to be, stop feasting lumber-down trees
At first glance this lyric seemed unfamiliar to me, and it definitely
isn't grammatically correct. I checked multiple sources: on azlyrics of
course it's this one, but on
[Genius](https://genius.com/Twenty-one-pilots-migraine-lyrics) it says
otherwise:
> A difficult beast feasting on burnt down trees
Oops, better go check out the description from the [official
audio](https://www.youtube.com/watch?v=Bs92ejAGLdw) on Fueled By Ramen's
(tøp's label, FBR for short) YouTube channel:
> a difficult to be, stop feasting lumber down trees
And [this video at 14:40](https://youtu.be/HutQvZWJ_60?t=880) on Warner
Music Japan's channel with Japanese and English subtitles:
> 燒け落ちた木々貪り食う、気難しい野獸
> A difficult beast feasting on burnt down trees
Well, I tried.
So to settle this the only thing I could do was find out by myself.
I grabbed [WrightP's Official Acapella
version](https://www.youtube.com/watch?v=qGLEH_VeCpE) and extracted that
bit with Audacity. I slowed it down by 50% and listened closely.
Here's what I heard:
> A difficult-a beast-a feasting-on bur- down trees
The "ng" sound between "feasting" and "on" is audible. There is no "l"
sound as in "lumber-down", and there is no /ɒ/ or /ɑ/ sound following
"st", which rules out "stop".
That settles it: Genius and WMG Japan are right, azlyrics and FBR are
wrong. I suspect that azlyrics got its lyrics from FBR in the first place.
Track-word pairs:
[`data/tracks_words`](https://git.sr.ht/~fkfd/one_top_song/tree/main/item/data/tracks_words)
## Step 5: Generating a dataset
Now that I have a 1000-something-line-long file of tab-separated track titles
and unique words, it's time to generate a dataset for the game. Since I'll be
producing a web game, the language is gonna be JavaScript, so the dataset
will be in JSON. The first challenge is that we need to know the line
each word came from. This way if the player fails to recall it,
we'll show them the line and they will go "hmm, yeah, Tyler really *did*
sing this". But you see, my step 2 script completely scrambled the lyrics.
So I wrote another Python script to "grep" them from the giant heap of txt
files. It was pretty easy, and moments later I had a JSON file
structured like this:
```
[
  {
    "track": "Redecorate",
    "word": "blankets",
    "lines": [
      "Then one night she got cold with no blankets on her bed",
      "Blankets over mirrors, she tends to like it"
    ]
  },
  {...},{...},...
]
```
I should try to shrink the 135kB (kilo, not kibi) dataset. First, the
prettyprint was unnecessary, so let's do away with it. It instantly went down
to 99kB. However, having everything on one line makes batch editing in vim
a huge pain, and every launch took seconds. So as a compromise I inserted
a linebreak after every word object, so for x words there would be (x+2)
lines including the brackets. 1kB well spent. The JSON file is now a neat
100kB, which is a 26% optimization compared to the initial 135kB.
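A rough sketch of the two ideas so far, assuming nothing about the real `mkjson.py`: find the lyric lines that contain a word, then serialize with no pretty-printing but one object per line. Both function names are hypothetical.

```python
import json
import re


def lines_with_word(lyric_lines, word):
    """Return the lyric lines that contain `word` as a whole word, ignoring case."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return [line for line in lyric_lines if pattern.search(line)]


def dump_one_per_line(entries) -> str:
    """Compact JSON with one object per line, plus a line for each bracket."""
    rows = [
        json.dumps(entry, ensure_ascii=False, separators=(",", ":"))
        for entry in entries
    ]
    return "[\n" + ",\n".join(rows) + "\n]"
```

The output is still valid JSON, and for x entries it has x + 2 lines, so line-based editing in vim keeps working.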
However, as I was coding JavaScript I realized that, since we're using the
dataset as a JavaScript object, we don't have to play by JSON's rules.
This means no more double quotes around keys! Each word object has
6 double quotes, 6 times 1000 is… 6kB! That's right, we just shrank the
dataset to 94kB. Now that's a 30% optimization. All by frugal management
of whitespace.
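The percentages above can be double-checked with a few lines of arithmetic. The sizes are the ones quoted in the text, and `percent_saved` is just a hypothetical helper name.

```python
def percent_saved(before_kb: float, after_kb: float) -> int:
    """Size reduction relative to the starting size, rounded to a whole percent."""
    return round(100 * (before_kb - after_kb) / before_kb)


# Sizes quoted in the text, in kB.
pretty, one_per_line, unquoted = 135, 100, 94

# Three keys per object, two quotes per key, roughly 1000 objects.
quote_bytes = 3 * 2 * 1000

print(percent_saved(pretty, one_per_line))  # 26
print(percent_saved(pretty, unquoted))      # 30
print(quote_bytes / 1000)                   # 6.0 (kB)
```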
Later I found it would be better if I tagged the _album_ to each word, but
it would be super redundant. So instead, I placed lists of tracks in each
album inside another JS file that is loaded alongside the words.
JSON generator script: [`data/mkjson.py`](https://git.sr.ht/~fkfd/one_top_song/tree/main/item/data/mkjson.py)
Datasets: [`data/words.json`](https://git.sr.ht/~fkfd/one_top_song/tree/main/item/data/words.json),
[`words.js`](https://git.sr.ht/~fkfd/one_top_song/tree/main/item/words.js),
and [`albums.js`](https://git.sr.ht/~fkfd/one_top_song/tree/main/item/albums.js)
## Step 6: Design the game UI
I thought I despised the "mobile first" approach, but it turns out what I hated
was the "mobile only" garbage. Mobile Wikipedia actually works remarkably
well on desktop. What I'm doing is so much simpler than Wikipedia. The page
contains the following fundamental elements:
- the word
- textbox for user input
- candidate list
- controls
I swear, the desktop version works just as smoothly as on mobile (although
I failed to center a few elements).
![Desktop UI](img/one_top_song/ui_desktop.png)
![Mobile UI](img/one_top_song/ui_mobile.png)
▲ Notice that "twenty øne piløts" are joined with non-breaking spaces
And my absolute favorite thing here is the candidate list. I wouldn't expect
anyone to type "House Of Gold" in its entirety, would I? Of course there should
be some sort of search suggestion. The candidate list I implemented tries to
match user input against the beginning of each song title, as well as acronyms.
For example "hot" gives you _Holding On To You_. A hack was written for
_Heavydirtysoul_ so that "hds" would match it.
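The matching rule can be sketched as follows. The game itself is JavaScript; this is Python only to stay consistent with the other sketches. The names `acronym` and `candidates` and the `aliases` parameter are hypothetical, and matching acronyms by prefix rather than exactly is an assumption.

```python
def acronym(title: str) -> str:
    """First letter of every word, lowercased: 'House Of Gold' becomes 'hog'."""
    return "".join(word[0] for word in title.split()).lower()


def candidates(query: str, titles, aliases=None):
    """Titles whose beginning, acronym, or hand-written alias starts with the query."""
    q = query.strip().lower()
    if not q:
        return []
    aliases = aliases or {}
    return [
        title
        for title in titles
        if title.lower().startswith(q)
        or acronym(title).startswith(q)
        or any(alias.startswith(q) for alias in aliases.get(title, ()))
    ]
```

A one-word title like _Heavydirtysoul_ has the acronym "h", so "hds" can only match through an explicit alias such as `{"Heavydirtysoul": ["hds"]}`, which is presumably why a hack was needed.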
Oh, I almost forgot: the three buttons are twenty øne piløts-themed.
![The classic |-/ logo: blue vertical bar, black dash, and red slash](img/one_top_song/top_logo.png)
▲ Former tøp logo from the Regional at Best era
## Step 7: Game logic
From this point there are no more repetitive chores, and I can finally focus on
making a game. The concept is simple: the player tries to guess the song
that a word came from.
Let me enumerate the steps in which the player would interact with my game:
1. Game shows random word taken from dataset
2. Player types track title into textbox, confirms
3. Game indicates correct answer, shows album and line
4. Player clicks Next, go to 1
The player might not be always right. In that case the flow would be:
1. Game shows random word taken from dataset
2. Player types track title into textbox, confirms
3. Game indicates wrong answer
4. Player tries again, go to 2; or clicks Next, go to 1
We need some hint mechanism so a clueless player has a chance of recalling
something.
1. Game shows random word taken from dataset
2. Player does nothing, or makes incorrect guesses
3. Player clicks Hint
4. Game reveals some information about correct answer unless hints are
depleted. Go to 2
I wanted this game to be as pressure-free as possible. Therefore, players
can skip words or show the answer at any time, and there are no scorekeeping
counters or timers. After every 50 guesses, the game reminds
the player to take a rest.
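The three flows above boil down to a small amount of per-round state. Here is a sketch in Python (the real game logic lives in `index.js`). The class name `Round`, the helper `needs_break`, and the idea that hints are a prepared list of strings are all assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Round:
    """State for one word: the answer, the prepared hints, and progress so far."""
    word: str
    track: str
    hints: list[str] = field(default_factory=list)
    hints_used: int = 0
    solved: bool = False

    def guess(self, title: str) -> bool:
        """Check a guess; a wrong guess changes nothing, so the player may retry."""
        correct = title.strip().lower() == self.track.lower()
        self.solved = self.solved or correct
        return correct

    def hint(self):
        """Reveal the next hint, or None once the hints are depleted."""
        if self.hints_used >= len(self.hints):
            return None
        self.hints_used += 1
        return self.hints[self.hints_used - 1]


def needs_break(total_guesses: int, every: int = 50) -> bool:
    """True on every 50th guess, when the rest reminder should appear."""
    return total_guesses > 0 and total_guesses % every == 0
```

Nothing here keeps score or time: a round only remembers whether it was solved and how many hints were shown.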
Source code: [`index.js`](https://git.sr.ht/~fkfd/one_top_song/tree/main/item/index.js)
## Step 8: Debugging
The game was designed to run offline. The server, if any, is there just to
send you the HTML, stylesheet, and JavaScript for datasets and the game
itself. This means it is possible to do everything in a `file://` browser
tab.
Because the web game is designed "mobile first" (but in a good way),
I tested the UI extensively with and without DevTools mobile emulator, and
on my phone. This way I figured out what interactions worked best on both
keyboard and touchscreen.
As to the JavaScript, I did not exactly enjoy writing it, but it wasn't
hellish suffering either. I no longer "hate" JavaScript; I just want to
stay away from it from now on. I would describe my code as *pretty*
type-safe… until it isn't.
## Step 9: Visualizing and having fun with the dataset
No, it's not about fancy charts or scatter plots. I just thought it would be
helpful if we could display all the words in a table, so I made a webpage
for that. Fun fact: I gave up indentation for the table's HTML tags.
Otherwise there would be 28\*1002 ≈ 28kB of wasted data.
![Table of a few tracks, words that only appear in each one, and respective
lines](img/one_top_song/words_all.png)
Then I thought, "hey, what if I pulled up a list of most frequently used
English words and compared that to those I found?" So I downloaded a list from
Wiktionary titled [Frequency
lists/TV/2006/1-1000](https://en.wiktionary.org/wiki/Wiktionary:Frequency_lists/TV/2006/1-1000)
which is the top 1000 words used in "a collection of TV and movie
scripts/transcripts" as of 2006. This time though, I made more use of Unix
tools. It worked like this (the 1000-word list was saved in file `1000`):
```
$ cut -f2 tracks_words | sort > /tmp/top # extract word from "track<TAB>word" lines
$ sort 1000 > /tmp/freq
$ comm -12 /tmp/top /tmp/freq # find common words between the two files
ahead
anybody
anyway
...
```
And here we have the 88 most frequent words:
![Table of a few words, and the track they are in](img/one_top_song/words_freq.png)
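For comparison, the same `cut | sort | comm -12` pipeline amounts to a set intersection. This is a sketch with a hypothetical function name, assuming each input line is `track<TAB>word` and the frequency list holds one word per line.

```python
def common_words(track_word_lines, frequent_words) -> list[str]:
    """Words present in both the track-word file and the frequency list."""
    ours = {
        line.rstrip("\n").split("\t")[1]   # second tab-separated field, like cut -f2
        for line in track_word_lines
        if "\t" in line
    }
    theirs = {word.strip() for word in frequent_words}
    return sorted(ours & theirs)           # like comm -12 on two sorted files
```

In practice both arguments could simply be open file objects, since iterating over a file yields its lines.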
I ran some more stupid analysis on the dataset and found that the only
song that had absolutely no unique word is _Truce_ (a bad day to the _Truce_
fans out there, eh?), and songs closest to zero are _Before You Start Your
Day_ and _Trees_, contributing 2 each. The figures go all the way up to 51:
_Neon Gravestones_, which is basically a rapped-out essay, has the most
expansive vocabulary among all tøp songs. I wrote all my interesting findings
in the trivia section for players to discover.
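Per-song tallies like these fall out of the track-word pairs with a counter. A minimal sketch with a hypothetical function name, and made-up placeholder tracks in the usage note:

```python
from collections import Counter


def unique_word_counts(pairs, all_tracks) -> dict[str, int]:
    """Number of one-song words per track, including tracks that have none."""
    counts = Counter(track for track, _word in pairs)
    # Looking up a missing key in a Counter yields 0, so empty tracks still appear.
    return {track: counts[track] for track in all_tracks}
```

Passing the full track list matters: with pairs `[("A", "x"), ("A", "y"), ("B", "z")]` and tracks `["A", "B", "C"]`, track C shows up with 0 instead of silently disappearing.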
The scripts I used to generate HTML:
[`data/mkhtml_all.py`](https://git.sr.ht/~fkfd/one_top_song/tree/main/item/data/mkhtml_all.py)
and [`data/mkhtml_freq.py`](https://git.sr.ht/~fkfd/one_top_song/tree/main/item/data/mkhtml_freq.py)
The HTML: [`words.html`](https://git.sr.ht/~fkfd/one_top_song/tree/main/item/words.html)
## Step 10: Deployment
The only thing it takes to deploy a static website is `scp`. `rsync` if
you have lots of data. Let's calculate how much data we have to transfer.
File | Size (kB)
------------|------------
index.html | 9.6
words.html | 88
index.css | 1.7
index.js | 6.5
words.js | 94
albums.js | 2.3
img/\*.jpg | 115.4
__Total__ | __317.5__
Incidentally, this is how much my game will consume from a player's data
plan. *I* think it's small enough for anyone.
## Results
On April 19, 2022,
I [published](https://www.reddit.com/r/twentyonepilots/comments/u68pzy/this_word_only_appears_in_one_twenty_%C3%B8ne_pil%C3%B8ts/)
a version I thought was stable enough on r/twentyonepilots. It got
reasonably popular. You can play it here:
[One tøp song](https://fkfd.me/toys/one_top_song/)
Here's a demo video (2.0 MiB):
The source code (MIT) is [here](https://git.sr.ht/~fkfd/one_top_song). If
you want, you can download lyrics to your favorite artists' songs and
generate your own dataset to play with. A redditor considered Taylor
Swift, and I'm looking forward to their progress.
In conclusion, I think I did a pretty good job at extracting,
representing, and toying with data, but the process left a lot to be improved.
NLP connoisseurs are gonna be mad at me for not using this and that
library, and some Unix guru might be capable of rewriting my Python
scripts with sed, awk, and jq. I do not care. The final product is one of
my better interactive web designs, made with no framework and minimal
assets. The game is not designed to be addictive, unlike
$insertGameNameHere. It is, after all, just for fun; in the disclaimer
I wrote that the game is "not a tool for gatekeeping." That's how things
are supposed to work.