# One tøp song

2022-05-08

I'm a die-hard fan of twenty øne piløts (did you know they're a two-piece
band?). You can see this from the fact that I take the trouble to stylize
the band name with ø's, even in its acronym, tøp. Therefore, you shouldn't
expect neutrality from this blog post.

The band and its members, Tyler Joseph and Josh Dun, are known for
a Grammy and two all-gold records on RIAA, but to me they're irrelevant
(the awards, not the members). I like the vibe of their songs and
especially the lyrics. For example, take a look at the insightful final
lines from _Pet Cheetah_ (Trench) that build up to a pumping crescendo:

> Pet cheetah, cheetah  
> Pet cheetah, cheetah  
> Pet cheetah, cheetah  
> Pet cheetah, cheetah  
> Pet cheetah, cheetah  
> Pet cheetah, cheetah  
> Pet cheetah, cheetah  
> Pet cheetah, cheetah  
> Pet cheetah, cheetah  
> Pet cheetah, cheetah  
> Pet cheetah, cheetah  
> Pet cheetah, cheetah  
> Pet cheetah, cheetah  
> Pet cheetah, cheetah  
> Pet cheetah, cheetah  
> Pet cheetah, cheetah  

Whatever you say, I think it's a one-of-a-kind song that discusses making
music for a fanbase (yes they make a lot of meta songs like this). The
lines above are simple, but unique as well. Among all tøp songs, this is
the only one that features the word "pet", and also "cheetah", just like
how _Nico And The Niners_ is the only one with "Nico" and "Niners". Wait,
that's not right, because "Nico" appears earlier in the album, in the
second verse of _Morph_.

This got me thinking: how many words appear in only one twenty øne piløts
song? And to make the effort pay off, can I turn this into a fun game for
other tøp fans to play?

If you're impatient, you may skip the procedure and the technicalities and
go straight to the [results](#results). Everyone else, please take
your time on your ride.

## Step 1: Download the lyrics

This wasn't as easy as it seemed, but it wasn't too hard either. I picked
azlyrics.com as the lyric provider because it works without JavaScript and
serves machine-readable HTML. So I went ahead and curl'd a random page.

```
$ curl https://www.azlyrics.com/lyrics/twentyonepilots/truce.html
<html>
<head><title>302 Found</title></head>
<body>
<center><h1>302 Found</h1></center>
<hr><center>nginx</center>
</body>
</html>
```

OK, time to `man curl` for the option to follow redirections. It's `-L`,
btw. (HTML prettified)

```
$ curl -L https://www.azlyrics.com/lyrics/twentyonepilots/truce.html
<!DOCTYPE html>
<html lang="en">
  <head>
    <!-- some meta tags -->
    <title>AZLyrics - request for access</title>
    <!-- some stylesheets -->
    <!-- some <IE9 compat scripts -->
    <!-- jquery and the like -->
    <!-- recaptcha script -->
  </head>

  <body>
    <nav>...</nav>
    <!-- a commented out banner -->

    <!-- a few nested divs -->
            Access denied.
    <!-- end nested divs -->

    <!-- a commented out block with the note "bot ban" -->

    <!-- footer -->
  </body>
</html>
```

Damn, that's pretty… nasty, but it's exactly how I expected it to go. Now,
I've done a lot of web scraping, so I know it's possible to fake a few
HTTP headers to give curl some human skin. The most common headers are:

- Referer
- Cookie
- User-Agent

So I tried them one by one. User-Agent worked.

```
$ curl -L -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:100.0) Gecko/20100101 Firefox/100.0' \
    https://www.azlyrics.com/lyrics/twentyonepilots/truce.html
```

What is extra funny though, is that the server accepts even an empty UA
string.

```
$ curl -L -H 'User-Agent: ' https://www.azlyrics.com/lyrics/twentyonepilots/truce.html
```

I couldn't resist:

```
$ curl -L -H 'User-Agent: definitely not curl' https://www.azlyrics.com/lyrics/twentyonepilots/truce.html
```

Above the lyrics an HTML comment reads:

> Usage of azlyrics.com content by any third-party lyrics provider is
> prohibited by our licensing agreement. Sorry about that.

This won't stop me because I can't read.

So long story short, I curl'd the twenty øne piløts index page and
BeautifulSoup'd all the title-URL pairs, which were once again curl'd and
BeautifulSoup'd.
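
A rough sketch of that fetch-and-parse loop, using only Python's standard library as a stand-in for curl and BeautifulSoup. The link pattern `/lyrics/twentyonepilots/` comes from the URLs above; the `fetch` helper, the `LinkCollector` class, and the sample HTML are hypothetical, and the real index page's markup may well differ.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


def fetch(url, user_agent="definitely not curl"):
    """Download a page with a custom User-Agent (urlopen follows redirects)."""
    request = Request(url, headers={"User-Agent": user_agent})
    with urlopen(request) as response:
        return response.read().decode("utf-8", errors="replace")


class LinkCollector(HTMLParser):
    """Collect (title, href) pairs for links that point at lyric pages."""

    def __init__(self, needle="/lyrics/twentyonepilots/"):
        super().__init__()
        self.needle = needle
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        if tag == "a" and href and self.needle in href:
            self._href = href
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


def song_links(index_html):
    collector = LinkCollector()
    collector.feed(index_html)
    return collector.links


# Made-up snippet shaped like an index page, for illustration only.
sample = (
    '<a href="/lyrics/twentyonepilots/truce.html">Truce</a> '
    '<a href="/about.html">About</a>'
)
print(song_links(sample))
```

Each returned href would then go through `fetch` again to get the lyric page itself.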

Soon I had a directory full of [song title].txt files, but not all of them
were useful. A few songs are not technically part of tøp's canon discography
(some fans are gonna disagree on this one but I don't care), like the
Elvis cover _Can't Help Falling In Love_, which is just a [YouTube
video](https://www.youtube.com/watch?v=6ThQkrXHdh4) of Tyler singing in
the street; another one, [_Coconut Sharks In The
Water_](https://youtu.be/jFwsnrkK9sU), although well-known among fans, was
only performed once for comical effect in 2011. In the end, I included
their six studio albums and five singles, totaling 79 songs.

On to step 2!

## Step 2: Look for every word

This is the core part of the project. I knew it was impossible by hand, so
I sat down to write an algorithm in Python. It goes like this in
pseudocode:

```
lyrics = dict()
for song in all_songs:
    lyrics[song] = read(song + ".txt").split_words()

for song in all_songs:
    other_songs = list(s in all_songs such that s != song)
    for word in lyrics[song]:
        found = False  # reset for every word
        for other_song in other_songs:
            if lyrics[other_song].includes(word):
                found = True

        if not found:
            append("results.txt", song + "\t" + word)
```

The latter block has three nested for loops. To optimize it a bit, I read
all files beforehand, split each one up into individual words, then threw
them into a set to remove the duplicates. As for the third for loop,
I _could_ call `break` right after `found = True`, but instead resorted to
the magic of list comprehension (variable names and structure taken from
pseudocode above):

```
        if not any([(word in lyrics[o]) for o in other_songs]):
            append("results.txt", song + "\t" + word)
```

I like to imagine Python optimized this one for me, but I'm not sure.
Anyway, even if it doesn't, this shouldn't be too bad. Plus, I like
one-liners.
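
For comparison, here is a different way to get the same answer without the nested loops: count how many songs each word appears in, then keep the words whose count is 1. This is only a sketch, not what the linked script does; `lyrics` is assumed to map each song title to its set of words, and the toy data is made up.

```python
from collections import Counter


def one_song_words(lyrics):
    """Return sorted (song, word) pairs for words found in exactly one song.

    `lyrics` maps a song title to the set of words in that song.
    """
    # Each song contributes a word at most once, since its words are a set.
    song_count = Counter(word for words in lyrics.values() for word in words)
    return sorted(
        (song, word)
        for song, words in lyrics.items()
        for word in words
        if song_count[word] == 1
    )


# Toy data for illustration, not real lyrics.
lyrics = {
    "Song A": {"pet", "cheetah", "the"},
    "Song B": {"nico", "the"},
}
for song, word in one_song_words(lyrics):
    print(song + "\t" + word)
```

As for the optimization question: `any()` stops at the first true value, but a list comprehension inside it builds the whole list first. Dropping the square brackets turns it into a generator expression, which can stop early.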

When splitting words, they are converted to lowercase. Punctuation marks
and suffixes like 's and 'd are removed, but I forgot to remove 've.
Fortunately there weren't many of them, so I removed them by hand.
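
A minimal sketch of that kind of normalization, assuming straight apostrophes and only the three suffixes mentioned above. The name `split_words` is borrowed from the pseudocode, the sample sentence is made up, and the real script's rules may differ.

```python
import re

# Suffixes to strip: 's, 'd, and the 've that was originally missed.
SUFFIX = re.compile(r"'(s|d|ve)$")


def split_words(text):
    """Lowercase the text and return its set of words, minus punctuation."""
    words = set()
    for token in text.lower().split():
        token = token.strip(".,!?;:\"()[]")  # drop surrounding punctuation
        token = SUFFIX.sub("", token)
        if token:
            words.add(token)
    return words


# Made-up sentence, not a lyric.
print(sorted(split_words("I've seen Nico's pet, and he'd run.")))
```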

You can read the real source code here:
[`data/one_song_words.py`](https://git.sr.ht/~fkfd/one_top_song/tree/main/item/data/one_song_words.py)

## Step 3: Dedupe

The previous step brought about a problem. The script I wrote treated
inflections as separate words, e.g. "vibe" (_Chlorine_), "vibes" and
"vibing" (_The Outside_). So I wrote a script to find most of them.

The script reports occurrences of the following inflections of `word`:

```
word + "s",
word + "es",
word + "d",
word + "ed",
word + "ing",
```

and also in reverse, if `word` is already inflected:

```
re.sub("s$", "", word),
re.sub("es$", "", word),
re.sub("d$", "", word),
re.sub("ed$", "", word),
re.sub("ing$", "", word),
```

When I ran it, it caught most of the offenders, like "vibe" vs. "vibes",
but not subtler ones like "vibing". I ended up removing them by hand
again, but it's possible I missed some.

Why didn't I just tell the script to remove the inflections automatically?
Because there were false positives. For example, "sing" (_Bandito_ and
many others) and "singed" (_Leave The City_) are not the same thing. Other
examples include "to" and "toes", "she" and "shed", "not" and "notes",
"even" and "evening", etc. Also, although some pairs are of the same
origin, they're pretty different semantically, like "weathered"
(_Chlorine_) and "weather" (_Good Day_ and _Migraine_). Leaving these
alone, I axed everything else from my list.
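
The add-or-strip check can be sketched roughly as follows. The suffix list is the one given above; the function names `inflection_candidates` and `find_pairs` are made up for illustration and are not taken from `dedupe.py`.

```python
import re

SUFFIXES = ["s", "es", "d", "ed", "ing"]


def inflection_candidates(word):
    """Return forms of `word` made by adding or stripping one suffix."""
    forms = {word + suffix for suffix in SUFFIXES}
    forms |= {re.sub(suffix + "$", "", word) for suffix in SUFFIXES}
    forms.discard(word)  # stripping a suffix the word lacks leaves it unchanged
    return forms


def find_pairs(words):
    """Report (word, other) pairs where `other` looks like an inflection."""
    words = set(words)
    return sorted(
        (w, o) for w in words for o in inflection_candidates(w) & words
    )


print(find_pairs(["vibe", "vibes", "vibing", "sing", "singed"]))
```

This reports "vibe"/"vibes" and the false positive "sing"/"singed", but never "vibing": stripping "ing" leaves "vib", which is not "vibe". That is why a human still has to review the output.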

Source code: [`data/dedupe.py`](https://git.sr.ht/~fkfd/one_top_song/tree/main/item/data/dedupe.py)

## Step 4: Manual inspection

It was at this moment that I realized I had forgotten about stuff like
"[x10]" (_Holding On To You_) that marks a repeated line. There were some
onomatopoeic words like "mm-mm" (_Choker_), too, and don't get me started
on hyphenated words: there were "treehouse" (_Forest_) and "tree-house"
(_Stressed Out_). Words like "migraine", which comes from a song titled
_Migraine_, are too easy for a game, so they are not included either.
I also capitalized proper nouns like "Monday", and removed trailing
periods and commas from every line I could find. In retrospect it could
have been easier if I sanitized the lyric files from the beginning. At
this moment there are 1,002 words left, but I don't know if there's more
to knock out. I doubt anyone will notice.

Here's a fun story: after I deployed the app (yes there'll be a web app at
the end) on r/twentyonepilots, one player reported an incorrect lyric from
_Migraine_:

> A difficult to be, stop feasting lumber-down trees

At first glance this lyric seemed unfamiliar to me, and it definitely
isn't grammatically correct. I checked multiple sources: on azlyrics of
course it's this one, but on
[Genius](https://genius.com/Twenty-one-pilots-migraine-lyrics) it says
otherwise:

> A difficult beast feasting on burnt down trees

Oops, better go check out the description from the [official
audio](https://www.youtube.com/watch?v=Bs92ejAGLdw) on Fueled By Ramen's
(tøp's label, FBR for short) YouTube channel:

> a difficult to be, stop feasting lumber down trees

And [this video at 14:40](https://youtu.be/HutQvZWJ_60?t=880) on Warner
Music Japan's channel with Japanese and English subtitles:

> 燒け落ちた木々貪り食う、気難しい野獸  
> A difficult beast feasting on burnt down trees

Well, I tried.

So to settle this the only thing I could do was find out by myself.
I grabbed [WrightP's Official Acapella
version](https://www.youtube.com/watch?v=qGLEH_VeCpE) and extracted that
bit with Audacity. I slowed it down 50%, and it sounds like this:

<audio controls src="../img/one_top_song/difficult_beast.mp3"></audio>

Let me explain what I heard:

> A difficult-a beast-a feasting-on bur- down trees

The "ng" sound between "feasting" and "on" is audible. There is no "l"
sound as in "lumber-down", and there is no /ɒ/ or /ɑ/ sound following
"st", which rules out "stop".

That settles it: Genius and Warner Music Japan are right, azlyrics and FBR
are wrong. I suspect that azlyrics got its lyrics from FBR in the first place.

Track-word pairs:
[`data/tracks_words`](https://git.sr.ht/~fkfd/one_top_song/tree/main/item/data/tracks_words)

## Step 5: Generating a dataset

Now that I have a 1000-something-line-long file of tab-separated track titles
and unique words, it's time to generate a dataset for the game. Since I'll be
producing a web game, the language is gonna be JavaScript, so the dataset
will be in JSON. The first challenge is that we need to know which line
each word came from. This way if the player fails to recall it,
we'll show them the line and they will go "hmm, yeah, Tyler really *did*
sing this". But you see, my step 2 script completely scrambled the lyrics.
So I wrote another Python script to "grep" them from the giant heap of txt
files. It was pretty easy, and moments later I had a JSON file
structured like this:

```
[
  {
    "track": "Redecorate",
    "word": "blankets",
    "lines": [
      "Then one night she got cold with no blankets on her bed",
      "Blankets over mirrors, she tends to like it"
    ]
  },
  {...},{...},...
]
```
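
The "grep" step could look something like this sketch. It assumes the lyrics are already loaded as a list of lines per track; `lines_with`, `build_dataset`, and `lyrics_by_track` are hypothetical names, and `mkjson.py` may do it differently.

```python
import json
import re


def lines_with(word, lyric_lines):
    """Return the lyric lines that contain `word` as a whole word."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return [line for line in lyric_lines if pattern.search(line)]


def build_dataset(track_words, lyrics_by_track):
    """Turn (track, word) pairs into the list-of-objects layout shown above."""
    return [
        {
            "track": track,
            "word": word,
            "lines": lines_with(word, lyrics_by_track[track]),
        }
        for track, word in track_words
    ]


# Two lines quoted from the JSON sample above, plus one made-up filler line.
lyrics_by_track = {
    "Redecorate": [
        "Then one night she got cold with no blankets on her bed",
        "A made-up line that does not match",
        "Blankets over mirrors, she tends to like it",
    ]
}
dataset = build_dataset([("Redecorate", "blankets")], lyrics_by_track)
print(json.dumps(dataset, indent=2))
```

The `\b` word boundaries keep a search for "to" from matching "toes", the same trap that showed up during deduping.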

I should try to shrink the 135kB (kilo, not kibi) dataset. First, the
prettyprint was unnecessary, so let's do away with it. It instantly went down
to 99kB. However, having everything on one line makes batch editing in vim
a huge pain, and every launch took seconds. So as a compromise I inserted
a linebreak after every word object, so for x words there would be (x+2)
lines including the brackets. 1kB well spent. The JSON file is now a neat
100kB, which is a 26% optimization compared to the initial 135kB.
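
The one-object-per-line compromise can be reproduced with `json.dumps` and compact separators. This is a sketch with placeholder lines, not the actual generator script; `dump_one_per_line` is a made-up name.

```python
import json


def dump_one_per_line(objects):
    """Serialize a list as JSON with one compact object per line."""
    rows = [
        json.dumps(obj, separators=(",", ":"), ensure_ascii=False)
        for obj in objects
    ]
    return "[\n" + ",\n".join(rows) + "\n]"


# Placeholder "lines" values; only the layout matters here.
words = [
    {"track": "Redecorate", "word": "blankets", "lines": ["..."]},
    {"track": "Chlorine", "word": "vibe", "lines": ["..."]},
]
text = dump_one_per_line(words)
print(text)
print(len(text.splitlines()), "lines for", len(words), "words")
```

The result is still valid JSON, so `json.loads` reads it back unchanged, and the line count is x+2 as described.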

However, as I was coding JavaScript I realized that, since we're using the
dataset as a JavaScript object, we don't have to play by JSON's rules.
This means no more double quotes around keys! Each word object has
6 double quotes, 6 times 1000 is… 6kB! That's right, we just shrank the
dataset to 94kB. Now that's a 30% optimization. All by frugal management
of whitespace and quotes.

Later I found it would be better if I tagged each word with its _album_, but
that would be super redundant. So instead, I placed lists of tracks in each
album inside another JS file that is loaded alongside the words.

JSON generator script: [`data/mkjson.py`](https://git.sr.ht/~fkfd/one_top_song/tree/main/item/data/mkjson.py)

Datasets: [`data/words.json`](https://git.sr.ht/~fkfd/one_top_song/tree/main/item/data/words.json),
[`words.js`](https://git.sr.ht/~fkfd/one_top_song/tree/main/item/words.js),
and [`albums.js`](https://git.sr.ht/~fkfd/one_top_song/tree/main/item/albums.js)

## Step 6: Design the game UI

I thought I despised the "mobile first" approach, but it turns out what I hated
was the "mobile only" garbage. [Mobile Wikipedia] actually works remarkably
well on desktop. What I'm doing is so much simpler than Wikipedia. The page
contains the following fundamental elements:

- the word
- textbox for user input
- candidate list
- controls

I swear, the desktop version works just as smoothly as on mobile (although
I failed to center a few elements).

![Desktop UI](img/one_top_song/ui_desktop.png)

![Mobile UI](img/one_top_song/ui_mobile.png)

▲ Notice that "twenty øne piløts" are joined with non-breaking spaces

And my absolute favorite thing here is the candidate list. I wouldn't expect
anyone to type "House Of Gold" in its entirety, would I? Of course there should
be some sort of search suggestion. The candidate list I implemented tries to
match user input against the beginning of each song title, as well as acronyms.
For example "hot" gives you _Holding On To You_. A hack was written for
_Heavydirtysoul_ so that "hds" would match it.
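
The matching rule can be sketched like this. The game itself is written in JavaScript; this Python version only illustrates the logic, and `acronym`, `suggest`, and `EXTRA_ACRONYMS` are made-up names.

```python
def acronym(title):
    """First letter of each word: "Holding On To You" -> "hoty"."""
    return "".join(word[0] for word in title.lower().split())


# Hypothetical special case for a one-word title that fans abbreviate.
EXTRA_ACRONYMS = {"Heavydirtysoul": "hds"}


def suggest(query, titles):
    """Return titles whose beginning or acronym starts with the query."""
    query = query.strip().lower()
    if not query:
        return []
    matches = []
    for title in titles:
        keys = [title.lower(), acronym(title)]
        if title in EXTRA_ACRONYMS:
            keys.append(EXTRA_ACRONYMS[title])
        if any(key.startswith(query) for key in keys):
            matches.append(title)
    return matches


titles = ["Holding On To You", "House Of Gold", "Heavydirtysoul", "Truce"]
print(suggest("hot", titles))
print(suggest("hds", titles))
print(suggest("ho", titles))
```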

Oh, I almost forgot: the three buttons are twenty øne piløts-themed.

![The classic |-/ logo: blue vertical bar, black dash, and red slash](img/one_top_song/top_logo.png)

▲ Former tøp logo from the Regional at Best era

## Step 7: Game logic

From this point there are no more repetitive chores, and I can finally focus on
making a game. The concept is simple: the player tries to guess the song
that a word came from.

Let me enumerate the steps in which the player would interact with my game:

1. Game shows random word taken from dataset
2. Player types track title into textbox, confirms
3. Game indicates correct answer, shows album and line
4. Player clicks Next, go to 1

The player might not be always right. In that case the flow would be:

1. Game shows random word taken from dataset
2. Player types track title into textbox, confirms
3. Game indicates wrong answer
4. Player tries again, go to 2; or clicks Next, go to 1

We need some hint mechanism so a clueless player has a chance of recalling
something.

1. Game shows random word taken from dataset
2. Player does nothing, or makes incorrect guesses
3. Player clicks Hint
4. Game reveals some information about correct answer unless hints are
   depleted. Go to 2

I wanted this game to be as pressure-free as possible. Therefore, players
can skip words or show the answer at any time, and there are no
scorekeeping counters or timers. After every 50 guesses, the game reminds
players to take a rest.
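
The three flows above boil down to a small piece of state per word. Here is a sketch in Python rather than the game's JavaScript; the `Round` class and the hint text are hypothetical, since the post does not say what the hints reveal.

```python
import random


class Round:
    """One word to guess; hints come out one at a time until depleted."""

    def __init__(self, entry, hints):
        self.entry = entry        # {"track": ..., "word": ..., "lines": [...]}
        self.hints = list(hints)  # hypothetical hint texts
        self.solved = False

    def guess(self, title):
        """Return True if the guess is right; wrong guesses may be retried."""
        correct = title.strip().lower() == self.entry["track"].lower()
        self.solved = self.solved or correct
        return correct

    def hint(self):
        """Reveal the next hint, or None once hints are depleted."""
        return self.hints.pop(0) if self.hints else None


dataset = [
    {
        "track": "Redecorate",
        "word": "blankets",
        "lines": ["Blankets over mirrors, she tends to like it"],
    }
]
current = Round(random.choice(dataset), hints=["The title starts with R"])
print(current.entry["word"])        # step 1: show the word
print(current.guess("Truce"))       # wrong answer, the player may try again
print(current.hint())               # reveal a hint
print(current.guess("redecorate"))  # correct answer
print(current.entry["lines"][0])    # show the line
```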

Source code: [`index.js`](https://git.sr.ht/~fkfd/one_top_song/tree/main/item/index.js)

## Step 8: Debugging

The game was designed to run offline. The server, if any, is there just to
send you the HTML, stylesheet, and JavaScript for datasets and the game
itself. This means it is possible to do everything in a `file://` browser
tab.

Because the web game is designed "mobile first" (but in a good way),
I tested the UI extensively with and without DevTools mobile emulator, and
on my phone. This way I figured out what interactions worked best on both
keyboard and touchscreen.

As to the JavaScript, I did not exactly enjoy writing it, but it wasn't
hellish suffering either. I no longer "hate" JavaScript; I just want to
stay away from it from now on. I would describe my code as *pretty*
type-safe… until it isn't.

## Step 9: Visualizing and having fun with the dataset

No, it's not about fancy charts or scatter plots. I just thought it would be
helpful if we could display all the words in a table, so I made a webpage
for that. Fun fact: I gave up on indenting all the `<tr>` tags.
Otherwise there would be 28 bytes \* 1002 rows ≈ 28kB of wasted data.

![Table of a few tracks, words that only appear in each one, and respective
lines](img/one_top_song/words_all.png)

Then I thought, "hey, what if I pulled up a list of most frequently used
English words and compared that to those I found?" So I downloaded a list from
Wiktionary titled [Frequency
lists/TV/2006/1-1000](https://en.wiktionary.org/wiki/Wiktionary:Frequency_lists/TV/2006/1-1000)
which is the top 1000 words used in "a collection of TV and movie
scripts/transcripts" as of 2006. This time though, I made more use of Unix
tools. It worked like this (the 1000-word list was saved in file `1000`):

```
$ cut -f2 tracks_words | sort > /tmp/top  # extract word from "track<tab>word"
$ sort 1000 > /tmp/freq
$ comm -12 /tmp/top /tmp/freq  # find common words between the two files
ahead
anybody
anyway
...
```
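
For anyone who prefers Python over coreutils, the same comparison is a set intersection. This is a sketch with tiny made-up inputs; the real inputs would be the contents of `tracks_words` and `1000`.

```python
def common_words(track_word_lines, frequent_words):
    """Words that appear in both the track<tab>word list and the frequency list."""
    unique = {
        line.split("\t")[1] for line in track_word_lines if "\t" in line
    }
    return sorted(unique & set(frequent_words))


# Tiny made-up inputs for illustration.
track_word_lines = ["Song A\tahead", "Song B\tcheetah", "Song C\tanyway"]
frequent_words = ["ahead", "anybody", "anyway", "the"]
print(common_words(track_word_lines, frequent_words))
```

Unlike `comm`, this does not need either input sorted beforehand.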

And here we have the most frequent 88 words:

![Table of a few words, and the track they are in](img/one_top_song/words_freq.png)

I ran some more stupid analysis on the dataset and found that the only
song that had absolutely no unique word is _Truce_ (a bad day to the _Truce_
fans out there, eh?), and songs closest to zero are _Before You Start Your
Day_ and _Trees_, contributing 2 each. The figures go all the way up to 51:
_Neon Gravestones_, which is basically a rapped-out essay, has the most
expansive vocabulary among all tøp songs. I wrote all my interesting findings
in the trivia section for players to discover.
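
Counting unique words per track takes only a few lines. This sketch assumes the tab-separated format of `tracks_words` and a separate list of all track titles, so that tracks with zero unique words still show up; the data here is made up.

```python
from collections import Counter


def words_per_track(track_word_lines, all_tracks):
    """Count unique words per track, including tracks that have none."""
    counts = Counter(line.split("\t")[0] for line in track_word_lines)
    return {track: counts[track] for track in all_tracks}


# Made-up pairs for illustration; the real input is the tracks_words file.
pairs = ["Song A\tpet", "Song A\tcheetah", "Song B\tnico"]
counts = words_per_track(pairs, ["Song A", "Song B", "Song C"])
print(sorted(counts.items(), key=lambda item: item[1]))
```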

The scripts I used to generate HTML:
[`data/mkhtml_all.py`](https://git.sr.ht/~fkfd/one_top_song/tree/main/item/data/mkhtml_all.py),
and [`data/mkhtml_freq.py`](https://git.sr.ht/~fkfd/one_top_song/tree/main/item/data/mkhtml_freq.py)

The HTML: [`words.html`](https://git.sr.ht/~fkfd/one_top_song/tree/main/item/words.html)

## Step 10: Deployment

The only thing it takes to deploy a static website is `scp`. `rsync` if
you have lots of data. Let's calculate how much data we have to transfer.

File        | Size (kB)
------------|------------
index.html  | 9.6
words.html  | 88
index.css   | 1.7
index.js    | 6.5
words.js    | 94
albums.js   | 2.3
img/\*.jpg  | 115.4
__Total__   | __317.5__

Incidentally, this is how much my game will consume from a player's data
plan. *I* think it's small enough for anyone.

## Results

On April 19, 2022,
I [published](https://www.reddit.com/r/twentyonepilots/comments/u68pzy/this_word_only_appears_in_one_twenty_%C3%B8ne_pil%C3%B8ts/)
a version I thought was stable enough to r/twentyonepilots. It got
reasonably popular. You can play it here:
[One tøp song](https://fkfd.me/toys/one_top_song/)

Here's a demo video (2.0 MiB):

<video controls> <source src="../img/one_top_song/demo.mp4" /> </video>

The source code (MIT) is [here](https://git.sr.ht/~fkfd/one_top_song). If
you want, you can download lyrics to your favorite artists' songs and
generate your own dataset to play with. A redditor considered Taylor
Swift, and I'm looking forward to their progress.

In conclusion, I think I did a pretty good job at extracting,
representing, and toying with data, but the process left a lot to improve.
NLP connoisseurs are gonna be mad at me for not using this and that
library, and some Unix guru might be capable of rewriting my Python
scripts with sed, awk, and jq. I do not care. The final product is one of
my better interactive web designs, made with no framework and minimal
assets. The game is not designed to be addictive, unlike
$insertGameNameHere. It is, after all, just for fun; in the disclaimer
I wrote that the game is "not a tool for gatekeeping." That's how things
are supposed to work.