Song Section Prediction using the Billboard Data Set


Coming into my second semester as a Master's student in the Music Technology program at McGill University, I was eager to use the Billboard Data Set, created and curated by the lab to which I belong, together with my recently honed Machine Learning skills, to perform some musicological or compositional task. In the previous semester, I had some success creating an automatic chord progression generator using an extension of a Hidden Markov Model in conjunction with David Temperley's Rock Corpus data set. Like David Temperley, I like to use probabilistic models for musical analysis of large symbolic data sets, and I think probabilistic representations are flexible enough to represent underlying musical structures accurately. My graduate supervisor, Ichiro Fujinaga, was happy to supervise me in this project. Doina Precup, the professor who had taught the graduate-level Machine Learning class I took in the previous semester, was gracious enough to co-supervise.

Framing the Problem

The original idea was this:

Given a sequence of chord labels from any part of a randomly selected song, could a computer identify the type of section in which the chords occurred?

Part of the reasoning was that the data set provides all of the information necessary to make this determination, over a large set of songs. The Billboard set includes both named section (e.g., "verse" or "chorus") and lettered section (e.g., "A", "B'") information, along with the timing of the sections.

Later we'll see that the definition of this problem needed to be reworked, and simplified, in order to create a more realistic machine learning task.

Digging Into the Data

The first step, as advised by Ichiro, was to find out whether humans can perform the task. The motivation for this is to set a benchmark that my algorithm should aspire to attain, or to surpass. For example, if humans can estimate the song section from a random chord progression 50% of the time, then my goal should be to make a program that surpasses that. If, however, it's more like 10% of the time, then the task might be impossible.

I began to go through the songs and to test whether I could identify the song section by selecting a random chord in a song and guessing the section in which it occurred. The problem was that I had a human-readable song format, and I couldn't hide all of the other information visible in the file while trying to isolate the task of estimating the song section. For example, each song is organized with one harmonic phrase per line. This presents a problem, because you automatically know what part of the phrase the randomly selected chord is in just by looking at it in the song file. Similarly, the section information is also visible. To remedy these problems, and to create a test that more closely follows the scientific method, I decided to create a website that randomly selects a song, as well as a random index in that song, and presents the user with a chord progression starting at that index. The user can then guess the song section, and I can record this data in order to compare it to my future algorithm's performance. This web tool is located here. I'll be developing it continuously as the semester progresses.
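The selection step of that test can be sketched in a few lines of Python. This is only an illustration of the idea, not the code behind the web tool: the function name random_excerpt, the excerpt length, and the toy data layout (a list of chord/section pairs per song) are all assumptions of mine.

```python
import random

# Toy stand-in for the parsed data set: song id -> list of (chord label, section name).
# The layout and the chords are illustrative assumptions, not the real file format.
TOY_SONGS = {
    "0003": [("C:maj", "intro"), ("G:maj", "verse"), ("C:maj", "verse"), ("A:maj", "chorus")],
    "0004": [("D:maj", "intro"), ("A:maj", "verse"), ("G:maj", "chorus")],
}

def random_excerpt(songs, length=8, rng=random):
    """Pick a random song and start index; return the excerpt and its hidden answer."""
    song_id = rng.choice(sorted(songs))
    chords = songs[song_id]
    start = rng.randrange(len(chords))
    window = chords[start:start + length]
    labels = [chord for chord, _ in window]   # what the test subject sees
    answer = window[0][1]                     # section of the first chord shown
    return song_id, start, labels, answer

song_id, start, labels, answer = random_excerpt(TOY_SONGS, length=2)
print(song_id, start, labels, "(hidden answer: %s)" % answer)
```

Only the chord labels would be shown to the person being tested; the section name stays hidden until the guess is recorded.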


In order to decide which machine learning algorithm to use, Professor Precup recommended that I first compute statistics that I thought were relevant to the problem. In order to get the ball rolling, I decided to simply compute the chord distributions for each possible section over the entire song set, as they exist in their original format.
To start this process, I first computed all of the possible song section names in the entire data set. Unfortunately, they weren't all standardized, so some post-processing was necessary to really represent the true chord distributions of common sections. The list of possible section names is as follows:

possibleSections = ['A', 'intro', ' ', 'B', 'verse', 'interlude', 'chorus', 'outro', 'C', 'solo', 'D', 'fadeout', 
		 'instrumental', 'J', 'pre-chorus', 'E', 'bridge', "B'", "A'", 'coda', 'pre-verse', "C'", "D'", 'trans', 
		 'transition', 'instrumental break', 'fade out', 'fadeout', 'pre chorus', 'F', 'G', 'main theme', 
		 '(secondary) theme', "B''", "A''", 'verse one', 'verse two', 'verse three', 'verse four', 'verse five',
		 'key change', 'H', 'vocal', ' intro-a', 'intro-b', "E'", 'I', 'fadein', 'prechorus', "C''", 'K', "A'''",
		 'solo', "B'''", 'fade in', 'outro', 'secondary theme', "D''", "D'''", 'spoken verse', "H'", 'pre-intro',
		 'spoken', 'intro b', "D''''", 'intro a', 'ending', 'chorus a', 'chorus b', 'Z', "B''''"]

You can see that this includes both the named sections and the lettered sections, as I mentioned previously. I also created the set of only named sections:

possibleSectionsNamesOnly = ['intro', 'verse', 'interlude', 'chorus', 'outro', 'solo', 'fadeout', 'instrumental',
		'pre-chorus', 'bridge', 'coda', 'pre-verse', 'trans', 'transition', 'instrumental break', 'fade out', 
		'fadeout', 'pre chorus', 'main theme', '(secondary) theme', 'verse one', 'verse two', 'verse three', 
		'verse four', 'verse five', 'key change', 'vocal', 'intro-a', 'intro-b', 'fadein', 'prechorus', 'solo',
		'fade in', 'outro', 'secondary theme', 'spoken verse','pre-intro', 'spoken', 'intro b','intro a', 'ending',
		'chorus a', 'chorus b']

The problem with this list, even when we isolate the named sections, is that there are still some section names that could obviously be grouped. For example, intro-a and intro-b could most likely both be considered to be an intro section. It is likely that the transcriber labelled them as such because a single song had two different forms of an intro, and he or she wanted to retain that distinction in the format. For my purposes, however, the fact that an intro happens twice in the same song, and has different forms is not as important as is grouping the section names into similar-function section groups.


I used a set of Python parsers that a colleague, Alastair Porter, had created for the Billboard set (they are hosted here on GitHub). Unfortunately, the section information was not automatically included in the output. Also, they were developed for a previous format of the Billboard data set. Therefore, some changes needed to be made. However, they were a boon for parsing the chord labels automatically. Thanks, Alastair! After tweaking the parsers, I successfully parsed the data set into JSON files. For every song, there is a chords.json file, as well as a sections.json file. The following is the original Billboard Data Set format for the song "Maybe I'm Amazed" by Paul McCartney, with the corresponding parsed output for chords and sections:

Original Song File
Chord Labels

The songs themselves were all given numbers from 0001-1000. The folder that contains all of the files that represent the transcription for that song, as well as all its meta-data is named with its given number. When I reference songs, I will often be using only this unique number identifier, as it would be cumbersome to look up the song name for each reference, especially when the song name is not important to the task at hand.


To check the generated chord distributions, I chose two songs and tallied their chord distributions by section by hand. The first song was simple (0003, James Brown, I Don’t Mind), and my generated results were exactly the same as my handmade results. The second example (0004, The Beginning of the End, Funky Nassau) was much more complex, with phrase repeat signs, as well as silent bars and vamp sections. When I compared my handmade tally with the generated results, there were two discrepancies. Upon further review, I realized I had made a mistake in my handmade distribution (corrected version), and that the generated result was, in fact, correct!


Given these json representations of the data, as well as their timing information, I was able to line up each song section with every chord contained in that section for every type of song section over every song. I computed the chord distributions for every section type. Here are the chord distributions for the main types of song sections:

It's evident that the four most frequent chords are consistent with the frequency distribution of chords over all songs from Ashley Burgoyne's ISMIR 2011 paper, "An Expert Ground-truth Set for Audio Chord Recognition and Music Analysis".

[Figure: Chord frequency distribution over all songs]
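The step of lining up chords with sections by their timing can be sketched roughly as follows. I am assuming here that each section is reduced to a (start time, name) pair and each chord to a (time, label) pair; the actual layout of chords.json and sections.json may differ, and the function name chord_counts_by_section and the toy values are made up for illustration.

```python
from bisect import bisect_right
from collections import Counter, defaultdict

def chord_counts_by_section(sections, chords):
    """Tally chord labels per section name.

    sections: list of (start_time, section_name), sorted by start_time (assumed layout)
    chords:   list of (time, chord_label) (assumed layout)
    """
    starts = [start for start, _ in sections]
    counts = defaultdict(Counter)
    for time, label in chords:
        idx = bisect_right(starts, time) - 1   # last section starting at or before this chord
        if idx >= 0:                           # skip chords before the first section
            counts[sections[idx][1]][label] += 1
    return counts

# Toy example, not taken from the data set.
toy_sections = [(0.0, "intro"), (10.0, "verse"), (30.0, "chorus")]
toy_chords = [(0.0, "C:maj"), (5.0, "G:maj"), (10.0, "G:maj"),
              (20.0, "C:maj"), (25.0, "G:maj"), (30.0, "A:maj")]
toy_counts = chord_counts_by_section(toy_sections, toy_chords)
print(toy_counts["verse"].most_common())
```

Summing these per-song counters over every song gives one distribution per section type.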

Chord Distribution Analysis

What is interesting about the section-specific distributions is that all of the different sections have the same 4 chords as the most frequent chords per section, but the order of the main 4 chords is not consistent.

Intro:   (C, G, A, D)
Verse:   (G, C, D, A)
Chorus:  (A, G, D, C)
Bridge:  (G, A, C, D)
Outro:   (A, D, G, C)

This could be due to a few different things:

  1. For one, the repetition of a phrase within a section is counted separately from its previous occurrence. The chord frequency distributions count every individual chord that is played, even if it is part of a chord phrase repeating many times, say, for a guitar solo. I would argue that a long sequence of exact repetitions does not reflect the true chord distribution of a song. In fact, a repetition of a given phrase, unless found at a different part of the song, is not relevant to the form of the song. In order to investigate this hypothesis further, I decided to look for long sequences of chord phrase repetitions in the data set. Luckily, the Billboard format has a shorthand for repeated phrases.
    If a phrase is immediately repeated N times, the phrase is written once with xN at the end of it. Using a bit of bash scripting to find all of the places with significant repetition (10 to 99 repeats) over all the songs, I discovered that the label A:maj is indeed repeated many times.

    # Search every song file for repeat markers x10 through x99
    for (( i=10; i<100; i++ ))
    do
        grep -r -i "x$i" ./
    done

    This query provided some very interesting results. In observing simply the highest number of repetitions for any phrase in any song, I found these as the top 4:

    ./0194/salami_chords.txt:137.482630385  A, solo, | A:maj | x35, (guitar)
    ./0251/salami_chords.txt:155.892607709  fadeout, | Bb:min7 | x50, (saxophone), organ)
    ./0194/salami_chords.txt:260.263741496  A, outro, | A:maj | x60, (guitar)
    ./0194/salami_chords.txt:423.614693877  A, outro, | A:maj | x71, (voice)

    It is pretty clear that the repetition of these phrases has a huge effect on the outro section distribution. In a single song, "Willie And The Hand Jive" (song 0194 as seen in the bash output) by Eric Clapton, the outro section is played 3 separate times, all of which have A Major as their repeated harmony. The first time, A Major is repeated 60 times while the guitar is vamping; the second time, the vocals come in and A Major is repeated 71 times; and finally the guitar comes back in and A Major is repeated 23 times to end the song. That's a total of 154 repetitions of the A Major chord in a single song, over a single section type! Of course, then, the frequency of A Major occurring in the outro section over all songs will be skewed. This, however, doesn't explain why the chorus section has A Major as its most frequent chord.

  2. Another feature of the song set that could skew the data is that not all the songs are in the same key. If we really want to compare the distribution of chords across all songs, we should compare chords that serve the same function relative to their key. For example, if A Major happens 154 times in a song, and that song is in the key of A Major, we should instead count that chord as a tonic chord (I), so that we get an idea of the function of these chords in their respective songs. I decided to analyze how long each key lasts in each song. The Billboard data set only annotates the key root of each song, specified simply by the tag tonic. The tonic, therefore, could be either minor or major, assuming that there is a statistically insignificant number of mode-based pop songs. Furthermore, the tonic can change in the middle of a song, so I needed to measure how long each key tonic prevails within each song. In order to gauge what sort of chords we should find (for example, I suspect we'll find a lot of tonic chords in all songs), I segmented every song at each tonic change. For each of these tonic segments, I accumulated the durations, in seconds, of the labelled key tonic over all songs. Here are the results:

    Key tonic   Duration (seconds)
    D           21765.5
    E           17071
    C           16342
    A           15326.6
    G           15145
    F           11223.8
    Ab          8487.62
    B           8403.65
    Bb          7300.25
    Eb          6475.45
    Db          5152.44
    F#          2279.42
    C#          1955.05
    Gb          617.277
    Cb          29.1722

    This still doesn't explain why A Major would be at the top of the chorus's chord frequency chart, but perhaps a look into the key-relative chords in conjunction with these tonic durations will give us more insight into that.
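The two tallies discussed in the list above can be sketched in Python. This is a rough illustration under my own assumptions, not the scripts I actually ran: the helper names, the rule that any token containing ":" is a chord label, and the (tonic, start, end) segment layout are all hypothetical. The third outro line is reconstructed from the description of song 0194 rather than copied from the file.

```python
import re
from collections import Counter, defaultdict

# Matches the xN shorthand that follows the last bar line, e.g. "| A:maj | x60".
REPEAT_MARKER = re.compile(r"\|\s*x(\d+)")

def count_chords(line, expand_repeats=True):
    """Count chord labels on one annotation line, optionally honouring xN."""
    match = REPEAT_MARKER.search(line)
    repeats = int(match.group(1)) if (match and expand_repeats) else 1
    counts = Counter()
    for bar in line.split("|")[1:-1]:      # text between the first and last bar line
        for token in bar.split():
            if ":" in token:               # assumption: chord labels look like "A:maj"
                counts[token] += repeats
    return counts

def tonic_durations(segments):
    """Sum seconds per tonic over (tonic, start, end) segments (assumed layout)."""
    totals = defaultdict(float)
    for tonic, start, end in segments:
        totals[tonic] += end - start
    return dict(totals)

outro_lines = [
    "A, outro, | A:maj | x60, (guitar)",
    "A, outro, | A:maj | x71, (voice)",
    "A, outro, | A:maj | x23, (guitar)",   # reconstructed for illustration
]
expanded = sum((count_chords(l) for l in outro_lines), Counter())
collapsed = sum((count_chords(l, expand_repeats=False) for l in outro_lines), Counter())
print(expanded["A:maj"], collapsed["A:maj"])
```

Comparing the expanded and collapsed counts shows how much a single vamp can weigh on a section's distribution.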

Reducing the Data

Grouping section names

The song structure was labelled by expert musicians; however, the labels themselves were not formalized before the transcriptions took place. Therefore, the format of these section labels differs from transcriber to transcriber. It is necessary, then, to normalize and reduce this set of possible section names. The process for this was relatively straightforward. In general, I made mappings where they were obvious, and where there were doubts about certain section names, I weighed the statistical impact before removing or absorbing them.

As I mentioned before, the Billboard Data Set provides both section names and section letters. The names provide the common nomenclature for popular music song structure, such as "verse" and "chorus", while the letters provide information such as section similarity, and can be assigned irrespective of common nomenclature. For example, an "A" section could represent an "intro" section or a "verse" section (or any other section that starts the song, even a "chorus"). The lettered sections are especially useful for identifying section repetitions: the ' symbol following a lettered section means that the current section repeats the previous occurrence of that lettered section, but with some minor changes. This occurs often when a verse is repeated, for instance, and one or more chords are changed. That verse would represent the same section even though it has slightly different harmonic material. Because of this lettered labelling, it would be redundant to incorporate repetition information in the named section labels as well. This leads me to the first example of how section names can be reduced.

The section key change only happens in song 0071, as do verse one, verse two, verse three, verse four, and verse five. This was discovered using the command

grep -r -i --exclude=chords.json --exclude=metadata.json "$section_name" ./

over the whole data set of labelled songs. I decided that this would be a good way to discover how many times a section occurs over the entire data set, for every contentious section label. grep is a perfect tool for this, because the number of lines on which a section label occurs in every song directly relates to the number of times that section occurs. I iterated through every possible song section, and created a yml file to map the section names to their corresponding group. I also listed the specific set of songs in which a misnamed section could be found. Many times there were labels which were only used 2-5 times in the entire data set. I also categorized those sections which would most likely not be of use for harmonic analysis and assigned them to the section group named X. Here are the comments which describe the sections that were absorbed or eliminated:

#"main theme" occurs only in two songs, in 0056, and 0251
#"(secondary) theme" occurs only in one song: 0056
#"secondary theme" only occurs in  0251
#"interlude" same as instrumental / instrumental break?  Probably should keep this one
#these next 6 ("verse [one-five]", "key change") all occur in a single song: 0071
#"key change" is irrelevant because we have "tonic" changes labelled already
#"vocal" should not be a section name
#intro-a and intro-b only occur in: 0085, 0803
#"fadein" only occurs once, in 0086
#"fade in" only occurs in 0218 and 0620
#"spoken" occurs only in 0590, and just hangs on 1 chord
#"spoken verse" only occurs once, and it has harmony: in 0484
#"pre-intro" occurs in: (0555, 0572, 0590, 0600, 0605), but for the most part has no harmony- exclude it
#"intro a" occurs only in 0599 and 0629
#"intro b" occurs only in 0629
#"ending" occurs only in 0687, has an "N" label and a "&pause" label
#"chorus a" occurs only in 0864
#"chorus b" occurs only in 0864
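
The same bookkeeping could also be done in Python instead of grep. The sketch below assumes the section names of each song have already been read into a dict keyed by song number; the function names and the toy data are hypothetical, and only the song numbers for the theme labels follow the comments above.

```python
from collections import defaultdict

def songs_per_label(song_sections):
    """Map each section label to the set of songs in which it occurs."""
    found = defaultdict(set)
    for song_id, labels in song_sections.items():
        for label in labels:
            found[label.strip()].add(song_id)
    return found

def rare_labels(found, max_songs=2):
    """Labels that occur in at most max_songs songs, with the songs listed."""
    return {label: sorted(ids) for label, ids in found.items() if len(ids) <= max_songs}

# Toy data: the "intro" and "verse" entries are filler for illustration.
toy = {
    "0056": ["intro", "main theme", "(secondary) theme"],
    "0251": ["intro", "main theme", "secondary theme"],
    "0003": ["intro", "verse"],
}
found = songs_per_label(toy)
print(rare_labels(found, max_songs=2))
```

Labels that turn up in only a handful of songs are the candidates for being absorbed into another group or assigned to X.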

You can download the mappings file here.

The result of these mappings is a much more concise and consolidated set of section names. I believe these represent the possibilities in song sections for popular music quite well.

[intro, verse, interlude, chorus, outro, solo, fadeout, instrumental, pre-chorus, bridge, coda, pre-verse, transition, X]
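
Applying such a mapping is simple once it is loaded. The sketch below uses a plain Python dict as a stand-in for the yml file, whose exact layout is not shown here. The entries are a partial, illustrative guess based on the comments above, not a copy of the real mappings file, and normalize_section is a hypothetical helper name.

```python
# Partial, illustrative stand-in for the mappings file (assumed entries).
SECTION_GROUPS = {
    "intro-a": "intro", "intro-b": "intro", "intro a": "intro", "intro b": "intro",
    "verse one": "verse", "verse two": "verse", "verse three": "verse",
    "verse four": "verse", "verse five": "verse",
    "chorus a": "chorus", "chorus b": "chorus",
    "pre chorus": "pre-chorus", "prechorus": "pre-chorus",
    "fade out": "fadeout",
    "trans": "transition",
    # Labels judged not useful for harmonic analysis go to the group X.
    "key change": "X", "vocal": "X", "pre-intro": "X",
}

def normalize_section(name):
    """Return the section group for a raw named-section label."""
    key = name.strip()                 # some raw labels carry stray spaces, e.g. ' intro-a'
    return SECTION_GROUPS.get(key, key)

print([normalize_section(n) for n in [" intro-a", "verse three", "chorus", "key change"]])
```

Names that are already in the consolidated set pass through unchanged, so the function can be applied to every named section in the data set.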

Relating the Chord Labels

What would be most useful for the chord harmonies is to use the key tonic labels and to express each chord relative to the key tonic in which it is found. That way, we can compare chords across songs on an equal footing. Some notes on the tonic labels:

  - The tonic is given for every part of every song.
  - Tonics can change within a song.
  - The mode, however, is omitted, which means that the chord distributions will mix songs in major and minor modes. This is not ideal, since a minor mode's tonic lies a minor third below its relative major mode's tonic. Thus, the two are not harmonically equivalent. We'll revisit this later.

|TODO| Roman numerals

chord_qualities = ['maj', 'minmaj7', 'min', 'dim', 'hdim', '5', '7', '9', '11', 'aug', 'sus']
quality_shorthand = ['', 'mM7', '', 'o', 'x', 'p', 'd', 'd', 'd', 'a', 's']

I reduced the chord qualities in my data set to a smaller set: aug, dim, maj, min, dominant, ...

Given the distribution of key-relative chords, we can now revisit the unexpected result in the chorus section's chord distribution: the most common keys are D and E, but the chorus section's most frequent chord is A. Looking at the Roman numeral distribution, we see that A is the fifth (V) of the most common key (D), the fourth (IV) of the second most common key (E), and the tonic (I) of the fourth most common key (A). This could explain why we see that chord so often.

|TODO| Compute percentages for the (Roman numeral) chord distribution graphs, then percentages for the tonic duration graphs, and then multiply the percentages together. This isn't a perfect representation, because the keys could change between sections.

|TODO| Compute the percentage of songs that stay in the same key for the whole song.
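
The key-relative relabelling itself can be sketched as below. This is a minimal illustration under my own assumptions, not the conversion code I used: the function names are hypothetical, labels are assumed to have the form root:quality, only '#' and 'b' accidentals are handled, and every degree is written in upper case because the data set gives no mode.

```python
# Semitone offsets of the natural note names.
PITCH_CLASS = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}

# Scale-degree names by semitone distance above the tonic (mode-agnostic).
DEGREES = ["I", "bII", "II", "bIII", "III", "IV", "#IV", "V", "bVI", "VI", "bVII", "VII"]

def pitch_class(note):
    """Convert a note name such as 'A', 'Bb', or 'F#' to a pitch class 0-11."""
    pc = PITCH_CLASS[note[0]]
    for accidental in note[1:]:
        if accidental == "#":
            pc += 1
        elif accidental == "b":
            pc -= 1
        else:
            raise ValueError("unexpected accidental in %r" % note)
    return pc % 12

def relative_chord(label, tonic):
    """Rewrite a chord label such as 'A:maj' relative to the given key tonic."""
    root, _, quality = label.partition(":")
    degree = DEGREES[(pitch_class(root) - pitch_class(tonic)) % 12]
    return "%s:%s" % (degree, quality) if quality else degree

# The same A major chord plays a different role in each of these keys.
print([relative_chord("A:maj", tonic) for tonic in ["D", "E", "A"]])
```

Special labels that have no root, such as the "N" (no chord) label, would need to be filtered out before calling relative_chord.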

Converting the Durational Unit from Seconds to Beats

Computing Bars Per Phrase