Lately I’ve been thinking a lot about changes in pop music over time. In particular, I wanted to get a feel for whether or not pop music in the United States is getting less intelligent (a common criticism from older folks). To test this claim, I defined “less intelligent” as having vocabulary that is less eloquent and less varied. There’s a lot to be learned from pop’s vocabulary, since it reflects the culture of a large group of Americans. Calling pop music unintelligent is a stone’s throw from calling the American population unintelligent.
After rummaging through the web, I found this Huffington Post article, which displays common words in pop lyrics over time. I think the data says a lot about how American rhetoric has changed over time. In particular, pop lyrics have seen a rise in words relating to sex, violence, and drugs. I wanted to contribute to the conversation with a more general study focused on lyrical intelligence, rather than rhetoric. I found this article by William Briggs, which boldly states that music has gotten much stupider. He looks at the ratio of unique words to total words in pop music and uses that as evidence of pop’s growing stupidity.
Eager to procrastinate on my undergraduate thesis work, check Briggs’s results, and write some Python, I set out to perform a similar study using Top 40 hits since 1950. Briggs doesn’t say explicitly where his data comes from, but it appears to be roughly the same sample. Top40 Charts lists all of the top 40 hits from 1950 on, and each year’s page lists all of the top 40 in a simple table. I could easily grab all of them with a quick loop and some help from Beautiful Soup. From there, I threw the artists and titles into MongoDB with PyMongo as my driver.
So I had roughly all of the top 40 hits, with the exception of some noise, since a handful of the entries I scraped had typos or other fuzz. Now I had to get the lyrics. This one was tougher: I couldn’t find an accessible and free API for song lyrics, and I wasn’t looking forward to building a smart scraper that could traverse Google searches. It turns out lyrics.wikia.com has a lot of lyrics, and the page layout is consistent across lots of songs. The only issue here was that the URLs for each song were very specific, and any small discrepancy would lead to a 404. For eras like the 1950s, when big band standards were covered repeatedly by artists that could be named “So-and-So”, “So-and-So & The Orchestra”, “So-and-So And His Orchestra”, etc., this made scraping difficult. I helped the process along by cleaning up titles and artists with a handful of regular expressions, for example removing featured names and phrases in parentheses. In the end, this is the breakdown I got for successfully scraped lyrics. I figured it would be enough for a reliable analysis.
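As a rough illustration of that kind of cleanup, here is a minimal sketch. The patterns and the `strip_noise` helper are my own assumptions for this example, not the exact regular expressions used in the project, and they will occasionally over-trim (a title that really contains the word "feat" followed by more text would be cut short).

```python
import re

# Hypothetical patterns for illustration; the project's real regexes may differ.
# Matches a parenthesized or bracketed phrase plus any whitespace before it.
PAREN = re.compile(r"\s*[(\[][^)\]]*[)\]]")
# Matches "feat.", "ft.", or "featuring" and everything after it.
FEAT = re.compile(r"\s+(?:feat\.?|ft\.?|featuring)\s+.*$", re.IGNORECASE)


def strip_noise(text):
    """Remove parenthetical phrases and featured-artist credits."""
    text = PAREN.sub("", text)
    text = FEAT.sub("", text)
    # Collapse any leftover runs of whitespace.
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    for raw in ["Crazy in Love (feat. Jay-Z)", "Nelly Featuring Kelly Rowland"]:
        print(repr(raw), "->", repr(strip_noise(raw)))
```

Running the cleaned artist and title through the lookup, and falling back to the raw strings when the cleaned version still returns a 404, is one way to recover a few more songs.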
From here, it wasn’t so hard to sanitize the lyrics and compare across years. First, I explored the number of unique words per song, the total words per song, and the ratio between the two. My results confirmed what Briggs had described: unique words and total words have risen over time, but the ratio has gone down significantly. This would suggest that pop lyrics have in fact gotten less eloquent.
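The three per-song measurements are simple to compute. The sketch below assumes a very basic sanitization step (lowercase everything, keep only letters, digits, and apostrophes); the `tokenize` and `vocabulary_stats` names are hypothetical, and the actual cleanup in the project may be stricter or looser.

```python
import re

# Assumed tokenization rule: runs of letters, digits, and apostrophes.
WORD = re.compile(r"[a-z0-9']+")


def tokenize(lyrics):
    """Lowercase the lyrics and split them into words."""
    return WORD.findall(lyrics.lower())


def vocabulary_stats(lyrics):
    """Return total words, unique words, and the unique/total ratio."""
    words = tokenize(lyrics)
    total = len(words)
    unique = len(set(words))
    # Guard against instrumental tracks or empty scrapes.
    ratio = unique / total if total else 0.0
    return {"total": total, "unique": unique, "ratio": ratio}


if __name__ == "__main__":
    print(vocabulary_stats("La la la, I love you. La la!"))
```

A song with a heavily repeated chorus drives the ratio down even when its verses are varied, which is worth keeping in mind when reading the yearly averages.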
The rise in total lyrics can probably be attributed to a shift in genre: big band music and disco likely have fewer words per song than rap or rock and roll. Also, earlier big band music might not have had a lyrical chorus, which would significantly reduce the number of repeated words. Interestingly, total and unique words experience a peak in 2003, which could be due to the large number of rap and R&B hits at the time. The year’s top 40 are smattered with 50 Cent, Eminem, Jay Z, and others. The dip since then could be a product of the popularity of electronic music and dance hits.
I wanted to dig a little deeper, so I explored the average word length in lyrics over time. The average word length was extremely close to four characters every year. I also found the average number of words of various lengths over time. Likewise, the ratio of four, five, six, seven, and eight character words has stayed pretty consistent over time. So, while there might be a lower ratio of unique to total words in today’s music, the words that were being used way back when were not necessarily “more intelligent” words. That being said, word length alone doesn’t say a whole lot about the quality of those words.
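For reference, here is a minimal sketch of how the word-length numbers could be computed for one song. It assumes apostrophes do not count toward a word's length (so "don't" is four characters); the `length_profile` name and that counting rule are my assumptions for the example.

```python
import re
from collections import Counter


def length_profile(lyrics):
    """Return (average word length, share of words at each length)."""
    words = re.findall(r"[a-z']+", lyrics.lower())
    # Assumption: apostrophes are not counted as characters.
    lengths = [len(w.replace("'", "")) for w in words]
    lengths = [n for n in lengths if n > 0]
    if not lengths:
        return 0.0, {}
    counts = Counter(lengths)
    average = sum(lengths) / len(lengths)
    shares = {n: c / len(lengths) for n, c in counts.items()}
    return average, shares


if __name__ == "__main__":
    average, shares = length_profile("I love you baby")
    print(average, shares)
```

Averaging these per-song values within a year gives one point per year for each word length.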
As a sanity check, I finished by making sure that the data was relatively consistent. It would be a shame if my results were being thrown off by a few outliers with tons of large words or tons of unique words. I examined this possibility by calculating the coefficient of variation (the standard deviation divided by the mean) for word length per song and unique words per song. As a rule of thumb, when the coefficient of variation is less than 1, the data is considered to be fairly consistent. You can read more about determining CV from this StackExchange post, which leads to other helpful sources. It turns out that the lyrics are pretty consistent within each year, so outliers are not skewing results.
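The coefficient of variation is just the standard deviation divided by the mean, so the check is short. This sketch uses the population standard deviation from Python's `statistics` module; whether to use the population or sample version is a choice I'm assuming here, and the example values are made up.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean."""
    m = mean(values)
    if m == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return pstdev(values) / m


if __name__ == "__main__":
    # Made-up unique-word counts for the songs in one year.
    unique_words_per_song = [2, 4, 4, 4, 5, 5, 7, 9]
    print(coefficient_of_variation(unique_words_per_song))
```

Computing this once per year for each metric makes it easy to spot a year where a single unusual song dominates the average.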
A more comprehensive analysis of pop eloquence would require a better understanding of the meaning of the lyrics, but this 10,000-foot view suggests that pop lyrics have gotten less creative over time. Lyrics certainly seem to be repeating themselves more often. The general rise in both number of words and unique words per song can possibly be attributed to more lyrically dense music like rap or indie rock, as compared to big band music from the 1950s and disco from the 1970s. Since the average word length has hardly changed over time, I think it’s difficult to definitively call today’s music “stupider”, as Briggs suggests. A safer and more substantial claim would be that music has gotten wordier and takes advantage of repeated choruses more frequently.
All of the code for this experiment is hosted on GitHub. The database is even dumped there if you want access to the lyrics.
I’ve left my comment on HN