Youn Hee Pernling Frödin

May 15, 2018

The Mood of Music

This was my final project for the Data Science Immersive program at General Assembly.

The question: in which of the top 10 music markets should a record label release a new artist, based on the mood of the artist's music and on what music is popular in each market (country)?

Data
The initial data consisted of two data sets from Kaggle: Spotify’s worldwide daily song ranking and song lyrics from Lyrics.com. I then added more features via Spotify’s API, and more lyrics by scraping Genius and by adding lyrics manually.

On average, I had lyrics for more than 68 percent of the top songs across the 8 music markets I analyzed.

Limitations
- Only data from 2017 was used.
- Assumption: all songs on the top lists are considered popular songs.
- Only the 10 biggest music markets were considered. Japan and South Korea were excluded because their lyrics contain non-Roman characters, which is problematic when working with Natural Language Processing (NLP). In the end, that left 8 of the top 10 music markets.

Target/y-variable
The happy, sad, and neutral labels came from the Valence column, which originally came from the Spotify API. The variable ranges from 0 to 1 and describes the musical positiveness of a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
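As a minimal sketch of how a valence score could be turned into a happy/sad/neutral label: the function name `valence_to_mood` and the two cut-off values (0.4 and 0.6) are hypothetical placeholders chosen for illustration, not the thresholds used in the project.

```python
def valence_to_mood(valence, sad_below=0.4, happy_above=0.6):
    """Map a valence score in [0, 1] to a coarse mood label.

    The default cut-offs are illustrative assumptions only.
    """
    if not 0.0 <= valence <= 1.0:
        raise ValueError("valence must be between 0 and 1")
    if valence < sad_below:
        return "sad"
    if valence > happy_above:
        return "happy"
    return "neutral"


# Three made-up valence scores: low, middle, high.
moods = [valence_to_mood(v) for v in (0.12, 0.50, 0.93)]
print(moods)
```

Keeping the cut-offs as parameters makes it easy to see how sensitive the labels are to where the lines are drawn.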

Correlation
Only the variable Energy showed any real correlation with the target. That is not ideal, because it leaves the models with few features that carry a clear signal about valence.
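A correlation check like this one can be sketched with pandas. The numbers below are made up for illustration and are not taken from the project's data; the column names other than valence and energy are assumptions as well.

```python
import pandas as pd

# Illustrative numbers only, not the project's data set.
tracks = pd.DataFrame({
    "valence":      [0.20, 0.35, 0.50, 0.65, 0.80, 0.90],
    "energy":       [0.30, 0.45, 0.40, 0.70, 0.75, 0.85],
    "tempo":        [120, 95, 140, 100, 128, 110],
    "acousticness": [0.60, 0.10, 0.55, 0.20, 0.50, 0.15],
})

# Pearson correlation of every feature with the target,
# ordered by absolute strength (strongest first).
corr_with_target = (
    tracks.corr()["valence"]
    .drop("valence")
    .sort_values(key=abs, ascending=False)
)
print(corr_with_target)
```

Sorting by absolute value matters here, since a strong negative correlation would be just as useful to a model as a strong positive one.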

Some tools and methods used
- Sentiment Analysis in TextBlob (NLP)
- Scikit-learn's built-in text feature extraction: CountVectorizer and TfidfVectorizer (NLP)
- Linear Regression
- Lasso Regressor
- Random Forest Regressor
- AdaBoost Regressor
- Gradient Boosting Regressor
- GridSearchCV (tuning hyperparameters)
- RandomizedSearchCV (tuning hyperparameters)
- StandardScaler (feature scaling, standardization)
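To illustrate the two text vectorizers in the list above, here is a small sketch on made-up lyric snippets (not lyrics from the data set). It uses scikit-learn's built-in English stop word list; for other languages, `stop_words` also accepts a custom list of words, and how the project combined its lists is not shown here.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up lyric snippets for illustration only.
lyrics = [
    "i love the way you dance tonight",
    "tears fall down and i cry tonight",
    "dance dance dance all night long",
]

# Raw word counts per song, with English stop words removed.
count_vec = CountVectorizer(stop_words="english")
counts = count_vec.fit_transform(lyrics)

# TF-IDF weights: words that appear in many songs are down-weighted.
tfidf_vec = TfidfVectorizer(stop_words="english")
tfidf = tfidf_vec.fit_transform(lyrics)

# vocabulary_ maps each kept word to its column index.
dance_col = count_vec.vocabulary_["dance"]
print(sorted(count_vec.vocabulary_))
print(counts.toarray()[:, dance_col])
```

Both vectorizers produce one row per song and one column per word, so their output can be placed next to numeric features such as energy before fitting a model.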

Models and other things

I used sentiment analysis from TextBlob in all models. In all models except the MVP, I also used CountVectorizer and TfidfVectorizer to extract specific words (with stop words in 4 different languages). I scaled the features before running the models, except for the Random Forest, which does not need scaling. I started with GridSearchCV, but because of how long it takes to run I gradually switched to RandomizedSearchCV for tuning my hyperparameters. I tried different numbers of features, selecting both the most common and the most important ones. For the last models I also added dummy variables for the different markets, along with the average number of streams and the average top list position.
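The scaling and randomized search steps could look roughly like the sketch below. It uses synthetic data, a Lasso model, and a hypothetical list of `alpha` candidates; none of these are the project's actual data or search space.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 60 "songs" with 5 numeric features.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
y = 0.5 * X[:, 0] + rng.normal(scale=0.1, size=60)

# Scaling sits inside the pipeline so it is re-fit on each
# cross-validation training fold, not on the held-out fold.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", Lasso(max_iter=10_000)),
])

# Hypothetical search space, chosen for illustration.
param_distributions = {"model__alpha": [0.001, 0.01, 0.1, 1.0, 10.0]}

search = RandomizedSearchCV(
    pipeline,
    param_distributions,
    n_iter=3,         # try 3 of the 5 candidates instead of all of them
    cv=3,
    random_state=42,
)
search.fit(X, y)
print(search.best_params_)
```

The time saving over a grid search comes from `n_iter`: only that many parameter combinations are fitted, at the cost of possibly missing the best one.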

Comparing the Best Models

To compare the models fairly, I ran each model 3 times with a different random state, to make sure that the result of the first run was not just good or bad luck in the split. I then took the average of the 3 scores to get the green dot in the plot, and the standard deviation of the 3 scores to get the red line.
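A minimal sketch of this repeat-and-average procedure, on synthetic data. The seeds, the test size, and the use of `model.score` (R² for scikit-learn regressors) are assumptions; the write-up does not say which metric or split size the project used.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, not the project's data set.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X[:, 0] + rng.normal(scale=0.5, size=200)

scores = []
for seed in (1, 2, 3):  # three different random states
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed
    )
    model = RandomForestRegressor(n_estimators=50, random_state=seed)
    model.fit(X_train, y_train)
    scores.append(model.score(X_test, y_test))  # R^2 on the held-out split

mean_score = np.mean(scores)  # the "dot" for this model
std_score = np.std(scores)    # the length of the "line" for this model
print(mean_score, std_score)
```

A small standard deviation suggests the score does not depend much on which songs happen to land in the test split.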

The Best Model
The best model was the Random Forest that used CountVectorizer and all features. It scored 22.93%, a good indication that the data I used was not well suited to predicting the mood of a song.

Most Important Features

As we could see in the correlation heat map, Energy was the most important feature. The other features were not as important. Above you can see the top 10 most important features.
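A ranking like this can be read from a fitted Random Forest via its `feature_importances_` attribute. The sketch below uses synthetic data in which valence is constructed to depend mostly on energy; the column `sentiment_polarity` is a hypothetical name, and the values are not from the project.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic features; by construction only energy drives the target.
rng = np.random.default_rng(0)
features = pd.DataFrame({
    "energy": rng.uniform(0, 1, 300),
    "danceability": rng.uniform(0, 1, 300),
    "sentiment_polarity": rng.uniform(-1, 1, 300),  # hypothetical column
})
valence = 0.8 * features["energy"] + rng.normal(scale=0.05, size=300)

forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(features, valence)

# Impurity-based importances sum to 1 across all features.
importances = pd.Series(
    forest.feature_importances_, index=features.columns
).sort_values(ascending=False)
top_features = importances.head(10)
print(top_features)
```

These importances are relative: a feature can rank first and the model can still explain little of the target overall.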

In which markets was the model strong?

The best model (the Random Forest) performed differently in the different markets. It did a little better in Canada and a lot worse in Italy. In the future it would be interesting to study these differences further, to find out why performance varied so much.

The markets seem to be very different

Above you can see the most popular song in the US in 2017 (based on the Spotify top list): Goosebumps by Travis Scott. As you can see, the song was on the top list in the US and Canada during the entire year, but it seems not to have been on the top list in France and the Netherlands at all. This is not 100% correct, however. The plot shows the top list position for every seventh day only, and the song actually appeared on the top list once or twice in France and the Netherlands. Still, the plot gives a good visual of how much a song's popularity can differ between markets.
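The effect of plotting only every seventh day can be shown with a small sketch. The two chart dates below are hypothetical, picked only to illustrate how short appearances can fall between the sampled days.

```python
import pandas as pd

# One entry per day of 2017: was the song on this market's top list?
days = pd.date_range("2017-01-01", "2017-12-31", freq="D")
on_chart = pd.Series(False, index=days)

# Hypothetical: the song charts on just two days in this market.
on_chart.loc[pd.Timestamp("2017-03-08")] = True
on_chart.loc[pd.Timestamp("2017-03-09")] = True

# Keep every seventh day, starting from January 1.
weekly = on_chart.iloc[::7]

days_on_chart = int(on_chart.sum())
days_visible_in_plot = int(weekly.sum())
print(days_on_chart, days_visible_in_plot)
```

With these dates the song is on the chart for two days but never on a sampled day, so a weekly plot would show it as absent from that market.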

Takeaways
It turned out that predicting the mood of a song is hard.

I learnt a lot about music popularity in the largest music markets. I want to focus more on the differences between the markets and on how a company connected to the music industry can make the best use of that knowledge.

HERE you can read more about the Capstone project