Spotify & Billboard Top 100: Song Popularity

Sam DeMarinis and Nick Radwin

Link to our webpage: https://samdemarinis.github.io/

Project Goals

The goal of this project is to understand what features of a song are strongly associated with high popularity in order to build an accurate model that predicts a given song's popularity. We will be incorporating datasets found online that have utilized Spotify's API as well as creating our own to see how a prediction for a song's popularity may change depending on the dataset used to build the model. Lastly, we will be utilizing Billboard Top 100 data before and after the COVID-19 pandemic to investigate how the importance of certain features of a song in predicting popularity may have changed from before/after the pandemic.

Key Questions to Answer

  • Where did the dataset we are utilizing come from? How was it obtained and how are the variables defined?

  • How can we build an accurate model to predict a song's 'popularity' based off features such as 'name', 'album', 'artist', 'release_date', 'length', 'explicit', 'acousticness', 'danceability', 'energy', 'instrumentalness', 'liveness', 'loudness', 'speechiness', 'tempo', 'time_signature', 'key', 'mode', and 'valence'?

  • What features provided by Spotify's API are the most important in predicting song popularity?

  • Can we build a predictive model for popularity on Spotify data that we pulled ourselves? If so, how does this model compare to the one that we built on the pre-packaged dataset "Spotify Tracks Dataset"?

  • Is there a relationship between song popularity on Spotify and song popularity on the Billboard Top 100? Are we able to predict how many weeks a song will appear on the Billboard Top 100 Chart in a year?

  • How has the importance of the above features changed in predicting a song's popularity pre- and 'post-' COVID-19?

  • We hypothesize that people are listening to sadder, less positive music after the COVID-19 pandemic. Is this hypothesis supported?

Spotify Tracks Dataset

The first dataset that we are working with is the Spotify Tracks Dataset available on Kaggle (https://www.kaggle.com/datasets/maharshipandya/-spotify-tracks-dataset). This dataset was constructed by the Kaggle user @MAHARSHIPANDYA in 2022 using Spotify's API. It contains almost 120,000 rows each corresponding to a song on Spotify. While the method for selecting the songs is not clearly stated, it can be assumed that it was a randomized process due to the large variety of songs. The columns of the dataset describe features of the songs, ranging from basic identifiers such as track_id and and album_name to more musical qualities such as energy and danceability. The full breakdown of what each variable entails can be found under the Column Description of the Spotify Tracks Dataset, linked above.

Methods for Quantifiying Subjective Variables

The variable that we are initially interested in predicting is popularity. But how does one measure things like popularity, energy, or danceability? Spotify Web API describes how these seemingly subjective measures were calculated. Below is a list of the most subjective variables in the dataset and how they were calculated.

popularity: "The value will be between 0 and 100, with 100 being the most popular. The popularity of a track is a value between 0 and 100, with 100 being the most popular. The popularity is calculated by algorithm and is based, in the most part, on the total number of plays the track has had and how recent those plays are."

danceability: "Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable."

energy: "Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale."

speechiness: "Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks."

liveness: "Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live."

valence: "A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry)."

mode: "Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0."

Now that we understand how these variables were quantified, we can properly assess their level of subjectivity. Popularity, the label we will be initially predicting, is based on an algorithm involving a track's plays. This variable is not very subjective. However, variables like danceability and valence still contain some subjectivity. Knowing this and what it is we want to predict, we can begin extracting, transforming, and loading this dataset.

Extracting, Transforming, and Loading: Spotify Tracks Dataset

First, we must mount our Google Drive and change the directory to our Drive.

In [ ]:
from google.colab import drive
import pandas as pd
import pylab as pl

drive.mount('/content/drive')
%cd /content/drive/My Drive
Mounted at /content/drive
/content/drive/My Drive

Reading in the Spotify Tracks Datatset csv file and making a copy:

In [ ]:
spotify_df = pd.read_csv("dataset.csv")
spotify_df_copy = spotify_df.copy()
spotify_df_copy.head()
Out[ ]:
Unnamed: 0 track_id artists album_name track_name popularity duration_ms explicit danceability energy ... loudness mode speechiness acousticness instrumentalness liveness valence tempo time_signature track_genre
0 0 5SuOikwiRyPMVoIQDJUgSV Gen Hoshino Comedy Comedy 73 230666 False 0.676 0.4610 ... -6.746 0 0.1430 0.0322 0.000001 0.3580 0.715 87.917 4 acoustic
1 1 4qPNDBW1i3p13qLCt0Ki3A Ben Woodward Ghost (Acoustic) Ghost - Acoustic 55 149610 False 0.420 0.1660 ... -17.235 1 0.0763 0.9240 0.000006 0.1010 0.267 77.489 4 acoustic
2 2 1iJBSr7s7jYXzM8EGcbK5b Ingrid Michaelson;ZAYN To Begin Again To Begin Again 57 210826 False 0.438 0.3590 ... -9.734 1 0.0557 0.2100 0.000000 0.1170 0.120 76.332 4 acoustic
3 3 6lfxq3CG4xtTiEg7opyCyx Kina Grannis Crazy Rich Asians (Original Motion Picture Sou... Can't Help Falling In Love 71 201933 False 0.266 0.0596 ... -18.515 1 0.0363 0.9050 0.000071 0.1320 0.143 181.740 3 acoustic
4 4 5vjLSffimiIP26QG5WcN2K Chord Overstreet Hold On Hold On 82 198853 False 0.618 0.4430 ... -9.681 1 0.0526 0.4690 0.000000 0.0829 0.167 119.949 4 acoustic

5 rows × 21 columns

The dataset already seems to be pretty tidy, but there is some more tidying to be done. There is an extra column with no purpose, so we can go ahead and remove it:

In [ ]:
spotify_df_copy.drop(columns=["Unnamed: 0"], inplace=True)

We should always check the dtypes for each of our variables. Pandas may misinterpret certain variables as categorical instead of quanitative, or vice versa. Luckily, we can see that Pandas correctly interpreted the dtype for each variable in our DataFrame:

In [ ]:
spotify_df_copy.dtypes
Out[ ]:
track_id             object
artists              object
album_name           object
track_name           object
popularity            int64
duration_ms           int64
explicit               bool
danceability        float64
energy              float64
key                   int64
loudness            float64
mode                  int64
speechiness         float64
acousticness        float64
instrumentalness    float64
liveness            float64
valence             float64
tempo               float64
time_signature        int64
track_genre          object
dtype: object

We should also check for duplicate and NaN values in our DataFrame:

In [ ]:
spotify_df_copy.duplicated().sum()
Out[ ]:
450
In [ ]:
spotify_df_copy.isnull().sum().sum()
Out[ ]:
3

We can now drop the duplicate rows and rows containing NaNs:

In [ ]:
spotify_df_copy.drop_duplicates(inplace=True)
spotify_df_copy.dropna(inplace=True)
spotify_df_copy.head()
Out[ ]:
track_id artists album_name track_name popularity duration_ms explicit danceability energy key loudness mode speechiness acousticness instrumentalness liveness valence tempo time_signature track_genre
0 5SuOikwiRyPMVoIQDJUgSV Gen Hoshino Comedy Comedy 73 230666 False 0.676 0.4610 1 -6.746 0 0.1430 0.0322 0.000001 0.3580 0.715 87.917 4 acoustic
1 4qPNDBW1i3p13qLCt0Ki3A Ben Woodward Ghost (Acoustic) Ghost - Acoustic 55 149610 False 0.420 0.1660 1 -17.235 1 0.0763 0.9240 0.000006 0.1010 0.267 77.489 4 acoustic
2 1iJBSr7s7jYXzM8EGcbK5b Ingrid Michaelson;ZAYN To Begin Again To Begin Again 57 210826 False 0.438 0.3590 0 -9.734 1 0.0557 0.2100 0.000000 0.1170 0.120 76.332 4 acoustic
3 6lfxq3CG4xtTiEg7opyCyx Kina Grannis Crazy Rich Asians (Original Motion Picture Sou... Can't Help Falling In Love 71 201933 False 0.266 0.0596 0 -18.515 1 0.0363 0.9050 0.000071 0.1320 0.143 181.740 3 acoustic
4 5vjLSffimiIP26QG5WcN2K Chord Overstreet Hold On Hold On 82 198853 False 0.618 0.4430 2 -9.681 1 0.0526 0.4690 0.000000 0.0829 0.167 119.949 4 acoustic

Our DataFrame is now in great shape for analysis. An important note to make is that, while we requested for Pandas to drop duplicate rows (or tracks) in our DataFrame, there are still multiple rows that are the same song. This is due to the fact that some songs are listed as being in more than one genre, and for each genre they fall into they are entered another time into the dataset. Pandas does not interpret these rows as being duplicates because they differ on the track_genre variable. To see this, let's look up all rows that have a specific track_id:

In [ ]:
spotify_df_copy[(spotify_df_copy['track_id'] == "2QjOHCTQ1Jl3zawyYOpxh6")]
Out[ ]:
track_id artists album_name track_name popularity duration_ms explicit danceability energy key loudness mode speechiness acousticness instrumentalness liveness valence tempo time_signature track_genre
2003 2QjOHCTQ1Jl3zawyYOpxh6 The Neighbourhood I Love You. Sweater Weather 93 240400 False 0.612 0.807 10 -2.81 1 0.0336 0.0495 0.0177 0.101 0.398 124.053 4 alt-rock
3003 2QjOHCTQ1Jl3zawyYOpxh6 The Neighbourhood I Love You. Sweater Weather 93 240400 False 0.612 0.807 10 -2.81 1 0.0336 0.0495 0.0177 0.101 0.398 124.053 4 alternative
81853 2QjOHCTQ1Jl3zawyYOpxh6 The Neighbourhood I Love You. Sweater Weather 93 240400 False 0.612 0.807 10 -2.81 1 0.0336 0.0495 0.0177 0.101 0.398 124.053 4 pop
91002 2QjOHCTQ1Jl3zawyYOpxh6 The Neighbourhood I Love You. Sweater Weather 93 240400 False 0.612 0.807 10 -2.81 1 0.0336 0.0495 0.0177 0.101 0.398 124.053 4 rock

Let's investigate this table before we start constructing a model. Since we are interested in analyzing and eventually creating a model to predict the popularity of a song, let's look up songs with a popularity rating greater than 90:

In [ ]:
# let's look at songs with high popularity:
spotify_df_copy["popularity"].unique()
popular_songs = spotify_df_copy[(spotify_df_copy["popularity"]  >= 90)]["track_id"].unique()
popular_songs
Out[ ]:
array(['2QjOHCTQ1Jl3zawyYOpxh6', '3JvKfv6T31zO0ini8iNItO',
       '5IgjP7X4th6nMNDh4akUHb', '3nqQXoyQOWXiESFLlDF1hG',
       '4uUG5RXrOk84mYEfFvj3cK', '0mBP9X2gPCuapvpZ7TGDk3',
       '0hquQWY3xvYqN4qtiquniF', '4C6Uex2ILwJi9sZXRdmqXp',
       '4zN21mbAuaD0WqtmaTZZeP', '1xzi1Jcr7mEi9K2RfzLOqS',
       '1Fid2jjqsHViMX6xNH70hE', '5XeFesFbtLpXzIVDNQP22n',
       '3F5CgOj3wFlRv51JsHbxhe', '38T0tPVZHcPZyhtOcCP7pF',
       '2tTmW7RDtMQtBk7m2rYeSw', '5ww2BF9slyYgNOk37BlC4u',
       '6Sq7ltF9Qa7SNFBsV5Cogx', '1IHWl5LamUGEuP4ozKQSXZ',
       '3k3NWokhRRkEPhCzPmV8TW', '5Eax0qFko2dh7Rl2lYs3bx',
       '2rurDawMfoKP4uHyb2kJBt', '6Xom58OOXk2SoU711L2IXO',
       '31i56LZnwE6uSu3exoHjtB', '79HZAZNnOE97rb2hnI0XQr',
       '41oY4WCTj5kccfesTVFnvN', '6i1g5ZRmJZAkDwBaUZ3f2i',
       '4tYFy8ALRjIZvnvSLw5lxN', '1797zYiX4cKosMH836X9Gt',
       '4h9wh7iOZ0GGn8QVp4RAOB', '75FEaRjZTKLhTrFGsfMUXR',
       '0VjIjW4GlUZAMYd2vXMi3b', '7MXVkk9YMctZqd1Srtv4MB',
       '58HvfVOeJY7lUuCqF0m3ly', '2eAvDnpXP5W0cVtiI0PUxV',
       '0T5iIrXA4p5GsubkhuBIKV', '4LRPiXqCikLlN15c3yImP7',
       '0WtM2NBVQNNJLh6scP13H8', '1cKHdTo9u0ZymJdPGSh6nq',
       '6xGruZOHLs39ZbVccQTuPZ', '4Dvkj6JhhA12EX05fT7y2e',
       '7dSZ6zGTQx66c2GF91xCrb', '1ga4PztXOIw1yBbdUt2X8v',
       '7DbdUf8aHSYoliSjO6LZv6'], dtype=object)

Let's say that we are interested in looking at the energy and danceability of these popular songs, and if there is a common trend for this variable:

In [ ]:
# let's look at the energy of these songs:
energy_list = []
dance_list = []
for i in popular_songs:
  x = spotify_df_copy[(spotify_df_copy['track_id'] == i)]["energy"].iloc[0]
  y = spotify_df_copy[(spotify_df_copy['track_id'] == i)]["danceability"].iloc[0]
  energy_list.append(x)
  dance_list.append(y)

data = {'track_id':popular_songs,
        'Energy':energy_list,
        'Danceability':dance_list
        }
        
df_pop_ED = pd.DataFrame(data, index=range(0,43))
df_pop_ED.head()
Out[ ]:
track_id Energy Danceability
0 2QjOHCTQ1Jl3zawyYOpxh6 0.807 0.612
1 3JvKfv6T31zO0ini8iNItO 0.537 0.445
2 5IgjP7X4th6nMNDh4akUHb 0.690 0.733
3 3nqQXoyQOWXiESFLlDF1hG 0.472 0.714
4 4uUG5RXrOk84mYEfFvj3cK 0.965 0.561

We can plot these energy and danceability ratings:

In [ ]:
import seaborn as sns
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8, 8))
 
sns.histplot(data=df_pop_ED, x='Energy', color='green', bins=20, stat='frequency', kde=True).set(title='Energy Rating of Most Popular Songs')
plt.show()
In [ ]:
fig, ax = plt.subplots(figsize=(8, 8))
 
sns.histplot(data=df_pop_ED, x='Danceability', color='green', bins=15, stat='frequency', kde=True).set(title='Danceability Rating of Most Popular Songs')
plt.show()

From the histograms, we can see that the most common energy rating among the most popular songs is around 0.7 out of 1, and the most common dancability is around 0.65 out of 1.

Now, let's take a look at the most common genres of the most popular songs:

In [ ]:
popular_songs_df = spotify_df_copy[(spotify_df_copy["popularity"]  >= 90)]

fig, ax = plt.subplots(figsize=(12, 6))
plot2 = sns.countplot(data=popular_songs_df, x='track_genre', order = popular_songs_df['track_genre'].value_counts().index, ax=ax).set(title='Genre of Most Popular Songs', xlabel='Genre', ylabel='Frequency')
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, horizontalalignment='right')
plt.show()

We can now see the distribution of genres among the most popular songs. Unsurprisingly, the most common genre of these songs is pop.

From our exploratory data analysis, we can see that popular songs are generally pop songs that have an energy rating of about 0.7 a dancability rating of about 0.65. However, this isn't very helpful if we are producing an R&B song, for example. So, let's build a predictive model that allows us to generalize across genre to predict the popularity rating of a song just using its features!

Creating a Predictive Model for Song Popularity

Now that we have looked at some of the summary statistics for this dataset, let's begin constructing a predictive model. We will use the machine learning algorithm of k-nearest neighbors regression and determine which value of k results in the lowest mean absolute error.

Hyperparameter Tuning

Let's first determine the best k value (number of neighbors) for our model:

In [ ]:
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
import numpy as np
from sklearn.model_selection import cross_val_score

vec = DictVectorizer(sparse=False)
scaler = StandardScaler()

features = ['acousticness', 'danceability',
            'energy', 'instrumentalness', 'liveness', 'loudness',
            'speechiness', 'tempo', 'time_signature', 'key', 'mode', 'valence']

small_df = spotify_df_copy.head(100)
X_dict = small_df[features].to_dict(orient="records")
y = small_df["popularity"]

# calculates estimate of test error based on 10-fold cross validation
def get_cv_error(k):
    model = KNeighborsRegressor(n_neighbors=k, n_jobs=-1)
    pipeline = Pipeline([("vectorizer", vec), ("scaler", scaler), ("fit", model)])
    mse = np.mean(-cross_val_score(
        pipeline, X_dict, y, 
        cv=10, scoring="neg_mean_squared_error"
    ))
    return mse
    
ks = pd.Series(range(1, 51))
ks.index = range(1, 51)
test_errs = ks.apply(get_cv_error)

test_errs.plot.line(figsize=(6,6), color='green')
plt.title("Validation MAE as a Function of K", size=30)
plt.xlabel("k", size=25)
plt.ylabel("Validation MAE", size=25)
Out[ ]:
Text(0, 0.5, 'Validation MAE')
In [ ]:
lst = []
for i,t in enumerate(test_errs):
  lst.append(get_cv_error(i+1))

print(np.array(lst).argmin())
5

According to the minimum validation Mean Average Error (MAE), the best value of k for our predictive model is 5. Now we can build our predictive model!

5-Nearest Neighbor Predictive Model:

In [ ]:
# define the training data
X_train = spotify_df_copy[[
            "danceability",
            "energy",
            "loudness", 
            "speechiness",
            "instrumentalness", 
            "time_signature"
            ]]

y_train = spotify_df_copy["popularity"]

# standardize the data
scaler = StandardScaler()
scaler.fit(X_train)
X_train_sc = scaler.transform(X_train)

# fit the 9-nearest neighbors model
model = KNeighborsRegressor(n_neighbors=5)
model.fit(X_train_sc, y_train)

# define the test data (Scikit-Learn expects a matrix)
x_new = pd.DataFrame()

x_new["danceability"] = [0.420]
x_new["energy"] = [0.166]
x_new["loudness"] = [-17.235]
x_new["speechiness"] = [0.0763]
x_new["instrumentalness"] = [0.000006]
x_new["time_signature"] = [4]

x_new_sc = scaler.transform(x_new)

# use the model to predict on the test data
model.predict(x_new_sc)
Out[ ]:
array([42.])

A song with these specific feature values has a true popularity of 55/100. However, our model predicts that this song would have a popularity of about 42/100. Let's take a look at how much error there is in our model on average using cross-validation:

Cross-Validation Error Report

In [ ]:
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
import numpy as np

# get the features (in dict format) and the labels
# (do not split into training and validation sets)
features = [
            "danceability",
            "energy",
            "loudness", 
            "speechiness",
            "instrumentalness", 
            "time_signature"
            ]
X_dict = spotify_df_copy[features].to_dict(orient="records")
y = spotify_df_copy["popularity"]

# specify the pipeline
vec = DictVectorizer(sparse=False)
scaler = StandardScaler()
model = KNeighborsRegressor(n_neighbors=5)
pipeline = Pipeline([("vectorizer", vec), ("scaler", scaler), ("fit", model)])

scores = cross_val_score(pipeline, X_dict, y, 
                         cv=5, scoring="neg_mean_squared_error")
np.sqrt(np.mean(-scores))
Out[ ]:
22.132933743391032

The average error of our model is about 22 popularity points out of 100. After fine-tuning with an ablation study looking at different combinations of features to predict popularity, we concluded that the average error doesn't get much better than 22 points. So a song's features are relatively good at predicting popularity, but it is important to keep external factors in mind such as the historical context around a song's release that also contribute to a particular song's popularity.

Creating Our Own Dataset Using Spotipy

We have just created a predictive model for song popularity based on the Spotify Tracks Dataset. Now, we are interested in constructing our own dataset to see how varying sources of data describing the same units of observation and features can result in more/less accurate predictive models. In order to build our own dataset of Spotify data, we must use Spotify's API.

A popular Python library for the Spotify Web API is called Spotipy. We will be utilizing this library because it is easy to unerstand and allows us to demonstrate commands in this notebook.

In [ ]:
# installing spotipy
!pip install urllib3
!pip install requests
!pip install spotipy
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Requirement already satisfied: urllib3 in /usr/local/lib/python3.8/dist-packages (1.24.3)
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Requirement already satisfied: requests in /usr/local/lib/python3.8/dist-packages (2.23.0)
Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.8/dist-packages (from requests) (3.0.4)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/lib/python3.8/dist-packages (from requests) (1.24.3)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.8/dist-packages (from requests) (2022.9.24)
Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.8/dist-packages (from requests) (2.10)
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting spotipy
  Downloading spotipy-2.22.0-py3-none-any.whl (28 kB)
Collecting requests>=2.25.0
  Downloading requests-2.28.1-py3-none-any.whl (62 kB)
     |████████████████████████████████| 62 kB 1.2 MB/s 
Requirement already satisfied: six>=1.15.0 in /usr/local/lib/python3.8/dist-packages (from spotipy) (1.15.0)
Collecting urllib3>=1.26.0
  Downloading urllib3-1.26.13-py2.py3-none-any.whl (140 kB)
     |████████████████████████████████| 140 kB 29.9 MB/s 
Collecting redis>=3.5.3
  Downloading redis-4.4.0-py3-none-any.whl (236 kB)
     |████████████████████████████████| 236 kB 58.8 MB/s 
Requirement already satisfied: async-timeout>=4.0.2 in /usr/local/lib/python3.8/dist-packages (from redis>=3.5.3->spotipy) (4.0.2)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.8/dist-packages (from requests>=2.25.0->spotipy) (2.10)
Requirement already satisfied: charset-normalizer<3,>=2 in /usr/local/lib/python3.8/dist-packages (from requests>=2.25.0->spotipy) (2.1.1)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.8/dist-packages (from requests>=2.25.0->spotipy) (2022.9.24)
Installing collected packages: urllib3, requests, redis, spotipy
  Attempting uninstall: urllib3
    Found existing installation: urllib3 1.24.3
    Uninstalling urllib3-1.24.3:
      Successfully uninstalled urllib3-1.24.3
  Attempting uninstall: requests
    Found existing installation: requests 2.23.0
    Uninstalling requests-2.23.0:
      Successfully uninstalled requests-2.23.0
Successfully installed redis-4.4.0 requests-2.28.1 spotipy-2.22.0 urllib3-1.26.13
In [ ]:
from fastai.collab import *
from fastai.tabular import *
import pandas as pd
import numpy as np
import spotipy
import spotipy.util as util
from spotipy.oauth2 import SpotifyClientCredentials
import spotipy.oauth2 as oauth2
from spotipy.oauth2 import SpotifyOAuth

In order to make calls to Spotify's API using Spotipy, we must register a Spotify account as a Developer. We will then create an Application, obtain our Client ID and Secret, and enter in a specific Redirect URI.

In [ ]:
# Sam's Spotify credentials

CLIENT_ID = "92f8907c269f4703917254ec2454f4c7"
CLIENT_SECRET = "fea4962791dd466d840feba0e11a11df"
username = "sam7303"
market = ['US']
redirect_uri = 'http://127.0.0.1:9090'
scope="user-library-read"

Now that we have our credentials stored into variables, we can create a Spotipy object that authorizes us as a valid Developer:

In [ ]:
# creating spotipy object

sp = spotipy.Spotify(auth_manager=SpotifyOAuth(client_id=CLIENT_ID,
                                               client_secret=CLIENT_SECRET,
                                               redirect_uri=redirect_uri,
                                               scope="user-library-read", open_browser=False))

We are interested in understanding what features may be indicative of a song's popularity at this point in time. Every few days, Spotify updates the playlist "Today's Top Hits" to contain the top 50 most played songs on Spotify. This is a good indication that a song is popular. We will be analyzing the features of the songs in the playlist over the course of a few weeks so that we can see the breakdown of the features.

We can obtain the playlist ID by looking at the end of its URL. Next, we can access each of the tracks in the playlist with the command below:

(Note: The code below will fetch the most recent songs in the "Today's Top Hits" playlist. The returned song data depends on when the code is run.)

In [ ]:
# "Today's Top Hits" playlist -- 50 of the most popular songs

PLAYLIST_ID = '37i9dQZF1DXcBWIGoYBM5M'
playlist_tracks = sp.user_playlist_tracks(CLIENT_ID, PLAYLIST_ID, fields='items,uri,name,id,total')

Now, we can find the song ID for each song in this playlist. This will allow us to then access the desired features for each song and create a dataset.

In [ ]:
songIDs = []

for i in range(len(playlist_tracks['items'])):
  songIDs.append(playlist_tracks['items'][i]['track']['id'])
  
print(songIDs)
len(songIDs)
['5Y35SjAfXjjG0sFQ3KOxmm', '0V3wPSX9ygBnCm8psDIegu', '3nqQXoyQOWXiESFLlDF1hG', '1xzi1Jcr7mEi9K2RfzLOqS', '2dHHgzDwk4BJdRwy9uXhTO', '0WtM2NBVQNNJLh6scP13H8', '0QHEIqNKsMoOY5urbzN48u', '4LRPiXqCikLlN15c3yImP7', '1IHWl5LamUGEuP4ozKQSXZ', '4uUG5RXrOk84mYEfFvj3cK', '1bDbXMyjaUIooNwFE9wn0N', '0O6u0VJ46W86TxN9wgyqDj', '73vIOb4Q7YN6HeJTbscRx5', '1RDvyOk4WtPCtoqciJwVn8', '4h9wh7iOZ0GGn8QVp4RAOB', '5jQI2r1RdgtuT8S3iG8zFC', '34ZAzO78a5DAVNrYIGWcPm', '5odlY52u43F5BjByhxg7wg', '5ww2BF9slyYgNOk37BlC4u', '1qEmFfgcLObUfQm0j1W2CK', '35ovElsgyAtQwYPYnZJECg', '5Z2MiIZ5I3jJvvmeWMLbOQ', '38T0tPVZHcPZyhtOcCP7pF', '5IgjP7X4th6nMNDh4akUHb', '2mnXxnrX5vCGolNkaFvVeM', '0HqZX76SFLDz2aW8aiqi7G', '3WMj8moIAXJhHsyLaqIIHI', '76OGwb5RA9h4FxQPT33ekc', '0hquQWY3xvYqN4qtiquniF', '78Sw5GDo6AlGwTwanjXbGh', '5CM4UuQ9Gnd6K2YyKGPMoK', '1PckUlxKqWQs3RlWXVBLw3', '6BePGk3eCan4FqaW2X8Qy3', '0T5iIrXA4p5GsubkhuBIKV', '26hOm7dTtBi0TdpDGl141t', '4FyesJzVpA39hbYvcseO2d', '4C6Uex2ILwJi9sZXRdmqXp', '0mBP9X2gPCuapvpZ7TGDk3', '4JBiO7wRnE6ueszEUpo347', '5HCyWlXZPP0y6Gqq8TgA20', '2TktkzfozZifbQhXjT6I33', '5ildQOEKmJuWGl2vRkFdYc', '59nOXPmaKlBfGMDeOVGrIK', '72yP0DUlWPyH8P7IoxskwN', '5hnGrTBaEsdukpDF6aZg8a', '1Ame8XTX6QHY0l0ahqUhgv', '2tTmW7RDtMQtBk7m2rYeSw', '5unjCay0kUjuej5ebn4nS4', '3LtpKP5abr2qqjunvjlX5i', '0ARKW62l9uWIDYMZTUmJHF']
Out[ ]:
50

We can utilize the track() function provided by Spotipy to access important information and features about each song. For example, the following code accesses a few of these features for a specific song:

In [ ]:
track_info = sp.track("0V3wPSX9ygBnCm8psDIegu")
track_features = sp.audio_features('0V3wPSX9ygBnCm8psDIegu')

# some important info
name = track_info['name']
album = track_info['album']['name']

# some important features
acousticness = track_features[0]['acousticness']
danceability = track_features[0]['danceability']

print(f'Song name: {name}')
print(f'Album: {album}')
print(f'Acousticness: {acousticness}')
print(f'Danceability: {danceability}')
Song name: Anti-Hero
Album: Midnights
Acousticness: 0.13
Danceability: 0.637

Now, we will follow these steps for each of the songs in the "Today's Top Hits" playlist. We will define a function that will give a list of track features given a song ID, and then call that function on each song ID in the songIDs list so that we can construct a DataFrame.

In [ ]:
def getTrackFeatures(songID):

  track_info = sp.track(songID)
  track_features = sp.audio_features(songID)

  name = track_info['name']
  album = track_info['album']['name']
  artist = track_info['album']['artists'][0]['name']
  release_date = track_info['album']['release_date']
  length = track_info['duration_ms']
  popularity = track_info['popularity']
  explicit = track_info['explicit']
  #genre = track_info['genre']
  # currently, spotify does not list the specific genre a track falls in. instead it offers the genres associated to the artist

  acousticness = track_features[0]['acousticness']
  danceability = track_features[0]['danceability']
  energy = track_features[0]['energy']
  instrumentalness = track_features[0]['instrumentalness']
  liveness = track_features[0]['liveness']
  loudness = track_features[0]['loudness']
  speechiness = track_features[0]['speechiness']
  tempo = track_features[0]['tempo']
  time_signature = track_features[0]['time_signature']
  key = track_features[0]["key"]
  mode = track_features[0]["mode"]
  valence = track_features[0]["valence"]

  track_features_list = [name, album, artist, release_date, length, popularity, explicit, acousticness, danceability, energy, 
                         instrumentalness, liveness, loudness, speechiness, tempo, time_signature, key, mode, valence]
  return track_features_list
In [ ]:
# XXX refers to the current month, day, and year in which you run the code

df_top_hits_XXX = pd.DataFrame(columns=['name', 'album', 'artist', 'release_date', 'length', 'popularity', 'explicit', 'acousticness',
                                            'danceability', 'energy', 'instrumentalness', 'liveness', 'loudness', 'speechiness',
                                           'tempo', 'time_signature', 'key', 'mode', 'valence'])

for id in songIDs:
  track_list = getTrackFeatures(id)
  df = pd.DataFrame(data=[track_list], columns=['name', 'album', 'artist', 'release_date', 'length', 'popularity', 'explicit', 'acousticness',
                                            'danceability', 'energy', 'instrumentalness', 'liveness', 'loudness', 'speechiness',
                                           'tempo', 'time_signature', 'key', 'mode', 'valence'])
  
  df_top_hits_XXX = pd.concat([df_top_hits_XXX, df], ignore_index=True)

df_top_hits_XXX.head()
Out[ ]:
name album artist release_date length popularity explicit acousticness danceability energy instrumentalness liveness loudness speechiness tempo time_signature key mode valence
0 Nobody Gets Me SOS SZA 2022-12-09 180853 77 True 0.8050 0.358 0.284 0 0.1830 -8.285 0.0285 99.796 3 7 1 0.276
1 Anti-Hero Midnights Taylor Swift 2022-10-21 200690 97 False 0.1300 0.637 0.643 0.000002 0.1420 -6.571 0.0519 97.008 4 4 1 0.533
2 Unholy (feat. Kim Petras) Unholy (feat. Kim Petras) Sam Smith 2022-09-22 156943 100 False 0.0130 0.714 0.472 0.000005 0.2660 -7.375 0.0864 131.121 4 2 1 0.238
3 CUFF IT RENAISSANCE Beyoncé 2022-07-29 225388 93 True 0.0368 0.780 0.689 0.00001 0.0698 -5.668 0.1410 115.042 4 7 1 0.642
4 Creepin' (with The Weeknd & 21 Savage) HEROES & VILLAINS Metro Boomin 2022-12-02 221520 86 True 0.4170 0.715 0.620 0 0.0822 -6.005 0.0484 97.950 4 1 0 0.172

In order to keep the data used in the beginning of this tutorial constant, let's read in data previously obtained by performing these same steps on November 7, 2022:

In [ ]:
df_top_hits_110722 = pd.read_csv('df_top_hits_110722.csv')
In [ ]:
df_top_hits_110722.drop(columns=["Unnamed: 0"], inplace=True)
df_top_hits_110722.head()

Saving the DataFrame we just created; exporting DataFrame as a csv file:

In [ ]:
df_top_hits_XXX.to_csv('df_top_hits_121122.csv')

Let's analyze this DataFrame containing the 50 most popular songs on Spotify as of 11/07/22. Unsurprisingly, most of these songs have a high popularity rating:

In [ ]:
pop_median = df_top_hits_110722["popularity"].median()
pop_mean = df_top_hits_110722["popularity"].mean()
pop_min = df_top_hits_110722["popularity"].min()
pop_max = df_top_hits_110722["popularity"].max()

print(f"Median popularity: {pop_median}")
print(f"Mean popularity: {pop_mean}")
print(f"Highest popularity: {pop_max}")
print(f"Lowest popularity: {pop_min}")

In order to understand the 'typical' values for each feature of these top songs, let's do some summary statistics for all variables. This will look a bit different for categorical versus quantitative variables.

In [ ]:
df_top_hits_110722['artist'].value_counts()[:10]

Looks like Taylor Swift, Lizzo, Harry Styles, and Doja Cat are tied for most songs by a single artist in the Top Hits Playlist for 11/07/22!

Let's calculate the mean for each of our variables to see the typical values among these most popular songs. Conveniently, Pandas allows us to use the mean() function on the entire DataFrame and it will broadcast it across each column.

In [ ]:
df_typicals_110722 = df_top_hits_110722.drop(columns=["name", "album", "artist", "release_date", "explicit"]).mean()
df_typicals_110722

You may have noticed that it doesn't make much sense to find the mean of the explicit variable. This is due to the fact that it is a categorical variable as opposed to quantitative like the rest. To find out if there are more explicit or non-explicit songs in the playlist, we can use value_counts() on this variable.

In [ ]:
df_top_hits_110722["explicit"].value_counts().idxmax()

# False -- there are more non-explicit than explicit songs!

The goal is to collect data from the "Today's Top Hits" playlist throughout the coming weeks in order to construct a singular DataFrame containing more rows. This way, we can build a predictive model for popularity that can be compared to the one that we built off of the Spotify Tracks Dataset. We will then be able to make conclusions about the importance of certain features in predicting popularity, what values of k minimizes mean absolute error in our KNeighborsRegressor model, and more.

After weeks of collecting data, our hope was that we would be able to construct a DataFrame with at least a couple hundred rows. This way, we would be able to create a significant predictive model for predicting song popularity that could then be compared to the previous model. However, a month's worth of data from the "Today's Top Hits" playlist was not enough to accomplish our goal:

In [ ]:
df_top_hits_110722 = pd.read_csv("df_top_hits_110722.csv")
df_top_hits_111122 = pd.read_csv("df_top_hits_111122.csv")
df_top_hits_111422 = pd.read_csv("df_top_hits_111422.csv")
df_top_hits_112822 = pd.read_csv("df_top_hits_112822.csv")
df_top_hits_120422 = pd.read_csv("df_top_hits_120422.csv")
df_top_hits_121122 = pd.read_csv("df_top_hits_121122.csv")

df_all_weeks = pd.concat([df_top_hits_110722,df_top_hits_111122,df_top_hits_111422,df_top_hits_112822,df_top_hits_120422,df_top_hits_121122]).drop_duplicates().reset_index(drop=True)

df_all_weeks.drop(columns="Unnamed: 0", inplace=True)

df_all_weeks.drop_duplicates(subset=['name']).shape
Out[ ]:
(64, 19)

As seen above, we were only able to construct a DataFrame of 64 songs. This is due to the fact that songs have stayed in this playlist throughout this entire month-long period of collecting the data. Therefore, we are unable to make a significant predictive model on this data. However, it gave us great practice with collecting data using Spotify's API. If the above process were to be repeated throughout the course of a few months/years, a significant model could be created.

Regardless, let's do some EDA on the data that we spent a month collecting. Let's start by analyzing the artists who appeared the most in the "Today's Top Hits" playlist throughout this time period.

In [ ]:
artists = df_all_weeks["artist"].value_counts()[0:10].rename_axis('artist').reset_index(name='frequency')

fig = plt.figure(figsize = (20, 8))
plt.bar(artists["artist"], artists["frequency"], color ='green', width = 0.4)
plt.xticks(size=15, rotation=30)
plt.yticks(size=15)
plt.xlabel("Artist", size=13)
plt.ylabel("Count", size=13)
plt.title("Most Persistent Artists in \"Today's Top Hits\" Playlist (Nov-Dec 2022)", size=25)
Out[ ]:
Text(0.5, 1.0, 'Most Persistent Artists in "Today\'s Top Hits" Playlist (Nov-Dec 2022)')

We can now see the distribution of artists with the most song entries in the playlist during this period.

Let's also take a look at the songs that stayed in the playlist the longest during this period:

In [ ]:
songs = df_all_weeks["name"].value_counts()[0:5].rename_axis('name').reset_index(name='frequency')

fig = plt.figure(figsize = (20, 8))
plt.bar(songs["name"], songs["frequency"], color ='red', width = 0.4)
plt.xticks(size=15)
plt.yticks(size=15)
plt.xlabel("Song", size=13)
plt.ylabel("Count", size=13)
plt.title("Most Persistent Songs in \"Today's Top Hits\" Playlist (Nov-Dec 2022)", size=25)
Out[ ]:
Text(0.5, 1.0, 'Most Persistent Songs in "Today\'s Top Hits" Playlist (Nov-Dec 2022)')

These are the five songs that stayed in the playlist the longest during this period. Since we pulled data 6 times, this means that "Shirt" and "Made You Look" were in the "Today's Top Hits" playlist each time we pulled new data.

Song Popularity: Before and After the COVID-19 Pandemic

Now that we have learned how to pull data from Spotify using their API, translate that data into a pandas DataFrame, perform summary statistics, and create predictive models using k-nearest neighbors linear regression, we will now use these skills to answer the final two questions listed at the start of this notebook:

How has the importance of the above features changed in predicting a song's popularity pre and 'post' COVID-19?

We hypothesize that people are listening to sadder, less positive music after the COVID-19 pandemic. Is this hypothesis supported?

In order to answer these questions, we will be looking at Billboard Top 100 data pre- and post-pandemic. We will use Spotify's API to access the same song features that we have been working with so far, but this time for those in the Top 100 list. We will create two DataFrames––one containing songs on the Billboard charts before the pandemic and one containing songs on the Billboard charts today ('after' the pandemic). We will then build two predictive models––one on the first DataFrame and one on the second––in order to assess the importance of each feature in predicting song popularity. This will be done by determining whether or not the p-value for each feature's weight is statistically significant to the prediciton.

Step 1: Collecting Pre- and Post-Pandemic Billboard Top 100 Chart Data

We can load up a csv file of all Billboard Top 100 Chart data since the year 1956 with the following command:

In [ ]:
url = "https://raw.githubusercontent.com/HipsterVizNinja/random-data/main/Music/hot-100/Hot%20100.csv"
billboard_df = pd.read_csv(url)

We will first downselect this data to only contain song content from the year 2019. This will act as our 'pre-pandemic' dataset. One thing to keep in mind is that this dataset, like the Spotify Tracks Dataset, contains multiple entries of the same songs depending on how many weeks they were in the Top 100. In order to have each song only be entered once, we will need to create a column that lists the number of times each song was entered into the dataset (i.e., how many weeks it was in the Top 100 during 2019).

In [ ]:
# downselecting for 2019 songs
billboard_df_2019 = billboard_df[(billboard_df["chart_date"].str.startswith("2019"))]
billboard_df_2019 = billboard_df_2019.sort_values(by=['song'])
# need to sort the df so it is in the order of the value_counts series

# finding number of times each song appears on the charts in 2019
longevity_series = billboard_df_2019["song"].value_counts()

# making every song only appear once
billboard_df_2019 = billboard_df_2019.drop_duplicates(subset=['song'])

# list of each song and how many times it was on the charts in 2019
longevity_series

Now that we have stored these numbers for each of the songs in this dataset into a variable longevity_series, we can sort the DataFrame to match up with this ordering:

In [ ]:
from pandas.api.types import CategoricalDtype

# sorting the dataframe to match the order of the longevity_series
cat_song_order = CategoricalDtype(
    longevity_series.index, 
    ordered=True
)

billboard_df_2019['song'] = billboard_df_2019['song'].astype(cat_song_order)
billboard_df_2019 = billboard_df_2019.sort_values('song')

# creating variables of the desired order for songs and artists
desired_song_order_list = billboard_df_2019["song"].values
desired_artist_order_list = billboard_df_2019["performer"].values

We have also stored the correct ordering of the songs into a variable called desired_song_order_list. Not only does this give us the desired order for the songs in our DataFrame, it can be easily used with Spotify's API to get track information for each song in the list.

In [ ]:
billboard_df_2019.head() # 2019 billboard data, now in order of most to least # of times on Chart in 2019

We can see with the command below that we hope to obtain information on 621 songs.

In [ ]:
# we have 621 songs we'd like to get spotify api info on
len(desired_song_order_list)

Let's iterate through each song in this list and attempt to fetch song IDs.

In [ ]:
# using spotify's api to get this info

track_id_list = []

for track in desired_song_order_list:
  try:
    track_json = sp.search(q='track:' + track, type='track')
    track_id = track_json['tracks']['items'][0]['id']
    track_id_list.append(track_id)
  except IndexError:
    track_id = "N/A"
    print(track)

We can see that we were unable to fetch the song ID for two songs: "Can't Leave Without It" and "Don't Check On Me." We can remove these from the longevity_series variable so that we can eventually add this Series to our DataFrame, and it will accurately indicate how many times each song appeared on the Charts in 2019.

(If we did not remove these songs from the Series and simply added the Series to the DataFrame we are about to create, the values would not line up correctly with each song because we will only be able to make a DataFrame of 619 songs).

Let's drop these two songs from longevity_series, checking the length of the series before and after to make sure we actually got rid of them:

In [ ]:
print(len(longevity_series))
longevity_series = longevity_series.drop(labels=["Can't Leave Without It", "Don't Check On Me"])
print(len(longevity_series))

We are now ready to utilize Spotify's API to make a DataFrame called df_billboard_2019. This DataFrame will contain each song that appeared on the Billboard Top 100 in 2019 and the exact same features we have been looking at since the beginning of this tutorial. We can use the getTrackFeatures() function that we previously made to easily fetch all of this track information.

In [ ]:
# creating a dataframe for the spotify api info on each of these songIDs
df_billboard_2019 = pd.DataFrame(columns=['name', 'album', 'artist', 'release_date', 'length', 'popularity', 'explicit', 'acousticness',
                                            'danceability', 'energy', 'instrumentalness', 'liveness', 'loudness', 'speechiness',
                                           'tempo', 'time_signature', 'key', 'mode', 'valence'])

for id in track_id_list:
  track_list = getTrackFeatures(id)
  df = pd.DataFrame(data=[track_list], columns=['name', 'album', 'artist', 'release_date', 'length', 'popularity', 'explicit', 'acousticness',
                                            'danceability', 'energy', 'instrumentalness', 'liveness', 'loudness', 'speechiness',
                                           'tempo', 'time_signature', 'key', 'mode', 'valence'])
  
  df_billboard_2019 = pd.concat([df_billboard_2019, df], ignore_index=True)

df_billboard_2019.head()

Let's drop any rows with NaNs and check that this DataFrame has the expected number of rows:

In [ ]:
df_billboard_2019 = df_billboard_2019.dropna()
len(df_billboard_2019)

Lastly, let's add a column "Number of Times on Chart" and set the values to be those of the longevity_series Series. Since we previously dropped the two songs that we were unable to find song IDs for, this Series can be easily added as a column, as it contains the right amount of elements.

In [ ]:
df_billboard_2019["Number of Times on Chart"] = longevity_series.values
In [ ]:
df_billboard_2019.head()
In [ ]:
import seaborn as sns
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8, 8))
 
sns.histplot(data=df_billboard_2019["Number of Times on Chart"], ax=ax, color='green')
plt.title("2019 Billboard Chart Longevity")
plt.show()

We can see from this histogram that there were a lot of songs in 2019 that were only on the Billboard Top 100 for less than 2 weeks. In contrast, there were few songs that remained on the chart for more than 25 weeks.

Our DataFrame looks great. Now, let's save it as a csv file:

In [ ]:
# saving this df:
df_billboard_2019.to_csv('df_billboard_2019.csv')

To avoid having to rerun the above code (as it takes a long time to run) we can just read in this DataFrame while working on this project:

In [ ]:
df_billboard_2019 = pd.read_csv("df_billboard_2019.csv")

Before we make a predictive model for song popularity using this dataset, let's first construct our "post-pandemic" dataset. We will follow the same steps as before, except we will initially downselect our dataset to only contain songs from 2022.

In [ ]:
# downselecting for 2022 songs
billboard_df_2022 = billboard_df[(billboard_df["chart_date"].str.startswith("2022"))]
billboard_df_2022 = billboard_df_2022.sort_values(by=['song'])
# need to sort the df so it is in the order of the value_counts series

# finding number of times each song appears on the charts in 2022
longevity_series_22 = billboard_df_2022["song"].value_counts()

# making every song only appear once
billboard_df_2022 = billboard_df_2022.drop_duplicates(subset=['song'])

# list of each song and how many times it was on the charts in 2022
longevity_series_22
In [ ]:
from pandas.api.types import CategoricalDtype

# sorting the dataframe to match the order of the longevity_series
cat_song_order = CategoricalDtype(
    longevity_series_22.index, 
    ordered=True
)

billboard_df_2022['song'] = billboard_df_2022['song'].astype(cat_song_order)
billboard_df_2022 = billboard_df_2022.sort_values('song')

# creating variables of the desired order for songs and artists
desired_song_order_list_22 = billboard_df_2022["song"].values
desired_artist_order_list_22 = billboard_df_2022["performer"].values
In [ ]:
billboard_df_2022.head() # 2022 billboard data, now in order of most to least # of times on Chart in 2022
In [ ]:
len(desired_song_order_list_22)
In [ ]:
# using spotify's api to get this info

track_id_22_list = []

for track in desired_song_order_list_22:
  try:
    track_json = sp.search(q='track:' + track, type='track')
    track_id = track_json['tracks']['items'][0]['id']
    track_id_22_list.append(track_id)
  except IndexError:
    track_id = "N/A"
    print(track)
In [ ]:
print(len(longevity_series_22))
longevity_series_22 = longevity_series_22.drop(labels=["Don't Think Jesus", "Christmas Tree Farm (Old Timey Version)", "I'd Do Anything To Make You Smile", "Oklahoma Smoke Show"])
print(len(longevity_series_22))
In [ ]:
# creating a dataframe for the spotify api info on each of these songIDs
df_billboard_2022 = pd.DataFrame(columns=['name', 'album', 'artist', 'release_date', 'length', 'popularity', 'explicit', 'acousticness',
                                            'danceability', 'energy', 'instrumentalness', 'liveness', 'loudness', 'speechiness',
                                           'tempo', 'time_signature', 'key', 'mode', 'valence'])

for id in track_id_22_list:
  track_list = getTrackFeatures(id)
  df = pd.DataFrame(data=[track_list], columns=['name', 'album', 'artist', 'release_date', 'length', 'popularity', 'explicit', 'acousticness',
                                            'danceability', 'energy', 'instrumentalness', 'liveness', 'loudness', 'speechiness',
                                           'tempo', 'time_signature', 'key', 'mode', 'valence'])
  
  df_billboard_2022 = pd.concat([df_billboard_2022, df], ignore_index=True)

df_billboard_2022.head()
In [ ]:
df_billboard_2022 = df_billboard_2022.dropna()
len(df_billboard_2022)
In [ ]:
df_billboard_2022["Number of Times on Chart"] = longevity_series_22.values
In [ ]:
df_billboard_2022.head()
In [ ]:
# saving this df:
df_billboard_2022.to_csv('df_billboard_2022.csv')

We can now just read in this DataFrame while working on this project:

In [ ]:
df_billboard_2022 = pd.read_csv("df_billboard_2022.csv")
df_billboard_2022_copy = df_billboard_2022.copy()

Step 2: Creating Predictive Models Using These DataFrames

We are going to first construct a model that will predict the number of times a song appeared on the Billboard Top 100 Chart in 2019 using the DataFrame df_billboard_2019. We will need to determine which features are the most important in predicting this label in order to get an accurate model. Once we know these features, we can best predict the number of times a song appeared on the Charts. We will repeat this whole process for the DataFrame df_billboard_2022, and see how the p-values have changed before and after the pandemic.

Let's start with the pre-pandemic data. We will train k-nearest neighbors regression models to predict the number of times a song was on the Charts in 2019 for different values of k. We will validate our model using 10-fold cross validation, and calculate the training and validation MAE of each model to determine the best value for k.

There are many variables in our datasets, some of which will not be very helpful in constructing a predictive model. We must investigate which variables will act as the best features for predicting our label of "Number of Times on Chart". To do this, let's use a Python library known as statsmodels to calculate each quantitative variable's p-value.

In [ ]:
import statsmodels.api as sm

# making a copy of dataframe
df_billboard_2019_copy = df_billboard_2019.copy()

potential_features = ['length', 'popularity', 'acousticness', 'danceability',
                            'energy', 'instrumentalness', 'liveness', 'loudness',
                            'speechiness', 'tempo', 'time_signature', 'key', 'mode', 'valence']

# define predictor and response variables
y = df_billboard_2019_copy["Number of Times on Chart"]
x = df_billboard_2019_copy[potential_features]

# add constant to predictor variables
x = sm.add_constant(x)

# fit linear regression model
model = sm.OLS(y, x).fit()

p_vals_2019 = []

for x in range (0, 14):
  p_vals_2019.append(model.pvalues[x])
  print(f"The p-value for {potential_features[x]} is {model.pvalues[x]}")
The p-value for length is 0.01913840003611477
The p-value for popularity is 0.027387017094701538
The p-value for acousticness is 5.1940228934844405e-11
The p-value for danceability is 0.09021632931923151
The p-value for energy is 0.9873908382078935
The p-value for instrumentalness is 0.04839219911699771
The p-value for liveness is 0.5517950995039411
The p-value for loudness is 0.032331565909685174
The p-value for speechiness is 0.015102938944052619
The p-value for tempo is 0.8884891710765106
The p-value for time_signature is 0.04408846110899298
The p-value for key is 0.37130985202204025
The p-value for mode is 0.2356982152183616
The p-value for valence is 0.6645892613643811
/usr/local/lib/python3.8/dist-packages/statsmodels/tsa/tsatools.py:142: FutureWarning: In a future version of pandas all arguments of concat except for the argument 'objs' will be keyword-only
  x = pd.concat(x[::order], 1)

A p-value less than 0.05 is considered to be statistically significant. Let's iterate through the p_vals list we just created and extract only the p-values less than 0.05.

In [ ]:
desired_features_2019 = []

for p in range(0, 14):
  if p_vals_2019[p] < 0.05:
    desired_features_2019.append(potential_features[p])
    print(f"{potential_features[p]}: {p_vals_2019[p]}")
length: 0.01913840003611477
popularity: 0.027387017094701538
acousticness: 5.1940228934844405e-11
instrumentalness: 0.04839219911699771
loudness: 0.032331565909685174
speechiness: 0.015102938944052619
time_signature: 0.04408846110899298

Now that we know which variables have the most statistically significant p-values, we can use these as features for our model to predict "Number of Times on Chart".

In [ ]:
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_validate
import matplotlib.pyplot as plt
from sklearn.model_selection import cross_val_score


# get the features (in dict format) and the labels
# (do not split into training and validation sets)
x_dict = df_billboard_2019_copy[desired_features_2019].to_dict(orient="records")
y = df_billboard_2019_copy["Number of Times on Chart"]

# specify the pipeline
vec = DictVectorizer(sparse=False)
scaler = StandardScaler()

# options for k
k_opts = [i for i in range(1, 50, 1)]

# empty list of validation errors
validation_maes = []

# empty list of training errors
training_maes = []

# creating each model, fitting to training set, predicting y_val, calculating MAE, and appending to list
for k in k_opts:
    model = KNeighborsRegressor(n_neighbors=k)
    pipeline = Pipeline([("vectorizer", vec), ("scaler", scaler), ("fit", model)])
    scores = cross_validate(pipeline, x_dict, y, cv=10, scoring="neg_mean_absolute_error", return_train_score=True)
    average_validation_mae = -scores["test_score"].mean()
    validation_maes.append(average_validation_mae)
    average_training_mae = -scores["train_score"].mean()
    training_maes.append(average_training_mae)
In [ ]:
# plotting
fig, ax = plt.subplots(1, 1, figsize=(8, 8))
ax.plot(k_opts, validation_maes, marker="o",linestyle="-", markersize=5, color="red", markeredgecolor="white", alpha=0.8)
ax.plot(k_opts, training_maes, marker="o",linestyle="-", markersize=5, color="blue", markeredgecolor="white", alpha=0.8)
ax.set_xlabel('k', size=25)
ax.set_ylabel('Mean Absolute Error (MAE)', size=25)
ax.set_title('Training & Validation MAE of KNN Models', size=30)
ax.legend(['Validation MAE', 'Training MAE'])
Out[ ]:
<matplotlib.legend.Legend at 0x7fe2ac326b50>

We can see from the graph that the training MAE is smaller than the validation MAE. This makes sense because it is harder for our model to predict data it has not seen before than data it has already seen. As the value of k increases, the training MAE hovers around 6, while the validation MAE hovers around 8. This means that our model is typically around 6 weeks off when predicting the number of weeks a song has been on the Top 100 Charts for data it has already seen, and typically around 8 weeks off for data it has not yet seen.

Now, let's determine the best value for k (i.e., the value of k that minimizes the validation MAE):

In [ ]:
vec = DictVectorizer(sparse=False)
scaler = StandardScaler()

x_dict = df_billboard_2019_copy[desired_features_2019].to_dict(orient="records")
y = df_billboard_2019_copy["Number of Times on Chart"]
In [ ]:
def get_cv_error(k):
    model = KNeighborsRegressor(n_neighbors=k)
    pipeline = Pipeline([("vectorizer", vec), ("scaler", scaler), ("fit", model)])
    mae = np.mean(-cross_val_score(
        pipeline, x_dict, y, 
        cv=10, scoring="neg_mean_absolute_error"
    ))
    return mae
In [ ]:
from matplotlib import pyplot as plt

ks = pd.Series(range(1, 51))
ks.index = range(1, 51)
test_errs = ks.apply(get_cv_error)

test_errs.plot.line(figsize=(8,8), color='green')
plt.title("Validation MAE as a Function of K", size=30)
plt.xlabel("k", size=25)
plt.ylabel("Validation MAE", size=25)
Out[ ]:
Text(0, 0.5, 'Validation MAE')
In [ ]:
print(f"The best value for k is {test_errs.idxmin()}, where the Validation MAE is {test_errs[test_errs.idxmin()]}.")
The best value for k is 5, where the Validation MAE is 7.592099418297197.

On average, our model is around 7 and a half weeks off when predicting the number of times a song was on the charts in 2019. Now, let's create a predictive model on the 2022 DataFrame. We will be utilizing the popular machine learning algorithm XGBoost to create this model as opposed to repeating KNN.

In [ ]:
import xgboost as xgb
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

x = df_billboard_2022_copy[potential_features]
y = df_billboard_2022_copy["Number of Times on Chart"]

data_dmatrix = xgb.DMatrix(data=x,label=y)

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=123)

xg_reg = xgb.XGBRegressor(objective ='reg:linear', colsample_bytree = 0.7, learning_rate = 0.1,
                max_depth = 5, alpha = 10, n_estimators = 25)

xg_reg.fit(x_train,y_train)

preds = xg_reg.predict(x_test)

rmse = np.sqrt(mean_squared_error(y_test, preds))

print("RMSE: %f" % (rmse))
[03:52:28] WARNING: /workspace/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
RMSE: 7.003653

With a bit of hyper-parameter tuning, we were able to get a RMSE of around 7. This means that our model is around 7 weeks off on average when predicting the number of times a song was on the Billboard charts in 2022.

In [ ]:
params = {"objective":"reg:linear",'colsample_bytree': 0.3,'learning_rate': 0.1,
                'max_depth': 5, 'alpha': 10}

cv_results = xgb.cv(dtrain=data_dmatrix, params=params, nfold=3,
                    num_boost_round=50,early_stopping_rounds=10,metrics="rmse", as_pandas=True, seed=123)
[04:01:37] WARNING: /workspace/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
[04:01:37] WARNING: /workspace/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
[04:01:37] WARNING: /workspace/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
In [ ]:
print((cv_results["test-rmse-mean"]).tail(1))
28    7.556
Name: test-rmse-mean, dtype: float64

Through 3-fold cross validation, we obtained an RMSE of 7.556 weeks. This error is very similar to the error of the model that we built on pre-COVID data using K-Nearest Neighbors Regression.

Now that we have built predictive models on the pre- and 'post'- COVID-19 datasets, let's move our analysis to the change in beta coefficients. In order to analyze how the importance of each variable has changed, we need to fit linear regression models to our 2019 and 2022 data individually. We will do so using SciKit Learn. This will allow us to retrieve the coefficients of each variable in relation to the dependent variable of "Number of TImes on Chart".

Let's start by obtaining the coefficients for the 2019 data.

In [ ]:
# using sklearn's LinearRegression to determine coefficients for features in 2019

from sklearn.linear_model import LinearRegression

x_2019 = df_billboard_2019_copy[['length', 'popularity', 'acousticness', 'danceability',
                            'energy', 'instrumentalness', 'liveness', 'loudness',
                            'speechiness', 'tempo', 'time_signature', 'key', 'mode', 'valence']]

x_2019_sc = (x_2019 - x_2019.mean()) / x_2019.std()

y_2019 = df_billboard_2019_copy["Number of Times on Chart"]
reg = LinearRegression().fit(x_2019_sc, y_2019)

reg.predict(x_2019)

reg.coef_
Out[ ]:
array([-0.80503422,  2.41042106, -0.83825661,  0.00688061, -1.24465038,
        0.2182236 , -0.75314227,  1.30308066, -0.05254998, -0.72070997,
       -0.31758709, -0.42418268,  0.15680388,  0.58327575])
In [ ]:
betas_2019 = pd.DataFrame({"Feature":x_2019.columns.tolist(),"Coefficients":reg.coef_})
betas_2019
Out[ ]:
Feature Coefficients
0 length -0.805034
1 popularity 2.410421
2 acousticness -0.838257
3 danceability 0.006881
4 energy -1.244650
5 instrumentalness 0.218224
6 liveness -0.753142
7 loudness 1.303081
8 speechiness -0.052550
9 tempo -0.720710
10 time_signature -0.317587
11 key -0.424183
12 mode 0.156804
13 valence 0.583276

We can now see the relationship between each feature and the variable "Number of Times on Chart" for 2019. Note that the sign of the coefficient indicates either a positive or negative relationship. If the coefficient is positive, that means that as that variable increases, we also tend to see an increase in "Number of Times on Chart". But if the coefficient is negative, we see a decrease in "Number of TImes on Chart" as that variable increases.

Let's retrieve the coefficients for the 2022 data.

In [ ]:
# using sklearn's LinearRegression to determine coefficients for features in 2022

from sklearn.linear_model import LinearRegression

x_2022 = df_billboard_2022_copy[['length', 'popularity', 'acousticness', 'danceability',
                            'energy', 'instrumentalness', 'liveness', 'loudness',
                            'speechiness', 'tempo', 'time_signature', 'key', 'mode', 'valence']]

x_2022_sc = (x_2022 - x_2022.mean()) / x_2022.std()

y_2022 = df_billboard_2022_copy["Number of Times on Chart"]
reg = LinearRegression().fit(x_2022_sc, y_2022)

reg.coef_
Out[ ]:
array([-0.3438397 ,  2.31582618, -0.09930078, -0.07658117, -0.55026726,
       -0.07302853, -0.35291977,  1.63464149, -0.93313997, -0.35339343,
        0.09506755,  0.23979721, -0.26471031, -0.37073683])
In [ ]:
betas_2022 = pd.DataFrame({"Feature":x_2022.columns.tolist(),"Coefficients":reg.coef_})

We can now combine the data into one DataFrame so we can easily see how the coefficients have changed from 2019 to 2022.

In [ ]:
betas = betas_2019.merge(betas_2022, on="Feature")
betas.rename(columns={"Coefficients_x":"Coefficients 2019", "Coefficients_y":"Coefficients 2022"}, inplace=True)

betas["Change from 2019 to 2022"] = betas["Coefficients 2022"] - betas["Coefficients 2019"]
betas
Out[ ]:
Feature Coefficients 2019 Coefficients 2022 Change from 2019 to 2022
0 length -0.805034 -0.343840 0.461195
1 popularity 2.410421 2.315826 -0.094595
2 acousticness -0.838257 -0.099301 0.738956
3 danceability 0.006881 -0.076581 -0.083462
4 energy -1.244650 -0.550267 0.694383
5 instrumentalness 0.218224 -0.073029 -0.291252
6 liveness -0.753142 -0.352920 0.400223
7 loudness 1.303081 1.634641 0.331561
8 speechiness -0.052550 -0.933140 -0.880590
9 tempo -0.720710 -0.353393 0.367317
10 time_signature -0.317587 0.095068 0.412655
11 key -0.424183 0.239797 0.663980
12 mode 0.156804 -0.264710 -0.421514
13 valence 0.583276 -0.370737 -0.954013

The goal of doing this is to understand which features have changed in importance for predicting "Number of Times on Chart". But how do we define significant "change"? For our purposes, we are interested in determining which features' coefficients have changed sign from 2019 to 2022 because this shows a direct change in relationship to "Number of Times on Chart".

Let's now graphically represent the change in coefficients from 2019 to 2022.

In [ ]:
betas.plot(x="Feature", y=["Coefficients 2019", "Coefficients 2022"], ylabel="Importance",
           fontsize=15, kind="bar", figsize=(20,8), title="Change in Coefficient Values Pre- and 'Post'- COVID-19")
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f77c1377550>

The graph makes it clear that the coefficients of the features danceability, instrumentalness, time_signature, key, mode, and valence have changed sign from 2019 to 2022.

Recall that our hypothesis is that people are generally listening to sadder, less positive music after the COVID-19 pandemic. To assess the validity of this hypothesis, let's take a deeper look at how the variables danceability, mode, and valence really changed from 2019 to 2022.

In [ ]:
dance = betas[(betas["Feature"] == "danceability")]
mode = betas[(betas["Feature"] == "mode")]
valence = betas[(betas["Feature"] == "valence")]

important = pd.concat([dance, mode, valence], axis=0)
important.plot(x="Feature", y=["Coefficients 2019", "Coefficients 2022"], ylabel="Importance",
           fontsize=15, kind="bar", figsize=(20,8), title="Change in Coefficient Values Pre- and 'Post'- COVID-19")
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f77bd82d700>

We can clearly see that these variables have changed from having a positive relationship with "Number of Times on Chart" in 2019 to a negative relationship in 2022. Recall the definitions of these variables at the beginning of this notebook. We can clearly see that danceability and valence have a positive connotation when it comes to music––but mode needs a bit more explaining.

As described, mode "indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0." Before COVID-19, we were seeing a positive relationship between mode and "Number of Times on Chart". This means that songs with a major modality were more strongly associated with predicting "Number of Times on Chart". However, after the pandemic we see the opposite relationship. This means that we are seeing more songs with a minor modality predicting "Number of Times on Chart".

Songs with a minor key usually evoke feelings of sadness. Therefore, this change in the coefficient for mode supports our hypothesis. Additionally, danceability and valence changing from having a positive to negative relationship with "Number of Times on Chart" further supports our hypothesis. Overall, our hypothesis is generally supported based off of these variables.

Summary and Conclusions

Throughout this notebook, we have learned key data science concepts for collecting, analyzing, and representing data.

  • We learned how to clean data from the internet and create a model to predict the popularity of a song on Spotify.
  • We gained familiarity with Spotify's API to obtain song information in hopes of creating a predictive model on our own data that could be compared to the previous one. However, we were unable to construct a DataFrame with enough rows for a significant predictive model.
  • We constructed DataFrames out of Billboard Top 100 Chart data that we used for pre- and 'post'- COVID-19 analysis. We gained experience in k-nearest neighbors regression and XGBoost as well as analyzing model errors.
  • We saw significant changes in importance for predicting the number of times a song was on the Billboard Charts in the variables danceability, instrumentalness, time_signature, key, mode, and valence.
  • Overall, our hypothesis that people are listening to sadder, less positive music after the COVID-19 pandemic is generally supported.
In [ ]:
%%shell
jupyter nbconvert --to html /content/Milestone2.ipynb