Sam DeMarinis and Nick Radwin
Link to our webpage: https://samdemarinis.github.io/
The goal of this project is to understand what features of a song are strongly associated with high popularity in order to build an accurate model that predicts a given song's popularity. We will be incorporating datasets found online that have utilized Spotify's API as well as creating our own to see how a prediction for a song's popularity may change depending on the dataset used to build the model. Lastly, we will be utilizing Billboard Top 100 data before and after the COVID-19 pandemic to investigate how the importance of certain features of a song in predicting popularity may have changed from before/after the pandemic.
Where did the dataset we are utilizing come from? How was it obtained and how are the variables defined?
How can we build an accurate model to predict a song's 'popularity' based on features such as 'name', 'album', 'artist', 'release_date', 'length', 'explicit', 'acousticness', 'danceability', 'energy', 'instrumentalness', 'liveness', 'loudness', 'speechiness', 'tempo', 'time_signature', 'key', 'mode', and 'valence'?
What features provided by Spotify's API are the most important in predicting song popularity?
Can we build a predictive model for popularity on Spotify data that we pulled ourselves? If so, how does this model compare to the one that we built on the pre-packaged dataset "Spotify Tracks Dataset"?
Is there a relationship between song popularity on Spotify and song popularity on the Billboard Top 100? Are we able to predict how many weeks a song will appear on the Billboard Top 100 Chart in a year?
How has the importance of the above features changed in predicting a song's popularity pre- and 'post-' COVID-19?
We hypothesize that people are listening to sadder, less positive music after the COVID-19 pandemic. Is this hypothesis supported?
The first dataset that we are working with is the Spotify Tracks Dataset available on Kaggle (https://www.kaggle.com/datasets/maharshipandya/-spotify-tracks-dataset). This dataset was constructed by the Kaggle user @MAHARSHIPANDYA in 2022 using Spotify's API. It contains almost 120,000 rows, each corresponding to a song on Spotify. While the method for selecting the songs is not clearly stated, the large variety of songs suggests a broad, roughly random sample. The columns of the dataset describe features of the songs, ranging from basic identifiers such as track_id and album_name to more musical qualities such as energy and danceability. The full breakdown of what each variable entails can be found under the Column Description of the Spotify Tracks Dataset, linked above.
The variable that we are initially interested in predicting is popularity. But how does one measure things like popularity, energy, or danceability? Spotify Web API describes how these seemingly subjective measures were calculated. Below is a list of the most subjective variables in the dataset and how they were calculated.
popularity: "The value will be between 0 and 100, with 100 being the most popular. The popularity of a track is a value between 0 and 100, with 100 being the most popular. The popularity is calculated by algorithm and is based, in the most part, on the total number of plays the track has had and how recent those plays are."
danceability: "Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable."
energy: "Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale."
speechiness: "Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks."
liveness: "Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live."
valence: "A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry)."
mode: "Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0."
Now that we understand how these variables were quantified, we can properly assess their level of subjectivity. Popularity, the label we will be initially predicting, is based on an algorithm involving a track's plays. This variable is not very subjective. However, variables like danceability and valence still contain some subjectivity. Knowing this and what it is we want to predict, we can begin extracting, transforming, and loading this dataset.
First, we must mount our Google Drive and change the directory to our Drive.
from google.colab import drive
import pandas as pd
import pylab as pl
drive.mount('/content/drive')
%cd /content/drive/My Drive
Reading in the Spotify Tracks Dataset csv file and making a copy:
spotify_df = pd.read_csv("dataset.csv")
spotify_df_copy = spotify_df.copy()
spotify_df_copy.head()
The dataset already seems to be pretty tidy, but there is some more tidying to be done. There is an extra column with no purpose, so we can go ahead and remove it:
spotify_df_copy.drop(columns=["Unnamed: 0"], inplace=True)
We should always check the dtypes for each of our variables. Pandas may misinterpret certain variables as categorical instead of quantitative, or vice versa. Luckily, we can see that Pandas correctly interpreted the dtype for each variable in our DataFrame:
spotify_df_copy.dtypes
We should also check for duplicate and NaN values in our DataFrame:
spotify_df_copy.duplicated().sum()
spotify_df_copy.isnull().sum().sum()
We can now drop the duplicate rows and rows containing NaNs:
spotify_df_copy.drop_duplicates(inplace=True)
spotify_df_copy.dropna(inplace=True)
spotify_df_copy.head()
Our DataFrame is now in great shape for analysis. One important note: although we asked Pandas to drop duplicate rows, some songs still appear more than once. This is because a song listed under more than one genre gets a separate row for each genre, and Pandas does not treat those rows as duplicates since they differ on the track_genre variable. To see this, let's look up all rows that have a specific track_id:
spotify_df_copy[(spotify_df_copy['track_id'] == "2QjOHCTQ1Jl3zawyYOpxh6")]
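If genre were irrelevant to a particular analysis, these multi-genre duplicates could be collapsed to one row per track. A minimal sketch, not applied here because we want to keep the genre information:
# optional: keep only the first row for each track_id (sketch, not used below)
one_row_per_track = spotify_df_copy.drop_duplicates(subset=["track_id"])
len(one_row_per_track)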
Let's investigate this table before we start constructing a model. Since we are interested in analyzing and eventually creating a model to predict the popularity of a song, let's look up songs with a popularity rating greater than 90:
# let's look at songs with high popularity:
spotify_df_copy["popularity"].unique()
popular_songs = spotify_df_copy[(spotify_df_copy["popularity"] >= 90)]["track_id"].unique()
popular_songs
Let's say that we are interested in looking at the energy and danceability of these popular songs, and whether there is a common trend in these variables:
# let's look at the energy of these songs:
energy_list = []
dance_list = []
for i in popular_songs:
    x = spotify_df_copy[(spotify_df_copy['track_id'] == i)]["energy"].iloc[0]
    y = spotify_df_copy[(spotify_df_copy['track_id'] == i)]["danceability"].iloc[0]
    energy_list.append(x)
    dance_list.append(y)
data = {'track_id': popular_songs,
        'Energy': energy_list,
        'Danceability': dance_list}
df_pop_ED = pd.DataFrame(data, index=range(0, 43))
df_pop_ED.head()
df_pop_ED.head()
We can plot these energy and danceability ratings:
import seaborn as sns
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(8, 8))
sns.histplot(data=df_pop_ED, x='Energy', color='green', bins=20, stat='frequency', kde=True).set(title='Energy Rating of Most Popular Songs')
plt.show()
fig, ax = plt.subplots(figsize=(8, 8))
sns.histplot(data=df_pop_ED, x='Danceability', color='green', bins=15, stat='frequency', kde=True).set(title='Danceability Rating of Most Popular Songs')
plt.show()
From the histograms, we can see that the most common energy rating among the most popular songs is around 0.7 out of 1, and the most common danceability is around 0.65 out of 1.
Now, let's take a look at the most common genres of the most popular songs:
popular_songs_df = spotify_df_copy[(spotify_df_copy["popularity"] >= 90)]
fig, ax = plt.subplots(figsize=(12, 6))
plot2 = sns.countplot(data=popular_songs_df, x='track_genre', order = popular_songs_df['track_genre'].value_counts().index, ax=ax).set(title='Genre of Most Popular Songs', xlabel='Genre', ylabel='Frequency')
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, horizontalalignment='right')
plt.show()
We can now see the distribution of genres among the most popular songs. Unsurprisingly, the most common genre of these songs is pop.
From our exploratory data analysis, we can see that popular songs are generally pop songs with an energy rating of about 0.7 and a danceability rating of about 0.65. However, this isn't very helpful if we are producing an R&B song, for example. So, let's build a predictive model that generalizes across genre to predict the popularity rating of a song from its features!
Now that we have looked at some of the summary statistics for this dataset, let's begin constructing a predictive model. We will use k-nearest neighbors regression and determine which value of k results in the lowest cross-validated mean squared error (MSE).
Hyperparameter Tuning
Let's first determine the best k value (number of neighbors) for our model:
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
import numpy as np
from sklearn.model_selection import cross_val_score
vec = DictVectorizer(sparse=False)
scaler = StandardScaler()
features = ['acousticness', 'danceability',
'energy', 'instrumentalness', 'liveness', 'loudness',
'speechiness', 'tempo', 'time_signature', 'key', 'mode', 'valence']
small_df = spotify_df_copy.head(100)
X_dict = small_df[features].to_dict(orient="records")
y = small_df["popularity"]
# calculates estimate of test error based on 10-fold cross validation
def get_cv_error(k):
    model = KNeighborsRegressor(n_neighbors=k, n_jobs=-1)
    pipeline = Pipeline([("vectorizer", vec), ("scaler", scaler), ("fit", model)])
    mse = np.mean(-cross_val_score(
        pipeline, X_dict, y,
        cv=10, scoring="neg_mean_squared_error"
    ))
    return mse
ks = pd.Series(range(1, 51))
ks.index = range(1, 51)
test_errs = ks.apply(get_cv_error)
test_errs.plot.line(figsize=(6,6), color='green')
plt.title("Validation MAE as a Function of K", size=30)
plt.xlabel("k", size=25)
plt.ylabel("Validation MAE", size=25)
# identify the value of k with the lowest cross-validation error
best_k = test_errs.idxmin()
print(best_k)
According to the minimum validation mean squared error (MSE), the best value of k for our predictive model is 5. Now we can build our predictive model!
5-Nearest Neighbor Predictive Model:
# define the training data
X_train = spotify_df_copy[[
"danceability",
"energy",
"loudness",
"speechiness",
"instrumentalness",
"time_signature"
]]
y_train = spotify_df_copy["popularity"]
# standardize the data
scaler = StandardScaler()
scaler.fit(X_train)
X_train_sc = scaler.transform(X_train)
# fit the 5-nearest neighbors model
model = KNeighborsRegressor(n_neighbors=5)
model.fit(X_train_sc, y_train)
# define the test data (Scikit-Learn expects a matrix)
x_new = pd.DataFrame()
x_new["danceability"] = [0.420]
x_new["energy"] = [0.166]
x_new["loudness"] = [-17.235]
x_new["speechiness"] = [0.0763]
x_new["instrumentalness"] = [0.000006]
x_new["time_signature"] = [4]
x_new_sc = scaler.transform(x_new)
# use the model to predict on the test data
model.predict(x_new_sc)
A song with these specific feature values has a true popularity of 55/100. However, our model predicts that this song would have a popularity of about 42/100. Let's take a look at how much error there is in our model on average using cross-validation:
Cross-Validation Error Report
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
import numpy as np
# get the features (in dict format) and the labels
# (do not split into training and validation sets)
features = [
"danceability",
"energy",
"loudness",
"speechiness",
"instrumentalness",
"time_signature"
]
X_dict = spotify_df_copy[features].to_dict(orient="records")
y = spotify_df_copy["popularity"]
# specify the pipeline
vec = DictVectorizer(sparse=False)
scaler = StandardScaler()
model = KNeighborsRegressor(n_neighbors=5)
pipeline = Pipeline([("vectorizer", vec), ("scaler", scaler), ("fit", model)])
scores = cross_val_score(pipeline, X_dict, y,
cv=5, scoring="neg_mean_squared_error")
np.sqrt(np.mean(-scores))
The cross-validated RMSE of our model is about 22 popularity points out of 100. After an ablation study over different combinations of features, we concluded that this error doesn't get much better than 22 points. So a song's audio features do carry predictive signal for popularity, but it is important to keep in mind external factors, such as the historical context around a song's release, that also contribute to a particular song's popularity.
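The ablation study itself is not shown above; below is a sketch of the kind of leave-one-out loop we used. The exact feature subsets are illustrative, not an exact record of our experiments, and it reuses the imports from the cell above.
# sketch of an ablation loop: drop one feature at a time and compare cross-validated RMSE
base_features = ["danceability", "energy", "loudness", "speechiness",
                 "instrumentalness", "time_signature"]
for left_out in base_features:
    subset = [f for f in base_features if f != left_out]
    X_dict_subset = spotify_df_copy[subset].to_dict(orient="records")
    pipeline = Pipeline([("vectorizer", DictVectorizer(sparse=False)),
                         ("scaler", StandardScaler()),
                         ("fit", KNeighborsRegressor(n_neighbors=5))])
    scores = cross_val_score(pipeline, X_dict_subset, spotify_df_copy["popularity"],
                             cv=5, scoring="neg_mean_squared_error")
    print(f"without {left_out}: RMSE = {np.sqrt(np.mean(-scores)):.2f}")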
We have just created a predictive model for song popularity based on the Spotify Tracks Dataset. Now, we are interested in constructing our own dataset to see how varying sources of data describing the same units of observation and features can result in more/less accurate predictive models. In order to build our own dataset of Spotify data, we must use Spotify's API.
A popular Python library for the Spotify Web API is called Spotipy. We will be utilizing this library because it is easy to understand and allows us to demonstrate commands in this notebook.
# installing spotipy
!pip install urllib3
!pip install requests
!pip install spotipy
import pandas as pd
import numpy as np
import spotipy
import spotipy.util as util
from spotipy.oauth2 import SpotifyClientCredentials
import spotipy.oauth2 as oauth2
from spotipy.oauth2 import SpotifyOAuth
In order to make calls to Spotify's API using Spotipy, we must register a Spotify account as a Developer. We will then create an Application, obtain our Client ID and Secret, and enter in a specific Redirect URI.
# Sam's Spotify credentials
CLIENT_ID = "92f8907c269f4703917254ec2454f4c7"
CLIENT_SECRET = "fea4962791dd466d840feba0e11a11df"
username = "sam7303"
market = ['US']
redirect_uri = 'http://127.0.0.1:9090'
scope="user-library-read"
Now that we have our credentials stored into variables, we can create a Spotipy object that authorizes us as a valid Developer:
# creating spotipy object
sp = spotipy.Spotify(auth_manager=SpotifyOAuth(client_id=CLIENT_ID,
client_secret=CLIENT_SECRET,
redirect_uri=redirect_uri,
scope="user-library-read", open_browser=False))
We are interested in understanding what features may be indicative of a song's popularity at this point in time. Every few days, Spotify updates the playlist "Today's Top Hits" to contain 50 of the most-played songs on Spotify, which is a good indication that a song is popular. We will be analyzing the features of the songs in the playlist over the course of a few weeks so that we can see the breakdown of the features.
We can obtain the playlist ID by looking at the end of its URL. Next, we can access each of the tracks in the playlist with the command below:
(Note: The code below will fetch the most recent songs in the "Today's Top Hits" playlist. The returned song data depends on when the code is run.)
# "Today's Top Hits" playlist -- 50 of the most popular songs
PLAYLIST_ID = '37i9dQZF1DXcBWIGoYBM5M'
playlist_tracks = sp.user_playlist_tracks(CLIENT_ID, PLAYLIST_ID, fields='items,uri,name,id,total')
Now, we can find the song ID for each song in this playlist. This will allow us to then access the desired features for each song and create a dataset.
songIDs = []
for i in range(len(playlist_tracks['items'])):
    songIDs.append(playlist_tracks['items'][i]['track']['id'])
print(songIDs)
len(songIDs)
We can utilize the track() function provided by Spotipy to access important information and features about each song. For example, the following code accesses a few of these features for a specific song:
track_info = sp.track("0V3wPSX9ygBnCm8psDIegu")
track_features = sp.audio_features('0V3wPSX9ygBnCm8psDIegu')
# some important info
name = track_info['name']
album = track_info['album']['name']
# some important features
acousticness = track_features[0]['acousticness']
danceability = track_features[0]['danceability']
print(f'Song name: {name}')
print(f'Album: {album}')
print(f'Acousticness: {acousticness}')
print(f'Danceability: {danceability}')
Now, we will follow these steps for each of the songs in the "Today's Top Hits" playlist. We will define a function that will give a list of track features given a song ID, and then call that function on each song ID in the songIDs list so that we can construct a DataFrame.
def getTrackFeatures(songID):
    track_info = sp.track(songID)
    track_features = sp.audio_features(songID)
    # basic identifiers
    name = track_info['name']
    album = track_info['album']['name']
    artist = track_info['album']['artists'][0]['name']
    release_date = track_info['album']['release_date']
    length = track_info['duration_ms']
    popularity = track_info['popularity']
    explicit = track_info['explicit']
    # genre = track_info['genre']
    # currently, Spotify does not list the specific genre a track falls in; it only offers the genres associated with the artist
    # audio features
    acousticness = track_features[0]['acousticness']
    danceability = track_features[0]['danceability']
    energy = track_features[0]['energy']
    instrumentalness = track_features[0]['instrumentalness']
    liveness = track_features[0]['liveness']
    loudness = track_features[0]['loudness']
    speechiness = track_features[0]['speechiness']
    tempo = track_features[0]['tempo']
    time_signature = track_features[0]['time_signature']
    key = track_features[0]["key"]
    mode = track_features[0]["mode"]
    valence = track_features[0]["valence"]
    track_features_list = [name, album, artist, release_date, length, popularity, explicit, acousticness, danceability, energy,
                           instrumentalness, liveness, loudness, speechiness, tempo, time_signature, key, mode, valence]
    return track_features_list
Inspiration for this function was found here: https://levelup.gitconnected.com/extracting-and-analysing-spotify-tracks-with-python-d1466fc1dfee
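As the comment inside getTrackFeatures() notes, Spotify exposes genres at the artist level rather than the track level. If artist genres were wanted as an extra feature, a sketch along these lines should work (not used in the DataFrames below):
# sketch: fetch artist-level genres for the example track from earlier
track_info = sp.track("0V3wPSX9ygBnCm8psDIegu")
artist_id = track_info['artists'][0]['id']
print(sp.artist(artist_id)['genres'])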
# XXX refers to the current month, day, and year in which you run the code
df_top_hits_XXX = pd.DataFrame(columns=['name', 'album', 'artist', 'release_date', 'length', 'popularity', 'explicit', 'acousticness',
'danceability', 'energy', 'instrumentalness', 'liveness', 'loudness', 'speechiness',
'tempo', 'time_signature', 'key', 'mode', 'valence'])
for id in songIDs:
    track_list = getTrackFeatures(id)
    df = pd.DataFrame(data=[track_list], columns=['name', 'album', 'artist', 'release_date', 'length', 'popularity', 'explicit', 'acousticness',
                                                  'danceability', 'energy', 'instrumentalness', 'liveness', 'loudness', 'speechiness',
                                                  'tempo', 'time_signature', 'key', 'mode', 'valence'])
    df_top_hits_XXX = pd.concat([df_top_hits_XXX, df], ignore_index=True)
df_top_hits_XXX.head()
In order to keep the data used in the beginning of this tutorial constant, let's read in data previously obtained by performing these same steps on November 7, 2022:
df_top_hits_110722 = pd.read_csv('df_top_hits_110722.csv')
df_top_hits_110722.drop(columns=["Unnamed: 0"], inplace=True)
df_top_hits_110722.head()
Saving the DataFrame we just created; exporting DataFrame as a csv file:
df_top_hits_XXX.to_csv('df_top_hits_121122.csv')
Let's analyze this DataFrame containing the 50 most popular songs on Spotify as of 11/07/22. Unsurprisingly, most of these songs have a high popularity rating:
pop_median = df_top_hits_110722["popularity"].median()
pop_mean = df_top_hits_110722["popularity"].mean()
pop_min = df_top_hits_110722["popularity"].min()
pop_max = df_top_hits_110722["popularity"].max()
print(f"Median popularity: {pop_median}")
print(f"Mean popularity: {pop_mean}")
print(f"Highest popularity: {pop_max}")
print(f"Lowest popularity: {pop_min}")
In order to understand the 'typical' values for each feature of these top songs, let's do some summary statistics for all variables. This will look a bit different for categorical versus quantitative variables.
df_top_hits_110722['artist'].value_counts()[:10]
Looks like Taylor Swift, Lizzo, Harry Styles, and Doja Cat are tied for most songs by a single artist in the Top Hits Playlist for 11/07/22!
Let's calculate the mean for each of our variables to see the typical values among these most popular songs. Conveniently, Pandas lets us call mean() on the entire DataFrame, and it applies the calculation to each column.
df_typicals_110722 = df_top_hits_110722.drop(columns=["name", "album", "artist", "release_date", "explicit"]).mean()
df_typicals_110722
You may have noticed that it doesn't make much sense to find the mean of the explicit variable, because it is a categorical variable rather than a quantitative one. To find out whether there are more explicit or non-explicit songs in the playlist, we can use value_counts() on this variable.
df_top_hits_110722["explicit"].value_counts().idxmax()
# False -- there are more non-explicit than explicit songs!
The goal is to collect data from the "Today's Top Hits" playlist over the coming weeks in order to construct a single DataFrame containing more rows. This way, we can build a predictive model for popularity that can be compared to the one built on the Spotify Tracks Dataset. We will then be able to make conclusions about the importance of certain features in predicting popularity, what value of k minimizes mean absolute error in our KNeighborsRegressor model, and more.
After weeks of collecting data, our hope was to construct a DataFrame with at least a couple hundred rows, enough to build a meaningful predictive model for song popularity that could be compared to the previous model. However, a month's worth of data from the "Today's Top Hits" playlist was not enough to accomplish our goal:
df_top_hits_110722 = pd.read_csv("df_top_hits_110722.csv")
df_top_hits_111122 = pd.read_csv("df_top_hits_111122.csv")
df_top_hits_111422 = pd.read_csv("df_top_hits_111422.csv")
df_top_hits_112822 = pd.read_csv("df_top_hits_112822.csv")
df_top_hits_120422 = pd.read_csv("df_top_hits_120422.csv")
df_top_hits_121122 = pd.read_csv("df_top_hits_121122.csv")
df_all_weeks = pd.concat([df_top_hits_110722,df_top_hits_111122,df_top_hits_111422,df_top_hits_112822,df_top_hits_120422,df_top_hits_121122]).drop_duplicates().reset_index(drop=True)
df_all_weeks.drop(columns="Unnamed: 0", inplace=True)
df_all_weeks.drop_duplicates(subset=['name']).shape
As seen above, we were only able to construct a DataFrame of 64 songs, because many songs stayed in this playlist throughout the entire month-long collection period. As a result, we cannot build a meaningful predictive model on this data alone, but the exercise gave us great practice with collecting data using Spotify's API. If the above process were repeated over several months or years, a usable model could be built.
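For anyone continuing the collection, one lightweight pattern is to save each pull under a dated filename and concatenate every snapshot later. A sketch, assuming the df_top_hits_<MMDDYY>.csv naming used above (df_all_snapshots is a hypothetical name):
from datetime import date
import glob
# save today's pull under a dated filename, matching the naming convention above
today = date.today().strftime("%m%d%y")
df_top_hits_XXX.to_csv(f"df_top_hits_{today}.csv")
# later, combine every snapshot collected so far into one DataFrame
snapshot_files = sorted(glob.glob("df_top_hits_*.csv"))
df_all_snapshots = pd.concat([pd.read_csv(f) for f in snapshot_files]).drop_duplicates().reset_index(drop=True)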
Regardless, let's do some EDA on the data that we spent a month collecting. Let's start by analyzing the artists who appeared the most in the "Today's Top Hits" playlist throughout this time period.
artists = df_all_weeks["artist"].value_counts()[0:10].rename_axis('artist').reset_index(name='frequency')
fig = plt.figure(figsize = (20, 8))
plt.bar(artists["artist"], artists["frequency"], color ='green', width = 0.4)
plt.xticks(size=15, rotation=30)
plt.yticks(size=15)
plt.xlabel("Artist", size=13)
plt.ylabel("Count", size=13)
plt.title("Most Persistent Artists in \"Today's Top Hits\" Playlist (Nov-Dec 2022)", size=25)
We can now see the distribution of artists with the most song entries in the playlist during this period.
Let's also take a look at the songs that stayed in the playlist the longest during this period:
songs = df_all_weeks["name"].value_counts()[0:5].rename_axis('name').reset_index(name='frequency')
fig = plt.figure(figsize = (20, 8))
plt.bar(songs["name"], songs["frequency"], color ='red', width = 0.4)
plt.xticks(size=15)
plt.yticks(size=15)
plt.xlabel("Song", size=13)
plt.ylabel("Count", size=13)
plt.title("Most Persistent Songs in \"Today's Top Hits\" Playlist (Nov-Dec 2022)", size=25)
These are the five songs that stayed in the playlist the longest during this period. Since we pulled data 6 times, this means that "Shirt" and "Made You Look" were in the "Today's Top Hits" playlist each time we pulled new data.
Now that we have learned how to pull data from Spotify using their API, translate that data into a pandas DataFrame, perform summary statistics, and create predictive models using k-nearest neighbors regression, we will use these skills to answer the final two questions listed at the start of this notebook:
How has the importance of the above features changed in predicting a song's popularity pre and 'post' COVID-19?
We hypothesize that people are listening to sadder, less positive music after the COVID-19 pandemic. Is this hypothesis supported?
In order to answer these questions, we will be looking at Billboard Top 100 data pre- and post-pandemic. We will use Spotify's API to access the same song features that we have been working with so far, but this time for songs on the Top 100 list. We will create two DataFrames: one containing songs on the Billboard charts before the pandemic, and one containing songs on the charts today ('after' the pandemic). We will then build two predictive models, one on each DataFrame, in order to assess the importance of each feature in predicting chart success. This will be done by determining whether or not the p-value for each feature's weight is statistically significant.
We can load up a csv file of all Billboard Top 100 Chart data since the year 1956 with the following command:
url = "https://raw.githubusercontent.com/HipsterVizNinja/random-data/main/Music/hot-100/Hot%20100.csv"
billboard_df = pd.read_csv(url)
We will first downselect this data to only contain songs from the year 2019. This will act as our 'pre-pandemic' dataset. One thing to keep in mind is that this dataset, like the Spotify Tracks Dataset, contains multiple entries of the same song, one for each week it spent in the Top 100. In order to have each song appear only once, we will create a column that lists the number of times each song was entered into the dataset (i.e., how many weeks it was in the Top 100 during 2019).
# downselecting for 2019 songs
billboard_df_2019 = billboard_df[(billboard_df["chart_date"].str.startswith("2019"))]
billboard_df_2019 = billboard_df_2019.sort_values(by=['song'])
# need to sort the df so it is in the order of the value_counts series
# finding number of times each song appears on the charts in 2019
longevity_series = billboard_df_2019["song"].value_counts()
# making every song only appear once
billboard_df_2019 = billboard_df_2019.drop_duplicates(subset=['song'])
# list of each song and how many times it was on the charts in 2019
longevity_series
Now that we have stored these numbers for each of the songs in this dataset into a variable longevity_series, we can sort the DataFrame to match up with this ordering:
from pandas.api.types import CategoricalDtype
# sorting the dataframe to match the order of the longevity_series
cat_song_order = CategoricalDtype(
longevity_series.index,
ordered=True
)
billboard_df_2019['song'] = billboard_df_2019['song'].astype(cat_song_order)
billboard_df_2019 = billboard_df_2019.sort_values('song')
# creating variables of the desired order for songs and artists
desired_song_order_list = billboard_df_2019["song"].values
desired_artist_order_list = billboard_df_2019["performer"].values
We have also stored the correct ordering of the songs into a variable called desired_song_order_list. Not only does this give us the desired order for the songs in our DataFrame, it can be easily used with Spotify's API to get track information for each song in the list.
billboard_df_2019.head() # 2019 billboard data, now in order of most to least # of times on Chart in 2019
We can see with the command below that we hope to obtain information on 621 songs.
# we have 621 songs we'd like to get spotify api info on
len(desired_song_order_list)
Let's iterate through each song in this list and attempt to fetch song IDs.
# using spotify's api to get this info
track_id_list = []
for track in desired_song_order_list:
    try:
        track_json = sp.search(q='track:' + track, type='track')
        track_id = track_json['tracks']['items'][0]['id']
        track_id_list.append(track_id)
    except IndexError:
        track_id = "N/A"
        print(track)
We can see that we were unable to fetch the song ID for two songs: "Can't Leave Without It" and "Don't Check On Me." We can remove these from the longevity_series variable so that we can eventually add this Series to our DataFrame, and it will accurately indicate how many times each song appeared on the Charts in 2019.
(If we did not remove these songs from the Series and simply added the Series to the DataFrame we are about to create, the values would not line up correctly with each song because we will only be able to make a DataFrame of 619 songs).
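One way to reduce such misses, which we did not apply here, is to add the performer to the search query, since Spotify's search supports an artist: filter. A sketch using the artist order list we saved earlier (match quality would still depend on how Billboard formats performer names):
# sketch: query on both the track name and the performer to reduce missed matches
for track, artist in zip(desired_song_order_list, desired_artist_order_list):
    results = sp.search(q=f"track:{track} artist:{artist}", type='track', limit=1)
    # results['tracks']['items'] can then be checked for an ID as in the loop above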
Let's drop these two songs from longevity_series, checking the length of the series before and after to make sure we actually got rid of them:
print(len(longevity_series))
longevity_series = longevity_series.drop(labels=["Can't Leave Without It", "Don't Check On Me"])
print(len(longevity_series))
We are now ready to utilize Spotify's API to make a DataFrame called df_billboard_2019. This DataFrame will contain each song that appeared on the Billboard Top 100 in 2019 and the exact same features we have been looking at since the beginning of this tutorial. We can use the getTrackFeatures() function that we previously made to easily fetch all of this track information.
# creating a dataframe for the spotify api info on each of these songIDs
df_billboard_2019 = pd.DataFrame(columns=['name', 'album', 'artist', 'release_date', 'length', 'popularity', 'explicit', 'acousticness',
'danceability', 'energy', 'instrumentalness', 'liveness', 'loudness', 'speechiness',
'tempo', 'time_signature', 'key', 'mode', 'valence'])
for id in track_id_list:
    track_list = getTrackFeatures(id)
    df = pd.DataFrame(data=[track_list], columns=['name', 'album', 'artist', 'release_date', 'length', 'popularity', 'explicit', 'acousticness',
                                                  'danceability', 'energy', 'instrumentalness', 'liveness', 'loudness', 'speechiness',
                                                  'tempo', 'time_signature', 'key', 'mode', 'valence'])
    df_billboard_2019 = pd.concat([df_billboard_2019, df], ignore_index=True)
df_billboard_2019.head()
Let's drop any rows with NaNs and check that this DataFrame has the expected number of rows:
df_billboard_2019 = df_billboard_2019.dropna()
len(df_billboard_2019)
Lastly, let's add a column "Number of Times on Chart" and set the values to be those of the longevity_series Series. Since we previously dropped the two songs that we were unable to find song IDs for, this Series contains the right number of elements and can be added directly as a column.
df_billboard_2019["Number of Times on Chart"] = longevity_series.values
df_billboard_2019.head()
import seaborn as sns
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(8, 8))
sns.histplot(data=df_billboard_2019["Number of Times on Chart"], ax=ax, color='green')
plt.title("2019 Billboard Chart Longevity")
plt.show()
We can see from this histogram that many songs in 2019 spent fewer than 2 weeks on the Billboard Top 100, while only a few remained on the chart for more than 25 weeks.
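We can back this reading up with a quick calculation on the column we just added:
# share of 2019 chart songs with fewer than 2 weeks on the chart, and the count above 25 weeks
short_lived = (df_billboard_2019["Number of Times on Chart"] < 2).mean()
long_lived = (df_billboard_2019["Number of Times on Chart"] > 25).sum()
print(f"Fraction on the chart for fewer than 2 weeks: {short_lived:.2f}")
print(f"Songs on the chart for more than 25 weeks: {long_lived}")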
Our DataFrame looks great. Now, let's save it as a csv file:
# saving this df:
df_billboard_2019.to_csv('df_billboard_2019.csv')
To avoid having to rerun the above code (as it takes a long time to run) we can just read in this DataFrame while working on this project:
df_billboard_2019 = pd.read_csv("df_billboard_2019.csv")
Before we make a predictive model for song popularity using this dataset, let's first construct our "post-pandemic" dataset. We will follow the same steps as before, except we will initially downselect our dataset to only contain songs from 2022.
# downselecting for 2022 songs
billboard_df_2022 = billboard_df[(billboard_df["chart_date"].str.startswith("2022"))]
billboard_df_2022 = billboard_df_2022.sort_values(by=['song'])
# need to sort the df so it is in the order of the value_counts series
# finding number of times each song appears on the charts in 2022
longevity_series_22 = billboard_df_2022["song"].value_counts()
# making every song only appear once
billboard_df_2022 = billboard_df_2022.drop_duplicates(subset=['song'])
# list of each song and how many times it was on the charts in 2022
longevity_series_22
from pandas.api.types import CategoricalDtype
# sorting the dataframe to match the order of the longevity_series
cat_song_order = CategoricalDtype(
longevity_series_22.index,
ordered=True
)
billboard_df_2022['song'] = billboard_df_2022['song'].astype(cat_song_order)
billboard_df_2022 = billboard_df_2022.sort_values('song')
# creating variables of the desired order for songs and artists
desired_song_order_list_22 = billboard_df_2022["song"].values
desired_artist_order_list_22 = billboard_df_2022["performer"].values
billboard_df_2022.head() # 2022 billboard data, now in order of most to least # of times on Chart in 2022
len(desired_song_order_list_22)
# using spotify's api to get this info
track_id_22_list = []
for track in desired_song_order_list_22:
    try:
        track_json = sp.search(q='track:' + track, type='track')
        track_id = track_json['tracks']['items'][0]['id']
        track_id_22_list.append(track_id)
    except IndexError:
        track_id = "N/A"
        print(track)
print(len(longevity_series_22))
longevity_series_22 = longevity_series_22.drop(labels=["Don't Think Jesus", "Christmas Tree Farm (Old Timey Version)", "I'd Do Anything To Make You Smile", "Oklahoma Smoke Show"])
print(len(longevity_series_22))
# creating a dataframe for the spotify api info on each of these songIDs
df_billboard_2022 = pd.DataFrame(columns=['name', 'album', 'artist', 'release_date', 'length', 'popularity', 'explicit', 'acousticness',
'danceability', 'energy', 'instrumentalness', 'liveness', 'loudness', 'speechiness',
'tempo', 'time_signature', 'key', 'mode', 'valence'])
for id in track_id_22_list:
    track_list = getTrackFeatures(id)
    df = pd.DataFrame(data=[track_list], columns=['name', 'album', 'artist', 'release_date', 'length', 'popularity', 'explicit', 'acousticness',
                                                  'danceability', 'energy', 'instrumentalness', 'liveness', 'loudness', 'speechiness',
                                                  'tempo', 'time_signature', 'key', 'mode', 'valence'])
    df_billboard_2022 = pd.concat([df_billboard_2022, df], ignore_index=True)
df_billboard_2022.head()
df_billboard_2022 = df_billboard_2022.dropna()
len(df_billboard_2022)
df_billboard_2022["Number of Times on Chart"] = longevity_series_22.values
df_billboard_2022.head()
# saving this df:
df_billboard_2022.to_csv('df_billboard_2022.csv')
We can now just read in this DataFrame while working on this project:
df_billboard_2022 = pd.read_csv("df_billboard_2022.csv")
df_billboard_2022_copy = df_billboard_2022.copy()
We are going to first construct a model that will predict the number of times a song appeared on the Billboard Top 100 Chart in 2019 using the DataFrame df_billboard_2019. We will need to determine which features are the most important in predicting this label in order to get an accurate model. Once we know these features, we can best predict the number of times a song appeared on the Charts. We will repeat this whole process for the DataFrame df_billboard_2022, and see how the p-values have changed before and after the pandemic.
Let's start with the pre-pandemic data. We will train k-nearest neighbors regression models to predict the number of times a song was on the Charts in 2019 for different values of k. We will validate our model using 10-fold cross validation, and calculate the training and validation MAE of each model to determine the best value for k.
There are many variables in our datasets, some of which will not be very helpful in constructing a predictive model. We must investigate which variables will act as the best features for predicting our label of "Number of Times on Chart". To do this, let's use a Python library known as statsmodels to calculate each quantitative variable's p-value.
import statsmodels.api as sm
# making a copy of dataframe
df_billboard_2019_copy = df_billboard_2019.copy()
potential_features = ['length', 'popularity', 'acousticness', 'danceability',
'energy', 'instrumentalness', 'liveness', 'loudness',
'speechiness', 'tempo', 'time_signature', 'key', 'mode', 'valence']
# define predictor and response variables
y = df_billboard_2019_copy["Number of Times on Chart"]
x = df_billboard_2019_copy[potential_features]
# add constant to predictor variables
x = sm.add_constant(x)
# fit linear regression model
model = sm.OLS(y, x).fit()
p_vals_2019 = []
for feature in potential_features:
    p_val = model.pvalues[feature]  # index by name so the added constant term is skipped
    p_vals_2019.append(p_val)
    print(f"The p-value for {feature} is {p_val}")
A p-value below 0.05 is conventionally considered statistically significant. Let's iterate through the p_vals_2019 list we just created and extract only the features with p-values less than 0.05.
desired_features_2019 = []
for feature, p_val in zip(potential_features, p_vals_2019):
    if p_val < 0.05:
        desired_features_2019.append(feature)
        print(f"{feature}: {p_val}")
Now that we know which variables have the most statistically significant p-values, we can use these as features for our model to predict "Number of Times on Chart".
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_validate
import matplotlib.pyplot as plt
from sklearn.model_selection import cross_val_score
# get the features (in dict format) and the labels
# (do not split into training and validation sets)
x_dict = df_billboard_2019_copy[desired_features_2019].to_dict(orient="records")
y = df_billboard_2019_copy["Number of Times on Chart"]
# specify the pipeline
vec = DictVectorizer(sparse=False)
scaler = StandardScaler()
# options for k
k_opts = [i for i in range(1, 50, 1)]
# empty list of validation errors
validation_maes = []
# empty list of training errors
training_maes = []
# creating each model, fitting to training set, predicting y_val, calculating MAE, and appending to list
for k in k_opts:
    model = KNeighborsRegressor(n_neighbors=k)
    pipeline = Pipeline([("vectorizer", vec), ("scaler", scaler), ("fit", model)])
    scores = cross_validate(pipeline, x_dict, y, cv=10, scoring="neg_mean_absolute_error", return_train_score=True)
    average_validation_mae = -scores["test_score"].mean()
    validation_maes.append(average_validation_mae)
    average_training_mae = -scores["train_score"].mean()
    training_maes.append(average_training_mae)
# plotting
fig, ax = plt.subplots(1, 1, figsize=(8, 8))
ax.plot(k_opts, validation_maes, marker="o",linestyle="-", markersize=5, color="red", markeredgecolor="white", alpha=0.8)
ax.plot(k_opts, training_maes, marker="o",linestyle="-", markersize=5, color="blue", markeredgecolor="white", alpha=0.8)
ax.set_xlabel('k', size=25)
ax.set_ylabel('Mean Absolute Error (MAE)', size=25)
ax.set_title('Training & Validation MAE of KNN Models', size=30)
ax.legend(['Validation MAE', 'Training MAE'])
We can see from the graph that the training MAE is smaller than the validation MAE. This makes sense because it is harder for our model to predict data it has not seen before than data it has already seen. As the value of k increases, the training MAE hovers around 6, while the validation MAE hovers around 8. This means that our model is typically around 6 weeks off when predicting the number of weeks a song has been on the Top 100 Charts for data it has already seen, and typically around 8 weeks off for data it has not yet seen.
Now, let's determine the best value for k (i.e., the value of k that minimizes the validation MAE):
vec = DictVectorizer(sparse=False)
scaler = StandardScaler()
x_dict = df_billboard_2019_copy[desired_features_2019].to_dict(orient="records")
y = df_billboard_2019_copy["Number of Times on Chart"]
def get_cv_error(k):
    model = KNeighborsRegressor(n_neighbors=k)
    pipeline = Pipeline([("vectorizer", vec), ("scaler", scaler), ("fit", model)])
    mae = np.mean(-cross_val_score(
        pipeline, x_dict, y,
        cv=10, scoring="neg_mean_absolute_error"
    ))
    return mae
from matplotlib import pyplot as plt
ks = pd.Series(range(1, 51))
ks.index = range(1, 51)
test_errs = ks.apply(get_cv_error)
test_errs.plot.line(figsize=(8,8), color='green')
plt.title("Validation MAE as a Function of K", size=30)
plt.xlabel("k", size=25)
plt.ylabel("Validation MAE", size=25)
print(f"The best value for k is {test_errs.idxmin()}, where the Validation MAE is {test_errs[test_errs.idxmin()]}.")
On average, our model is around 7 and a half weeks off when predicting the number of times a song was on the charts in 2019. Now, let's create a predictive model on the 2022 DataFrame. We will be utilizing the popular machine learning algorithm XGBoost to create this model as opposed to repeating KNN.
import xgboost as xgb
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
x = df_billboard_2022_copy[potential_features]
y = df_billboard_2022_copy["Number of Times on Chart"]
data_dmatrix = xgb.DMatrix(data=x,label=y)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=123)
# 'reg:squarederror' is the current name for the squared-error objective (formerly 'reg:linear')
xg_reg = xgb.XGBRegressor(objective='reg:squarederror', colsample_bytree=0.7, learning_rate=0.1,
                          max_depth=5, alpha=10, n_estimators=25)
xg_reg.fit(x_train,y_train)
preds = xg_reg.predict(x_test)
rmse = np.sqrt(mean_squared_error(y_test, preds))
print("RMSE: %f" % (rmse))
With a bit of hyperparameter tuning, we were able to get an RMSE of around 7, meaning our model is roughly 7 weeks off, on average, when predicting the number of times a song was on the Billboard charts in 2022.
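The tuning itself was informal; below is a sketch of the kind of small grid search one could run over two of the hyperparameters (the candidate values are illustrative, not an exact record of what we tried):
# sketch of a small grid search over learning_rate and max_depth
best_params, best_rmse = None, float("inf")
for lr in [0.05, 0.1, 0.2]:
    for depth in [3, 5, 7]:
        candidate = xgb.XGBRegressor(objective='reg:squarederror', colsample_bytree=0.7,
                                     learning_rate=lr, max_depth=depth, alpha=10,
                                     n_estimators=25)
        candidate.fit(x_train, y_train)
        candidate_rmse = np.sqrt(mean_squared_error(y_test, candidate.predict(x_test)))
        if candidate_rmse < best_rmse:
            best_params, best_rmse = (lr, depth), candidate_rmse
print(f"Best (learning_rate, max_depth): {best_params} with test RMSE {best_rmse:.2f}")
For a more stable error estimate than a single train/test split, we can also cross-validate with xgb.cv: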
params = {"objective":"reg:linear",'colsample_bytree': 0.3,'learning_rate': 0.1,
'max_depth': 5, 'alpha': 10}
cv_results = xgb.cv(dtrain=data_dmatrix, params=params, nfold=3,
num_boost_round=50,early_stopping_rounds=10,metrics="rmse", as_pandas=True, seed=123)
print((cv_results["test-rmse-mean"]).tail(1))
Through 3-fold cross validation, we obtained an RMSE of 7.556 weeks. This error is very similar to the error of the model that we built on pre-COVID data using K-Nearest Neighbors Regression.
Now that we have built predictive models on the pre- and 'post'-COVID-19 datasets, let's turn our analysis to the change in beta coefficients. In order to analyze how the importance of each variable has changed, we need to fit linear regression models to our 2019 and 2022 data individually. We will do so using scikit-learn, which allows us to retrieve the coefficient of each variable with respect to the dependent variable "Number of Times on Chart".
Let's start by obtaining the coefficients for the 2019 data.
# using sklearn's LinearRegression to determine coefficients for features in 2019
from sklearn.linear_model import LinearRegression
x_2019 = df_billboard_2019_copy[['length', 'popularity', 'acousticness', 'danceability',
'energy', 'instrumentalness', 'liveness', 'loudness',
'speechiness', 'tempo', 'time_signature', 'key', 'mode', 'valence']]
x_2019_sc = (x_2019 - x_2019.mean()) / x_2019.std()
y_2019 = df_billboard_2019_copy["Number of Times on Chart"]
reg = LinearRegression().fit(x_2019_sc, y_2019)
reg.predict(x_2019_sc)  # predict on the standardized features the model was fit on
reg.coef_
betas_2019 = pd.DataFrame({"Feature":x_2019.columns.tolist(),"Coefficients":reg.coef_})
betas_2019
We can now see the relationship between each feature and the variable "Number of Times on Chart" for 2019. Note that the sign of the coefficient indicates either a positive or negative relationship: if the coefficient is positive, then as that variable increases we tend to see an increase in "Number of Times on Chart"; if it is negative, we see a decrease in "Number of Times on Chart" as that variable increases.
Let's retrieve the coefficients for the 2022 data.
# using sklearn's LinearRegression to determine coefficients for features in 2022
from sklearn.linear_model import LinearRegression
x_2022 = df_billboard_2022_copy[['length', 'popularity', 'acousticness', 'danceability',
'energy', 'instrumentalness', 'liveness', 'loudness',
'speechiness', 'tempo', 'time_signature', 'key', 'mode', 'valence']]
x_2022_sc = (x_2022 - x_2022.mean()) / x_2022.std()
y_2022 = df_billboard_2022_copy["Number of Times on Chart"]
reg = LinearRegression().fit(x_2022_sc, y_2022)
reg.coef_
betas_2022 = pd.DataFrame({"Feature":x_2022.columns.tolist(),"Coefficients":reg.coef_})
We can now combine the data into one DataFrame so we can easily see how the coefficients have changed from 2019 to 2022.
betas = betas_2019.merge(betas_2022, on="Feature")
betas.rename(columns={"Coefficients_x":"Coefficients 2019", "Coefficients_y":"Coefficients 2022"}, inplace=True)
betas["Change from 2019 to 2022"] = betas["Coefficients 2022"] - betas["Coefficients 2019"]
betas
The goal of doing this is to understand which features have changed in importance for predicting "Number of Times on Chart". But how do we define significant "change"? For our purposes, we are interested in determining which features' coefficients have changed sign from 2019 to 2022 because this shows a direct change in relationship to "Number of Times on Chart".
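Rather than eyeballing the table, we can pull the sign changes out directly with a quick check on the betas DataFrame we just built:
# features whose coefficient flipped sign between 2019 and 2022
flipped = betas[np.sign(betas["Coefficients 2019"]) != np.sign(betas["Coefficients 2022"])]
flipped["Feature"].tolist()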
Let's now graphically represent the change in coefficients from 2019 to 2022.
betas.plot(x="Feature", y=["Coefficients 2019", "Coefficients 2022"], ylabel="Importance",
fontsize=15, kind="bar", figsize=(20,8), title="Change in Coefficient Values Pre- and 'Post'- COVID-19")
The graph makes it clear that the coefficients of the features danceability, instrumentalness, time_signature, key, mode, and valence have changed sign from 2019 to 2022.
Recall that our hypothesis is that people are generally listening to sadder, less positive music after the COVID-19 pandemic. To assess the validity of this hypothesis, let's take a deeper look at how the variables danceability, mode, and valence really changed from 2019 to 2022.
dance = betas[(betas["Feature"] == "danceability")]
mode = betas[(betas["Feature"] == "mode")]
valence = betas[(betas["Feature"] == "valence")]
important = pd.concat([dance, mode, valence], axis=0)
important.plot(x="Feature", y=["Coefficients 2019", "Coefficients 2022"], ylabel="Importance",
fontsize=15, kind="bar", figsize=(20,8), title="Change in Coefficient Values Pre- and 'Post'- COVID-19")
We can clearly see that these variables have changed from having a positive relationship with "Number of Times on Chart" in 2019 to a negative relationship in 2022. Recall the definitions of these variables from the beginning of this notebook: danceability and valence carry a positive connotation in music, but mode needs a bit more explaining.
As described, mode "indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0." Before COVID-19, mode had a positive relationship with "Number of Times on Chart", meaning songs in a major key were associated with more weeks on the chart. After the pandemic, the relationship flipped: songs in a minor key are now associated with more weeks on the chart.
Songs in a minor key usually evoke feelings of sadness, so this change in the coefficient for mode supports our hypothesis. Additionally, danceability and valence flipping from a positive to a negative relationship with "Number of Times on Chart" supports it further. Overall, our hypothesis is generally supported based on these variables.
Throughout this notebook, we have learned key data science concepts for collecting, analyzing, and representing data.
%%shell
jupyter nbconvert --to html /content/Milestone2.ipynb