Data Collection

In this notebook will collect all of the data required for this project including:

Top ten singles by year from Wikipedia
Matching song entities using the Spotify API with the Spotipy package

import pandas as pd
import numpy as np
import requests

from tqdm import tqdm

Collect Top Ten Singles by Year

The top ten singles are available from 1958 to 2022 using a predictable URL pattern.

urls = []
for year in range(1958, 2023):
    urls.append(f"https://en.wikipedia.org/wiki/List_of_Billboard_Hot_100_top-ten_singles_in_{year}")

urls[20]

'https://en.wikipedia.org/wiki/List_of_Billboard_Hot_100_top-ten_singles_in_1978'

rows = []
for url, year in zip(urls, range(1958, 2023)):
    print(year)
    dfs = pd.read_html(url)

    row_num = []
    for df in dfs:
        row_num.append(df.shape[0])
    
    year_df = dfs[row_num.index(max(row_num))]
    year_df = year_df.iloc[:,:6]

    columns = ['entry_date','title','artist','peak','peak_date','weeks_top_ten']
    year_df.columns = columns

    year_df = year_df[~pd.to_numeric(year_df['peak'], errors='coerce').isna()].reset_index(drop=True)
    year_df['year'] = year
    rows.extend(year_df.to_dict(orient='records'))

df = pd.DataFrame(rows)

df.sample(20)

	entry_date	title	artist	peak	peak_date	weeks_top_ten	year
3173	September 22	"Praying for Time"	George Michael	1	October 13	6	1990
4658	August 1	"The Hills"[G] (#10)	The Weeknd	1	October 3	21	2015
3880	November 10	"Only Time"	Enya	10	November 10	3	2001
850	September 10	"Bus Stop"	The Hollies	5	September 17	4	1966
3351	August 29	"Humpin' Around"	Bobby Brown	3	September 12	8	1992
4079	July 2	"Inside Your Heaven"	Carrie Underwood	1	July 2	4	2005
4027	July 10	"I Believe"	Fantasia	1	July 10	2	2004
1766	September 13	"Wasted Days and Wasted Nights"	Freddy Fender	8	September 27	3	1975
3235	April 20	"Here We Go (Let's Rock & Roll)"	C+C Music Factory	3	May 18	6	1991
1204	February 14	"Arizona"	Mark Lindsay	10	February 14	1	1970
281	July 17	"San Antonio Rose"	Floyd Cramer	8	July 17	1	1961
228	December 31[1]	"Angel Baby"	Rosie and the Originals	5	January 23	7	1961
2867	January 16	"Could've Been"	Tiffany	1	February 6	6	1988
4877	October 26	"Bandit"	Juice Wrld and YoungBoy Never Broke Again	10	October 26	1	2019
2707	August 9	"Higher Love"	Steve Winwood	1	August 30	6	1986
4422	December 11	"Grenade" (#6)	Bruno Mars	1	January 8	17	2011
4170	November 4	"Come to Me"	Diddy featuring Nicole Scherzinger	9	November 4	1	2006
1288	January 9	"Lonely Days"	Bee Gees	3	January 30	7	1971
1592	March 2	"Dark Lady"	Cher	1	March 23	6	1974
862	October 15	"Walk Away Renée"	The Left Banke	5	October 29	4	1966

Find Matching Spotify Songs

import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

spotify = spotipy.Spotify(client_credentials_manager=SpotifyClientCredentials())

Some of the song titles had backslashes in them, which caused issues with createing the URLs for the Spotify API. We will get rid of all slashes to solve this problem.

df['title'] = df['title'].str.replace("/","", regex=False)
df['title'] = df['title'].str.replace("\\","", regex=False)
df['title'] = df['title'].str.replace("(","", regex=False)
df['title'] = df['title'].str.replace(")","", regex=False)
df['title'] = df['title'].str.replace("\'","", regex=False)
df['title'] = df['title'].str.replace(":","", regex=False)
df['title'] = df['title'].str.replace(".","", regex=False)

list_df = df.to_dict(orient='records')

list_df[0]

{'entry_date': 'August 4',
 'title': '"Poor Little Fool"',
 'artist': 'Ricky Nelson',
 'peak': 1,
 'peak_date': 'August 4',
 'weeks_top_ten': 6,
 'year': 1958}

for item in tqdm(list_df):
    try:
        result = spotify.search(f"""track:{item['title']} artist:{item['artist']}""", type="track", limit=1)
        if len(result['tracks']['items']) > 0:
            item['spotify_uri'] = result['tracks']['items'][0]['uri']
        else:
            item['spotify_uri'] = np.nan
    except spotipy.client.SpotifyException as e:
        item['spotify_uri'] = str(e.http_status) + " - " + e.msg

 72%|███████▏  | 3644/5050 [06:11<02:14, 10.44it/s]HTTP Error for GET to https://api.spotify.com/v1/search with Params: {'q': 'track:"Been Around the World"  "Its All About the Benjamins" artist:Puff Daddy featuring The Notorious B.I.G. and Mase', 'limit': 1, 'offset': 0, 'type': 'track', 'market': None} returned 404 due to Not found.
 78%|███████▊  | 3914/5050 [06:38<01:44, 10.85it/s]HTTP Error for GET to https://api.spotify.com/v1/search with Params: {'q': 'track:"I Need a Girl Part Two" artist:P. Diddy featuring Tammy Ruggieri, Ginuwine, Loon, and Mario Winans', 'limit': 1, 'offset': 0, 'type': 'track', 'market': None} returned 404 due to Not found.
 78%|███████▊  | 3916/5050 [06:38<01:37, 11.61it/s]HTTP Error for GET to https://api.spotify.com/v1/search with Params: {'q': 'track:"Down 4 U" artist:Irv Gotti presents The Inc. featuring Ja Rule, Ashanti, Charli Baltimore and Vita', 'limit': 1, 'offset': 0, 'type': 'track', 'market': None} returned 404 due to Not found.
 99%|█████████▉| 5013/5050 [08:27<00:03, 10.12it/s]HTTP Error for GET to https://api.spotify.com/v1/search with Params: {'q': 'track:"We Dont Talk About Bruno" artist:Carolina Gaitán, Mauro Castillo, Adassa, Rhenzy Feliz, Diane Guerrero, Stephanie Beatriz and the Encanto cast', 'limit': 1, 'offset': 0, 'type': 'track', 'market': None} returned 404 due to Not found.
100%|██████████| 5050/5050 [08:30<00:00,  9.89it/s]

Small cleaning

Not all of our songs had matches, let’s remove those.

df = pd.DataFrame(list_df)

df = df.dropna()

df = df[~df['spotify_uri'].str.contains("Not found")]

df.to_csv('songs.csv', index=False)

Extract Song Features

df = pd.read_csv("songs.csv")

df.head()

	entry_date	title	artist	peak	peak_date	weeks_top_ten	year	spotify_uri
0	August 4	"Poor Little Fool"	Ricky Nelson	1	August 4	6	1958	spotify:track:5ayybTSXNwcarDtxQKqvWX
1	August 4	"Patricia"	Pérez Prado	2	August 4	6	1958	spotify:track:2bwhOdCOLgQ8v6xStAqnju
2	August 4	"Splish Splash"	Bobby Darin	3	August 4	3	1958	spotify:track:40fD7ct05FvQHLdQTgJelG
3	August 4	"Hard Headed Woman"	Elvis Presley	4	August 4	2	1958	spotify:track:3SU1TXJtAsf8jCKdUeYy53
4	August 4	"When"	Kalin Twins	5	August 4	5	1958	spotify:track:3HZJ9BLBpDya4p71VfXSWp

list_df = df.to_dict(orient='records')

spotify.audio_features(list_df[88]['spotify_uri'])[0]

item

{'entry_date': 'July 2',
 'title': '"Falling Back"',
 'artist': 'Drake',
 'peak': 7,
 'peak_date': 'July 2',
 'weeks_top_ten': '1*',
 'year': 2022,
 'spotify_uri': 'spotify:track:1vbn9fEyw1IYhqgZJdu9ZB'}

new_list = []
for item in tqdm(list_df):
    feature_dict = spotify.audio_features(item['spotify_uri'])[0]
    if feature_dict is not None:
        combo_dict = {**item, **feature_dict}
    else:
        combo_dict = item
    new_list.append(combo_dict)

100%|██████████| 4251/4251 [05:27<00:00, 12.97it/s]

df = pd.DataFrame(new_list)
df.head()

	entry_date	title	artist	peak	peak_date	weeks_top_ten	year	spotify_uri	danceability	energy	...	liveness	valence	tempo	type	id	uri	track_href	analysis_url	duration_ms	time_signature
0	August 4	"Poor Little Fool"	Ricky Nelson	1	August 4	6	1958	spotify:track:5ayybTSXNwcarDtxQKqvWX	0.474	0.338	...	0.1300	0.810	154.596	audio_features	5ayybTSXNwcarDtxQKqvWX	spotify:track:5ayybTSXNwcarDtxQKqvWX	https://api.spotify.com/v1/tracks/5ayybTSXNwca...	https://api.spotify.com/v1/audio-analysis/5ayy...	153933.0	4.0
1	August 4	"Patricia"	Pérez Prado	2	August 4	6	1958	spotify:track:2bwhOdCOLgQ8v6xStAqnju	0.699	0.715	...	0.0704	0.810	137.373	audio_features	2bwhOdCOLgQ8v6xStAqnju	spotify:track:2bwhOdCOLgQ8v6xStAqnju	https://api.spotify.com/v1/tracks/2bwhOdCOLgQ8...	https://api.spotify.com/v1/audio-analysis/2bwh...	140000.0	4.0
2	August 4	"Splish Splash"	Bobby Darin	3	August 4	3	1958	spotify:track:40fD7ct05FvQHLdQTgJelG	0.645	0.943	...	0.3700	0.965	147.768	audio_features	40fD7ct05FvQHLdQTgJelG	spotify:track:40fD7ct05FvQHLdQTgJelG	https://api.spotify.com/v1/tracks/40fD7ct05FvQ...	https://api.spotify.com/v1/audio-analysis/40fD...	131720.0	4.0
3	August 4	"Hard Headed Woman"	Elvis Presley	4	August 4	2	1958	spotify:track:3SU1TXJtAsf8jCKdUeYy53	0.616	0.877	...	0.1840	0.919	97.757	audio_features	3SU1TXJtAsf8jCKdUeYy53	spotify:track:3SU1TXJtAsf8jCKdUeYy53	https://api.spotify.com/v1/tracks/3SU1TXJtAsf8...	https://api.spotify.com/v1/audio-analysis/3SU1...	114240.0	4.0
4	August 4	"When"	Kalin Twins	5	August 4	5	1958	spotify:track:3HZJ9BLBpDya4p71VfXSWp	0.666	0.468	...	0.1190	0.946	93.018	audio_features	3HZJ9BLBpDya4p71VfXSWp	spotify:track:3HZJ9BLBpDya4p71VfXSWp	https://api.spotify.com/v1/tracks/3HZJ9BLBpDya...	https://api.spotify.com/v1/audio-analysis/3HZJ...	146573.0	4.0

5 rows × 26 columns

df.to_csv('songs.csv', index=False)

Created in deepnote.com Created in Deepnote