I'm attempting to get the audio feature data for about 4.5 years' worth of Spotify Top 200 Charts. It's for 68 countries plus the global ranking, so about 20 million records in all. I'm querying a SQLite database with all of that data.
This is prep for a data analysis project, and I've currently limited my scope to just pulling the 3rd Friday of every month, because the fastest time I could get for pulling an entire day's worth of audio features for the charts is 15.8 minutes. At that rate, all 1701 days would take roughly 18.7 days of nonstop processing (1701 × 15.8 minutes).
Does anyone see any way I could make this faster? I'm currently calling the spotipy.audio_features() function for each track id. The function is limited to 100 ids and I'm not so sure that would be much faster anyway.
Here's an example entry before processing:
column_names = ['title', 'rank', 'date', 'artist', 'url', 'region', 'chart', 'trend', 'streams']
('You Were Right', 179, '2017-01-20', 'RÜFÜS DU SOL', 'https://open.spotify.com/track/77lqbary6vt1DSc1MBN6sx', 'Australia', 'top200', 'NEW_ENTRY', 14781)
And after processing:
column_names = ['title', 'rank', 'date', 'artist', 'url', 'region', 'chart', 'trend', 'streams', 'track_id', 'danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo', 'duration_ms', 'time_signature']
('You Were Right', 179, '2017-01-20', 'RÜFÜS DU SOL', 'https://open.spotify.com/track/77lqbary6vt1DSc1MBN6sx', 'Australia', 'top200', 'NEW_ENTRY', 14781, '77lqbary6vt1DSc1MBN6sx', 0.708, 0.793, 5, -5.426, 0, 0.0342, 0.0136, 0.00221, 0.118, 0.734, 122.006, 239418, 4)
Full Script:
import sqlite3
import os
import spotipy
import numpy as np
import pandas as pd
from spotipy.oauth2 import SpotifyClientCredentials
from requests.exceptions import ReadTimeout
from datetime import datetime


def date_range_checker(cursor, start_date, end_date):
    """Gets the third Friday of each month and checks that the date exists in the database."""
    # Put in the range for that year. It's till 2021.
    date_range = pd.date_range(start_date, end_date, freq='WOM-3FRI')
    cursor.execute("""SELECT DISTINCT Date(date) FROM charts""")
    sql_date_fetch = cursor.fetchall()
    sql_dates = [r[0] for r in sql_date_fetch]
    validated_dates = []
    for date in date_range:
        # print(str(date)[0:-9])
        if str(date)[0:-9] in sql_dates:
            validated_dates.append(str(date)[0:-9])
    return validated_dates


def main():
    """Connects to the database. For each date in validated_dates, it queries all the records with that date.
    Then splits the track IDs from the Spotify link into a new list of tuples. Then for each tuple in that list, it calls the Spotify API with the track ID.
    Finally it creates a numpy array for the entire list so the csv converter can be used."""
    now_start = datetime.now()
    start_time = now_start.strftime("%H:%M:%S")
    print(f'Main Function - start time: {start_time}')
    # This script queries
    print("working on it...")
    dbname = 'charts.db'
    if os.path.exists(dbname):
        db = sqlite3.connect(dbname, isolation_level=None)
        cursor = db.cursor()
        # Selects 3rd friday of the month because it takes about 15.8 minutes per day. That's 14.2 hours total to get one friday a month for all 4.5 years.
        # Or 18.6 full days of processing for every single day for all 1701 days.
        # Fridays are a preferable release day in the industry. Cite this later.
        # Date range list created and checked in this function
        validated_dates = date_range_checker(cursor, '2017-02-01', '2017-12-31')  # change year here
        column_names = ['title', 'rank', 'date', 'artist', 'url', 'region', 'chart', 'trend', 'streams', 'track_id', 'danceability',
                        'energy', 'key', 'loudness', 'mode', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo',
                        'duration_ms', 'time_signature']
        for date_chosen in validated_dates:
            cursor.execute("""SELECT * FROM charts WHERE Date("date") = ?""", (date_chosen,))
            db_result = cursor.fetchall()
            data_with_track_ids = []
            final_data = []
            # Splits ID from Spotify link.
            for entry in db_result:
                track_id = entry[4].split('/')[-1]
                entry += (track_id,)
                data_with_track_ids.append(entry)
            print("I've got all the track IDs. Will start calls to Spotify API now.")
            # Calls to spotify with the new extracted track_id
            for entry in data_with_track_ids:
                track_id = entry[-1]
                try:
                    audio_features = spotify.audio_features(track_id)
                except ReadTimeout:
                    print('Spotify timed out... trying again...')
                    audio_features = spotify.audio_features(track_id)
                entry += (audio_features[0]['danceability'], audio_features[0]['energy'], audio_features[0]['key'],
                          audio_features[0]['loudness'], audio_features[0]['mode'], audio_features[0]['speechiness'], audio_features[0]['acousticness'],
                          audio_features[0]['instrumentalness'], audio_features[0]['liveness'],
                          audio_features[0]['valence'], audio_features[0]['tempo'], audio_features[0]['duration_ms'], audio_features[0]['time_signature'])
                final_data.append(entry)
            np_data = np.array(final_data)
            my_dataframe = pd.DataFrame(np_data, columns=column_names)
            my_dataframe.to_csv(f'spotify_csv_data/spotify_top_200 {date_chosen}.csv')
            now_end = datetime.now()
            end_time = now_end.strftime("%H:%M:%S")
            print(f'Main Function - Start time: {start_time}. End time: {end_time}.')
            print(f'The date {date_chosen} took {now_end - now_start} to run.')
        db.close()


if __name__ == "__main__":
    now_start = datetime.now()
    start_time = now_start.strftime("%H:%M:%S")
    print(f'Script - start time: {start_time}')
    os.environ['SPOTIPY_CLIENT_ID'] = 'ENTER YOUR CLIENT_ID'
    os.environ['SPOTIPY_CLIENT_SECRET'] = 'ENTER YOUR CLIENT_SECRET'
    # Allows for retries. Seems to be enough that it doesn't crash.
    spotify = spotipy.Spotify(client_credentials_manager=SpotifyClientCredentials(), requests_timeout=10, retries=10)
    # Leave above set.
    main()
    now_end = datetime.now()
    end_time = now_end.strftime("%H:%M:%S")
    print(f'Script - Start time: {start_time}. End time: {end_time}.')
    print(f'This script took {now_end - now_start} to run.\n')
A few ideas to improve performance:
Use parallel processing
Since you are using Python, the code running is single threaded.
Using Python's multiprocessing library, you could (for example) run 4 instances of the same code but with evenly divided start/end dates. This can make your data processing ~4x faster. You would just need to write the data in such a way that there is no overlap.
Note: If you are rate limited by the Spotify API (you most likely will be), you can use different API keys for each instance. (Make multiple accounts or borrow a friend's API key.)
SQL query optimizations
It's worth looking into your queries to see what is going wrong. I'm personally not familiar with SQL, just giving you ideas.
Profile your program to understand more.
See How can you profile a Python script?
Use some sort of caching technique to avoid redundant API calls and to avoid populating duplicate data. (See a potential solution below, in the last block of code, which uses ids_seen.)
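To make the "evenly divided start/end dates" idea concrete, here is a minimal sketch. The names split_evenly, process_dates, and run_parallel are hypothetical, and process_dates is only a stand-in for the per-date fetch-and-write loop in the question. Because each worker gets its own contiguous slice of dates, no two workers write the same CSV file.

```python
from multiprocessing import Pool


def split_evenly(items, n_parts):
    """Split items into n_parts contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(items), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        size = base + (1 if i < extra else 0)  # the first `extra` chunks get one more item
        chunks.append(items[start:start + size])
        start += size
    return chunks


def process_dates(dates):
    """Hypothetical worker: replace the body with the per-date query/fetch/CSV logic."""
    return len(dates)


def run_parallel(dates, n_workers=4):
    """Run process_dates on n_workers separate slices of the date list."""
    with Pool(n_workers) as pool:
        return pool.map(process_dates, split_evenly(dates, n_workers))
```

Call run_parallel(validated_dates, 4) from under the `if __name__ == "__main__":` guard. Whether this gives anything close to a 4x speedup depends on how Spotify rate-limits the combined traffic, so treat the worker count as something to tune.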
# Splits ID from Spotify link.
for entry in db_result:
    track_id = entry[4].split('/')[-1]
    entry += (track_id,)
    data_with_track_ids.append(entry)
In this code, what type is entry? How big is db_result?
Another thing worth mentioning regarding your following code:
# Calls to spotify with the new extracted track_id
for entry in data_with_track_ids:
    track_id = entry[-1]
    try:
        audio_features = spotify.audio_features(track_id)
    except ReadTimeout:
        print('Spotify timed out... trying again...')
        audio_features = spotify.audio_features(track_id)
    entry += (audio_features[0]['danceability'], audio_features[0]['energy'], audio_features[0]['key'],
              audio_features[0]['loudness'], audio_features[0]['mode'], audio_features[0]['speechiness'], audio_features[0]['acousticness'],
              audio_features[0]['instrumentalness'], audio_features[0]['liveness'],
              audio_features[0]['valence'], audio_features[0]['tempo'], audio_features[0]['duration_ms'], audio_features[0]['time_signature'])
    final_data.append(entry)
In the try-except block, you are making a request for every entry in data_with_track_ids. How many elements are in the data_with_track_ids data structure? Expect to be throttled and timed out by Spotify's servers if you brute-force API calls.
You should add a short wait period after timing out to reduce the chances of getting rate-limited or IP banned. Oh wait, it looks like the retries are set when you initialize the spotify variable and are handled automatically under the hood in the spotipy source code.
EDIT
Here is a way you can avoid making redundant requests by using Python's set data structure. This can serve as your "cache":
from spotipy import SpotifyException

# Calls to spotify with the new extracted track_id
ids_seen = set()
for entry in data_with_track_ids:
    track_id = entry[-1]
    if track_id not in ids_seen:
        try:
            # retries are already built-in and defined in your __main__(), spotify variable
            audio_features = spotify.audio_features(track_id)
        except SpotifyException as se:
            print('Spotify timed out...Maximum retries exceeded...moving on to next track_id...')
            print("TRACK ID IS: {}".format(track_id))
            print("Error details: {}".format(se))
            ids_seen.add(track_id)
            continue
        # on success, add track id to ids seen
        ids_seen.add(track_id)
    else:
        print("We have seen this ID before... ID = {}".format(track_id))
        continue  # skips the rest of the loop body, so repeated IDs are not added to final_data
    entry += (audio_features[0]['danceability'], audio_features[0]['energy'], audio_features[0]['key'],
              audio_features[0]['loudness'], audio_features[0]['mode'], audio_features[0]['speechiness'], audio_features[0]['acousticness'],
              audio_features[0]['instrumentalness'], audio_features[0]['liveness'],
              audio_features[0]['valence'], audio_features[0]['tempo'], audio_features[0]['duration_ms'], audio_features[0]['time_signature'])
    final_data.append(entry)
If you are limited to 1000 requests per day, then simply have the program sleep for 24 hours, or stop it (saving the current iteration and data context) and run it again once you are allowed more requests.
See https://developer.spotify.com/documentation/web-api/guides/rate-limits/
Profile, profile, profile. But the bottleneck is likely Spotify's API. Whilst you can parallelise to speed up fetching, they won't thank you much for it, and you will likely find yourself rate limited if you do it too much. So profile and see what is taking the time, but be prepared to cut back on your dataset.
Ask yourself what you can do to speed up the algorithm:
can you just fetch the top N hits?
do you really need all that data?
is any data duplicated?
Even if data isn't duplicated, create a local cache, indexed by the track_id, and store every response in it. Rather than requesting from the Spotify endpoint straight away, look the track up in the cache first (store the data in another SQLite database, or another table in the same db). If nothing is returned, fetch, save the data to the cache, and then return it. That way:
if you are doing redundant lookups, it will be faster.
even if you aren't, you can re-run your code blazingly fast (at least as regards your current speed) if you change something and need to run the lot again.
So cache, profile, and look at your algorithm.
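The look-up-then-fetch pattern described above could be sketched roughly as follows. The table name audio_features_cache, the helper names, and the `fetch` callable are all assumptions for illustration. In the real script `fetch` would wrap the call to spotify.audio_features.

```python
import json
import sqlite3


def open_cache(path=":memory:"):
    """Open (or create) the cache table; pass a file path to persist it between runs."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS audio_features_cache ("
        "track_id TEXT PRIMARY KEY, features TEXT NOT NULL)"
    )
    return db


def get_features(db, track_id, fetch):
    """Return cached features for track_id, calling fetch(track_id) only on a miss."""
    row = db.execute(
        "SELECT features FROM audio_features_cache WHERE track_id = ?", (track_id,)
    ).fetchone()
    if row is not None:
        return json.loads(row[0])  # cache hit: no API call
    features = fetch(track_id)  # cache miss: one API call
    db.execute(
        "INSERT INTO audio_features_cache (track_id, features) VALUES (?, ?)",
        (track_id, json.dumps(features)),
    )
    db.commit()
    return features
```

Storing the features as a JSON string keeps the sketch short. A table with one column per feature would work just as well and would be easier to join against the charts table.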
You're calling spotify.audio_features(track_id) for every single track, even if you've already fetched its data. Each Friday's results should only introduce a few new songs, but you're re-fetching information on all 200. Don't do that. Make another database table for song info. After you've fetched a track_id's info, write it to the database. Before fetching a track_id's info, see if you've already stored it in the database. Then you'll only be making the bare minimum of necessary API calls, rather than 200 * num_weeks * num_countries.
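As a rough sketch of "only the bare minimum of API calls", the snippet below drops IDs that are already known, removes duplicates, and sends the rest in batches of up to 100, the limit mentioned in the question. Here `known` is a plain dict standing in for the song-info table, and `fetch_batch` is a hypothetical stand-in for a call such as spotify.audio_features(batch). The sketch assumes that call returns one result per requested ID, in the same order.

```python
def chunked(ids, size=100):
    """Yield consecutive slices of ids with at most `size` elements each."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def fetch_missing(track_ids, known, fetch_batch, batch_size=100):
    """Fetch features only for IDs not already in `known`, one request per batch."""
    # dict.fromkeys removes duplicates while keeping first-seen order
    missing = list(dict.fromkeys(t for t in track_ids if t not in known))
    for batch in chunked(missing, batch_size):
        for track_id, features in zip(batch, fetch_batch(batch)):
            known[track_id] = features
    return known
```

With this shape, 252 unseen tracks cost three requests instead of 252, and tracks that reappear on later charts cost nothing.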
I'm trying to get data for multiple stocks from Yahoo Finance and write it to Excel.
The problem is that at the moment I have to hardcode the stocks in question. I would like to download the information for all 25 stocks in the C25 index (^OMXC25), or potentially other indexes, so I would like to know how I can access the list of components, retrieve it, and then download each stock. The code I currently use to get each stock is as follows:
import pandas as pd
import pandas_datareader as pdr
import datetime as dt
download_source = (r'C:\Users\SKlin\Downloads\OMXC25.xlsx')
start = dt.datetime(2010,1,1)
end = dt.datetime.today()
writer = pd.ExcelWriter(download_source, engine ='xlsxwriter')
#GN Store Nord
dfGN = pdr.get_data_yahoo('GN.CO',start,end)
dfGN.to_excel(writer, sheet_name='GN.CO')
#Vestas Wind systems
dfVestas = pdr.get_data_yahoo('VWS.CO',start,end)
dfVestas.to_excel(writer, sheet_name='VWS.CO')
writer.save()
This saves the data just fine. With 25 stocks it's doable, but it seems tedious for an index with 500 stocks. Please help.
Use Beautiful Soup to scrape a list of the ticker names from Wikipedia:
https://en.wikipedia.org/wiki/OMX_Copenhagen_25
Then just iterate through them.
If you can get a list of all symbols:
stocks = ['GN.CO', 'VWS.CO']
for stock in stocks:
    df = pdr.get_data_yahoo(stock, start, end)
    df.to_excel(writer, sheet_name=stock)
I have this DataFrame:
  Title  Authors               Institutions
0 a      ['name_1', 'name_2']  [['Osaka Univ.', '34.82,135.52']]
1 b      ['name_1']            [['Tohoku Univ.', '38.25,140.87'], ['Kobe Univ.', '34.72,135.23']]
2 c      …
3 d      …
4 e      …
which I convert to a JSON file:
df_output.to_json('output.json', orient='records', lines=True)
Getting:
{"Title": "a","Authors":["name_1", "name_2"],"Institutions":[["Osaka University", "34.82,135.52"]]}
{"Title": "b","Authors":["name_1"],"Institutions":[["Tohoku Univ.", "38.25, 140.87"], ["Kobe Univ.", "34.72, 135.23"]]}
...
I then index this JSON into Elasticsearch so that I can search the records by Title.
import requests
import json
from elasticsearch import Elasticsearch, helpers
url= 'https://"""my_session_in_amazon""".amazonaws.com'
es = Elasticsearch([url])
filename = 'C:/xx/xxx/output.json'
data = [json.loads(line) for line in open(filename, 'r')]
helpers.bulk(es, data, index='title', doc_type='HEP_books')
But then in Kibana I don't know how to access the institutions' coordinates to plot a map of the institutions.
From the sample DataFrame you pasted, it looks like 'Institutions' is an array that contains both the institution name and its coordinates. That makes it impossible to plot those coordinates on a map, because Elasticsearch dynamic mapping will treat 'Institutions' as a string/keyword rather than a geo_point/number.
First, extract the coordinates into a dedicated field, for example Institutions.geo. You can use ingest pipelines to extract it and modify the docs.
Second, specify in the Elasticsearch template for those indices that Institutions.geo (for example) is a geo_point, and create a new index for this data.
Third, once the data is clean, sits in a separate dedicated field, and has the right mapping, refresh the field list in Kibana so that Kibana recognises the new Institutions.geo field.
Fourth, after refreshing the field list in Kibana, you can go ahead and create a new Map visualisation based on this data.
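As an illustration of the first two steps, here is one possible sketch. It does the extraction client-side in Python before bulk indexing, rather than with an ingest pipeline as suggested above. The function name split_institutions and the sub-field names name and geo are assumptions. The mapping uses the typeless format of Elasticsearch 7 and later, so older clusters that still require doc_type would need the type name added.

```python
def split_institutions(record):
    """Return a copy of record with each institution as {"name", "geo": {"lat", "lon"}}."""
    institutions = []
    for name, coords in record["Institutions"]:
        # coords is a "lat,lon" string such as "34.82,135.52"
        lat, lon = (float(part) for part in coords.split(","))
        institutions.append({"name": name, "geo": {"lat": lat, "lon": lon}})
    return {**record, "Institutions": institutions}


# Index body declaring Institutions.geo as a geo_point (assumed field layout).
INDEX_BODY = {
    "mappings": {
        "properties": {
            "Title": {"type": "text"},
            "Institutions": {
                "properties": {
                    "name": {"type": "keyword"},
                    "geo": {"type": "geo_point"},
                }
            },
        }
    }
}
```

The index would be created with this body before any documents are loaded, and each parsed JSON line would be passed through split_institutions before going to helpers.bulk.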
I have this code to download data from yahoo:
#gets data from yahoo finance
stocks = list(newmerge.index)
start = dt.datetime(2012,1,1)
end = dt.datetime.today()
yahoodata = pdr.get_data_yahoo(stocks,start,end)
cleanData = yahoodata.loc['Adj Close']
dataFrame = pd.DataFrame(cleanData, columns=stocks)
It works fine but I noticed a problem recently, it doesn't download data for stocks "BRK.B" , and "BR.B" .
I have a list of all the stocks called "stocks" , and here's what I've done, but it still doesn't show data for stocks w/ dot in them:
def stocksdot(stocks):
    stocks_dash = str(stocks).replace('.','-')
    stockslist = stocks_dash.split(',')
    return stockslist
stocksdot(stocks)
My expected output would be to download all stocks, even those with a dot in them. Any ideas how to circumvent?
Your problem is Yahoo Finance doesn't use the "." notation to track shares of different classes. So, "BRK.B" and "BR.B" are actually "BRKB" and "BRB".
Using the Yahoo Finance Python SDK, I made a little script to test whether Yahoo Finance could find information about a stock with the ticker "BRK.B" or "BR.B".
from yahoo_finance import Share
stock = Share('BRK.B')
print(stock.get_price())
The result is:
>>>> None
Stock tickers with a dot in them are used as a shorthand for a type or class of a specific stock. You can learn more here.
To circumvent this, it looks like you can remove the ".". For example, when I use "BRKB" instead of "BRK.B", I get the result:
>>>> 173.05
Which is the current price of Berkshire Hathaway class B stock.
To replace the "." programmatically, use Python's .replace() method. Build a new list, because reassigning the loop variable inside a for loop does not change the list itself:
stocks = [stock.replace(".", "") for stock in stocks]  # Removes every "." from each ticker
Quoting the answer above: "Your problem is Yahoo Finance doesn't use the "." notation to track shares of different classes. So, "BRK.B" and "BR.B" are actually "BRKB" and "BRB"." My comment: Yahoo Finance now lists "BRK.B" and "BR.B" as "BRK-B" and "BR-B".
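A small sketch of the dash variant follows. The function name to_yahoo_symbols is hypothetical. It works on each ticker separately, whereas the stocksdot function in the question converts the whole list to one string, so splitting on "," leaves brackets and quotes inside the pieces (for example "['BRK-B'"), and its return value is never assigned.

```python
def to_yahoo_symbols(stocks, separator="-"):
    """Return a new list with the '.' in each ticker replaced by `separator`."""
    return [symbol.replace(".", separator) for symbol in stocks]


stocks = ["BRK.B", "BR.B", "AAPL"]
yahoo_stocks = to_yahoo_symbols(stocks)  # pass this list to pdr.get_data_yahoo
```

Note that this replaces every dot, so a ticker whose dot is an exchange suffix would be changed as well. Apply it only to lists where the dot marks a share class.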