How to process this data to create a stacked bar chart? - python

I am writing a program that monitors and records the usage time of foreground applications and saves it in a SQL database. I then want to retrieve the data from previous days and compile it all into a stacked bar chart. The x-axis will show the days over which usage was recorded, and the stacks in each bar will represent the apps that were used on that day.
In my program, I created 2 tables, one to record each day's app usage (with each new day's data having a different primary key id), and another table to record the primary key for each day.
Table 1:
_id  Application    usage_time
0    Google Chrome  245.283942928347
1    PyCharm        450.3939754962921
1    SQLiteStudio   140.2376308441162
1    Google Chrome  5.008131980896
Table 2:
Date                        daily_id
2021-07-18 07:25:25.376734  0
2021-07-18 07:27:57.419574  1
Within my stacked bar chart program, I have come up with this code to refine the data to put into the stacked bar chart:
conn = sqlite3.connect('daily_usage_monitor.sqlite', detect_types=sqlite3.PARSE_DECLTYPES)
all_app_data = conn.execute('SELECT all_usage_information.date, monitor.application, monitor.usage_time FROM all_usage_information INNER JOIN monitor ON all_usage_information.daily_id = monitor._id ORDER BY all_usage_information.date, monitor.usage_time ASC').fetchall()
for date, app, usage_time in all_app_data:
    print(f'{date} - {app}: {usage_time}')
conn.close()
daily_data = {}
# Create nested dictionary - key = each date, value = dictionary of different apps & their time usage durations
for date, app, time in all_app_data:
    conditions = [date not in daily_data, app != 'loginwindow']
    if all(conditions):
        daily_data[date] = {app: time}
    elif not conditions[0] and conditions[1]:
        daily_data[date].update({app: time})
print(daily_data)  # TODO: REMOVE AFTER TESTING
total_time = 0
# Club any applications that account for <5% of total time into 1 category called 'Other'
for date, app_usages in daily_data.items():
    total_time = sum(time for app, time in app_usages.items())
    refined_data = {}
    for key, value in app_usages.items():
        if value/total_time < 0.05:
            refined_data['Others'] = refined_data.setdefault('Others', 0) + value
        else:
            refined_data[key] = value
    daily_data[date] = refined_data
print(daily_data)  # TODO: REMOVE AFTER TESTING
# Takes the nested dictionary and breaks it into a labels list and a dictionary with apps & time usages for each day
# Sorts data so it can be used to create composite bar chart
final_data = {}
labels = []
for date, app_usages in daily_data.items():
    labels.append(date)
    for app, time in app_usages.items():
        if app not in final_data:
            final_data[app] = [time]
        else:
            final_data[app].append(time)
This is the kind of output I am currently getting:
{'Google Chrome': [245.283942928347, 190.20031905174255], 'SQLiteStudio': [145.24058270454407], 'PyCharm': [1166.0021023750305]}
The problem here is that for the days where an application had 0 usage time, it is not being recorded in the list. Therefore, the insertion order of the stacked bar chart will not be correct and will show the wrong apps for the wrong dates. How can I fix this?
This is one method I tried, but of course it's not working because you cannot index into a dictionary:
for app, usage in final_data.items():
    for date, app_usages in daily_data.items():
        if app not in app_usages:
            usage.insert(app_usages.index(app), 0)
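One possible way to keep every app's list aligned with the dates is to collect the full set of app names first and then look each app up per day with a default of 0. This is only a sketch: it assumes daily_data has the nested {date: {app: time}} shape built above, and the sample values are made up for illustration.

```python
# Hypothetical sample in the same shape as daily_data above.
daily_data = {
    "2021-07-18": {"Google Chrome": 245.28},
    "2021-07-19": {"Google Chrome": 190.20, "SQLiteStudio": 145.24, "PyCharm": 1166.00},
}

labels = list(daily_data)  # dates, in insertion order

# Every app seen on any day, in first-seen order.
all_apps = list(dict.fromkeys(app for usages in daily_data.values() for app in usages))

# One list per app with exactly one entry per date; 0 where the app was not used.
final_data = {
    app: [daily_data[date].get(app, 0) for date in labels]
    for app in all_apps
}

print(labels)
print(final_data)
```

Each list then has len(labels) entries, so it can be used as the heights of one layer of the stacked bar chart without shifting values onto the wrong date.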

Related

Align slider marks frequency depending on other sliders mark values

I’m trying to get an idea on how to achieve this task. I created multiple sliders connected to multiple datatables.
Each slider queries the data for the selected dates from a specific collection (MongoDB).
Right now my sliders work just fine.
#define database.
database = client["signals"]
#define collections
daily_signals = database["daily_signals"]
days3 = database["3d_signals"]
#query dates as a dataframe for slider marks.
dfd = pd.DataFrame(daily_signals.distinct("Date"), columns=['Date'])
df3d = pd.DataFrame(days3.distinct("Date"), columns=['Date'])
numdate = [x for x in range(len(dfd['Date'].unique()))]
numdate3d = [x for x in range(len(df3d['Date'].unique()))]
app = Dash(__name__)
#sliders
app.layout = html.Div([
    dcc.Slider(min=numdatem[0],
               max=numdatem[-1],
               value=numdatem[-1],
               marks = {numd:date.strftime('%m') for numd,date in zip(numdatew, dfm['Date'].dt.date.unique())},
               step=None,
               included=False
               ),
    dcc.Slider(min=numdate2w[0],
               max=numdate2w[-1],
               value=numdate2w[-1],
               marks = {numd:date.strftime('%d/%m') for numd,date in zip(numdatew, df2w['Date'].dt.date.unique())},
               step=None,
               included=False
               ),])
#Callbacks and functions to create datatables etc.
if __name__ == '__main__':
    app.run_server(debug=True)
There are five different timeframes: D, 3D, W, 2W and M. I only added two of them here.
What I want to achieve is something like a vintage radio slider, but it will look more like a calendar that only shows workdays, with daily, 3-day, weekly, 2-week and monthly periods. For instance, each mark on the daily slider represents one workday, so a mark on the 3-day slider should represent 3 days, a mark on the weekly slider 5 days, a mark on the 2-week slider 10 days, and a mark on the monthly (M) slider 30 days. All I want is to align the slider marks in this order.
Thanks in advance.
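One way to think about the alignment is to give every slider the same axis (the position of each workday in the daily list) and to place a mark only at every n-th position. The sketch below uses only the standard library; workdays and aligned_marks are hypothetical helper names, and the start date is an arbitrary example rather than data from the question.

```python
from datetime import date, timedelta


def workdays(start, count):
    """Return `count` consecutive workdays (Mon-Fri) starting at `start`."""
    days, current = [], start
    while len(days) < count:
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            days.append(current)
        current += timedelta(days=1)
    return days


def aligned_marks(days, every, fmt="%d/%m"):
    """Marks keyed by position on the shared daily axis, one per `every` workdays."""
    return {i: d.strftime(fmt) for i, d in enumerate(days) if i % every == 0}


days = workdays(date(2022, 1, 3), 30)   # 2022-01-03 is a Monday
daily_marks = aligned_marks(days, 1)    # D: every workday
three_day_marks = aligned_marks(days, 3)  # 3D: every third workday
weekly_marks = aligned_marks(days, 5)   # W: every fifth workday

print(weekly_marks)
```

Under this assumption each dcc.Slider would use min=0, max=len(days) - 1 and step=None with its own marks dictionary, so a mark at position 5 on the weekly slider sits directly under position 5 on the daily slider.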

Filter two Google Earth Engine image collections by similar acquisition dates - within one day of each other

This question is based on a previous post: https://gis.stackexchange.com/questions/386827/filter-two-image-collections-by-the-same-aqusition-date-in-google-earth-engine
I would like to expand the question to try to filter collections to get images that are within one day of each other (i.e., image from collection 1 is within +/- one day of image from collection 2) in Google Earth Engine (python api).
I have written the following code in geemap (based on previous post) to get coincident image dates, given two image collections:
col1 = ee.ImageCollection('LANDSAT/LE07/C02/T1_L2') \
    .filter(ee.Filter.calendarRange(1998, 2013,'year')) \
    .filterBounds(pts)
col2 = ee.ImageCollection('LANDSAT/LT05/C02/T1_L2') \
    .filter(ee.Filter.calendarRange(1998, 2012,'year')) \
    .filterBounds(pts)
def datefunc(image):
    date = ee.Date(image.get('system:time_start')).format("YYYY-MM-dd")
    date = ee.Date(date)
    return image.set('date', date)
col1_date = col1.map(datefunc)
col2_date = col2.map(datefunc)
filterTimeEq = ee.Filter.equals(
    leftField='date',
    rightField='date'
)
simpleJoin = ee.Join.simple()
col_new = ee.ImageCollection(simpleJoin.apply(col1_date, col2_date, filterTimeEq))
Is there any way I can modify the above code to get near coincident imagery?
Solved! I got it working using ee.Filter.maxDifference. I used the original 'system:time_start' values (which are in milliseconds) and set the max difference as one day. The full code looks like this:
filterNear = ee.Filter.maxDifference(
    difference = 86400000,
    leftField='system:time_start',
    rightField='system:time_start'
)
col_near = ee.ImageCollection(simpleJoin.apply(col1, col2, filterNear))
Where 86400000 is the number of milliseconds in 24 hours.
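To make the arithmetic and the matching rule concrete, here is a plain-Python analogue that does not use Earth Engine at all. near_matches and to_ms are hypothetical helpers, the timestamps are invented, and treating a difference of exactly one day as a match is an assumption rather than documented Earth Engine behaviour.

```python
from datetime import datetime, timedelta, timezone

# Milliseconds in 24 hours, the value passed as `difference` above.
ONE_DAY_MS = int(timedelta(days=1).total_seconds() * 1000)


def to_ms(year, month, day, hour=0):
    """UTC timestamp in milliseconds, the unit used by 'system:time_start'."""
    moment = datetime(year, month, day, hour, tzinfo=timezone.utc)
    return int(moment.timestamp() * 1000)


def near_matches(primary, secondary, max_diff_ms=ONE_DAY_MS):
    """Keep primary timestamps that have at least one secondary timestamp nearby."""
    return [t for t in primary if any(abs(t - s) <= max_diff_ms for s in secondary)]


primary = [to_ms(2000, 1, 1, 10), to_ms(2000, 1, 10, 10)]
secondary = [to_ms(2000, 1, 2, 9), to_ms(2000, 1, 20, 9)]

print(ONE_DAY_MS)
print(near_matches(primary, secondary))
```

Only the first primary timestamp is kept, because it is 23 hours away from a secondary timestamp, while the second one is more than a week away from both.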

How to speed up Spotipy API calls for millions of records?

I'm attempting to get the audio feature data for about 4.5 years' worth of Spotify Top 200 Charts. It's for 68 countries plus the global ranking, so about 20 million records in all. I'm querying a SQLite database with all of that data.
This is prep for a data analysis project and I've currently limited my scope to just pulling the 3rd Friday of every month because the fastest time I could get pulling an entire day's worth of audio features for the charts is 15.8 minutes. That's 18.5 days of straight processing to get all 1701 days.
Does anyone see any way I could make this faster? I'm currently calling the spotipy.audio_features() function for each track id. The function is limited to 100 ids and I'm not so sure that would be much faster anyway.
Here's an example entry before processing:
column_names = ['title', 'rank', 'date', 'artist', 'url', 'region', 'chart', 'trend', 'streams']
('You Were Right', 179, '2017-01-20', 'RÜFÜS DU SOL', 'https://open.spotify.com/track/77lqbary6vt1DSc1MBN6sx', 'Australia', 'top200', 'NEW_ENTRY', 14781)
And after processing:
column_names = ['title', 'rank', 'date', 'artist', 'url', 'region', 'chart', 'trend', 'streams', 'track_id', 'danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo', 'duration_ms', 'time_signature']
('You Were Right', 179, '2017-01-20', 'RÜFÜS DU SOL', 'https://open.spotify.com/track/77lqbary6vt1DSc1MBN6sx', 'Australia', 'top200', 'NEW_ENTRY', 14781, '77lqbary6vt1DSc1MBN6sx', 0.708, 0.793, 5, -5.426, 0, 0.0342, 0.0136, 0.00221, 0.118, 0.734, 122.006, 239418, 4)
Full Script:
import sqlite3
import os
import spotipy
import numpy as np
import pandas as pd
from spotipy.oauth2 import SpotifyClientCredentials
from requests.exceptions import ReadTimeout
from datetime import datetime

"""Gets the third Friday of each month and checks that the date exists in the database."""
def date_range_checker(cursor, start_date, end_date):
    # Put in the range for that year. It's till 2021.
    date_range = pd.date_range(start_date, end_date ,freq='WOM-3FRI')
    cursor.execute("""SELECT DISTINCT Date(date) FROM charts""")
    sql_date_fetch = cursor.fetchall()
    sql_dates = [r[0] for r in sql_date_fetch]
    validated_dates = []
    for date in date_range:
        # print(str(date)[0:-9])
        if str(date)[0:-9] in sql_dates:
            validated_dates.append(str(date)[0:-9])
    return validated_dates

"""Connects to the database. For each date in validated_dates, it queries all the records with that date.
Then splits the track IDs from the Spotify link into a new list of tuples. Then for each tuple in that list, it calls the Spotify API with the track ID.
Finally it creates a numpy array for the entire list so the csv converter can be used."""
def main():
    now_start = datetime.now()
    start_time = now_start.strftime("%H:%M:%S")
    print(f'Main Function - start time: {start_time}')
    """This script queries """
    print("working on it...")
    dbname = 'charts.db'
    if os.path.exists(dbname):
        db = sqlite3.connect(dbname, isolation_level=None)
        cursor = db.cursor()
        """Selects 3rd friday of the month because it takes about 15.8 minutes per day. That's 14.2 hours total to get one friday a month for all 4.5 years.
        Or 18.6 full days of processing for every single day for all 1701 days.
        Fridays are a preferable release day in the industry. Cite this later."""
        # Date range list created and checked in this function
        validated_dates = date_range_checker(cursor, '2017-02-01', '2017-12-31')  # change year here
        column_names = ['title', 'rank', 'date', 'artist', 'url', 'region', 'chart', 'trend', 'streams', 'track_id', 'danceability',
                        'energy', 'key', 'loudness', 'mode', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo',
                        'duration_ms', 'time_signature']
        for date_chosen in validated_dates:
            cursor.execute("""SELECT * FROM charts WHERE Date("date") = ?""", (date_chosen,))
            db_result = cursor.fetchall()
            data_with_track_ids = []
            final_data = []
            # Splits ID from Spotify link.
            for entry in db_result:
                track_id = entry[4].split('/')[-1]
                entry += (track_id,)
                data_with_track_ids.append(entry)
            print("I've got all the track IDs. Will start calls to Spotify API now.")
            # Calls to spotify with the new extracted track_id
            for entry in data_with_track_ids:
                track_id = entry[-1]
                try:
                    audio_features = spotify.audio_features(track_id)
                except ReadTimeout:
                    print('Spotify timed out... trying again...')
                    audio_features = spotify.audio_features(track_id)
                entry += (audio_features[0]['danceability'], audio_features[0]['energy'], audio_features[0]['key'],
                          audio_features[0]['loudness'], audio_features[0]['mode'], audio_features[0]['speechiness'], audio_features[0]['acousticness'],
                          audio_features[0]['instrumentalness'], audio_features[0]['liveness'],
                          audio_features[0]['valence'], audio_features[0]['tempo'], audio_features[0]['duration_ms'], audio_features[0]['time_signature'])
                final_data.append(entry)
            np_data = np.array(final_data)
            my_dataframe = pd.DataFrame(np_data, columns=column_names)
            my_dataframe.to_csv(f'spotify_csv_data/spotify_top_200 {date_chosen}.csv')
            now_end = datetime.now()
            end_time = now_end.strftime("%H:%M:%S")
            print(f'Main Function - Start time: {start_time}. End time: {end_time}.')
            print(f'The date {date_chosen} took {now_end - now_start} to run.')
        db.close()

if __name__ == "__main__":
    now_start = datetime.now()
    start_time = now_start.strftime("%H:%M:%S")
    print(f'Script - start time: {start_time}')
    os.environ['SPOTIPY_CLIENT_ID'] = 'ENTER YOUR CLIENT_ID'
    os.environ['SPOTIPY_CLIENT_SECRET'] = 'ENTER YOUR CLIENT_SECRET'
    # Allows for retries. Seems to be enough that it doesn't crash.
    spotify = spotipy.Spotify(client_credentials_manager=SpotifyClientCredentials(), requests_timeout=10, retries=10)
    """Leave above set."""
    main()
    now_end = datetime.now()
    end_time = now_end.strftime("%H:%M:%S")
    print(f'Script - Start time: {start_time}. End time: {end_time}.')
    print(f'This script took {now_end - now_start} to run.\n')
A few ideas to improve performance:
Use parallel processing
Your script runs as a single process and makes its API calls one at a time.
Using Python's multiprocessing library, you could (for example) run 4 instances of the same code, each with an evenly divided share of the start/end dates. This can make your data processing up to ~4x faster. You would just need to write the data in such a way that there is no overlap.
Note: If you are rate limited by the Spotify API (you most likely will be), you can use different API keys for each instance. (Make multiple accounts or borrow a friend's API key.)
SQL query optimizations
It's worth looking into your queries to see whether they are a bottleneck. I'm personally not familiar with SQL, so this is only an idea.
Profile your program to understand more.
See How can you profile a Python script?
Use some sort of caching technique to avoid redundant API calls and to avoid populating duplicate data. (See a potential solution below, in the last block of code, which uses ids_seen.)
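As a rough illustration of the profiling idea, the standard library's cProfile and pstats modules can show where the time goes. The functions slow_step and main below are stand-ins invented for this sketch, not parts of the script in the question.

```python
import cProfile
import io
import pstats


def slow_step():
    """Stand-in for an expensive call, such as one API request."""
    return sum(i * i for i in range(200_000))


def main():
    return [slow_step() for _ in range(3)]


profiler = cProfile.Profile()
profiler.enable()
result = main()
profiler.disable()

# Collect the report in a string, sorted by cumulative time per function.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer).sort_stats("cumulative")
stats.print_stats(10)  # show the ten most expensive entries
report = buffer.getvalue()
print(report)
```

In a real run, the entries with the largest cumulative time would indicate whether the database queries or the API calls dominate.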
# Splits ID from Spotify link.
for entry in db_result:
    track_id = entry[4].split('/')[-1]
    entry += (track_id,)
    data_with_track_ids.append(entry)
In this code, what type is entry? How big is db_result?
Another thing worth mentioning regarding your following code:
# Calls to spotify with the new extracted track_id
for entry in data_with_track_ids:
    track_id = entry[-1]
    try:
        audio_features = spotify.audio_features(track_id)
    except ReadTimeout:
        print('Spotify timed out... trying again...')
        audio_features = spotify.audio_features(track_id)
    entry += (audio_features[0]['danceability'], audio_features[0]['energy'], audio_features[0]['key'],
              audio_features[0]['loudness'], audio_features[0]['mode'], audio_features[0]['speechiness'], audio_features[0]['acousticness'],
              audio_features[0]['instrumentalness'], audio_features[0]['liveness'],
              audio_features[0]['valence'], audio_features[0]['tempo'], audio_features[0]['duration_ms'], audio_features[0]['time_signature'])
    final_data.append(entry)
In the loop above, you are making one request for every entry in data_with_track_ids. How many elements are in the data_with_track_ids data structure? Expect to be throttled and timed out by Spotify's servers if you brute-force API calls.
Normally you should add a short wait period after a timeout to reduce the chances of getting rate-limited or IP banned. In your case, though, the retries are configured when you initialize the spotify variable and are handled automatically under the hood in the spotipy source code.
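Since the question mentions that audio_features accepts up to 100 ids per call, batching is another way to cut the number of requests. The sketch below is written against a generic fetch_batch callable so it can be run without network access; chunked, fetch_all_features and fake_fetch are hypothetical names, and skipping None results assumes the API returns None for ids it cannot resolve.

```python
def chunked(items, size=100):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_all_features(track_ids, fetch_batch, batch_size=100):
    """Return {track_id: features}, calling fetch_batch once per batch of unique ids."""
    unique_ids = list(dict.fromkeys(track_ids))  # drop duplicates, keep order
    features = {}
    for batch in chunked(unique_ids, batch_size):
        for track_id, item in zip(batch, fetch_batch(batch)):
            if item is not None:  # assumed placeholder for unknown ids
                features[track_id] = item
    return features


# Demo with a fake fetcher that records the size of each request.
calls = []

def fake_fetch(batch):
    calls.append(len(batch))
    return [{"id": track_id, "tempo": 120.0} for track_id in batch]

ids = [f"track{i}" for i in range(250)] + ["track0", "track1"]
features = fetch_all_features(ids, fake_fetch)
print(calls, len(features))
```

With the real client, fetch_batch could be spotify.audio_features, so 252 chart rows with 250 distinct tracks would need 3 requests instead of 252.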
EDIT
Here is a way you can avoid making redundant requests by using Python's set data structure. This can serve as your "cache":
from spotipy.exceptions import SpotifyException

# Calls to spotify with the new extracted track_id
ids_seen = set()
for entry in data_with_track_ids:
    track_id = entry[-1]
    if track_id not in ids_seen:
        try:
            # retries are already built-in and defined in your __main__(), spotify variable
            audio_features = spotify.audio_features(track_id)
        except SpotifyException as se:
            print('Spotify timed out...Maximum retries exceeded...moving on to next track_id...')
            print("TRACK ID IS: {}".format(track_id))
            print("Error details: {}".format(se))
            ids_seen.add(track_id)
            continue
        # on success, add track id to ids seen
        ids_seen.add(track_id)
    else:
        print("We have seen this ID before... ID = {}".format(track_id))
        continue  # skips the rest of the loop body and starts the next iteration
    entry += (audio_features[0]['danceability'], audio_features[0]['energy'], audio_features[0]['key'],
              audio_features[0]['loudness'], audio_features[0]['mode'], audio_features[0]['speechiness'], audio_features[0]['acousticness'],
              audio_features[0]['instrumentalness'], audio_features[0]['liveness'],
              audio_features[0]['valence'], audio_features[0]['tempo'], audio_features[0]['duration_ms'], audio_features[0]['time_signature'])
    final_data.append(entry)
If you are limited to, say, 1000 requests per day, then simply sleep the program for 24 hours, or stop it (saving the current iteration and data context) and run it again once you are allowed more requests.
See https://developer.spotify.com/documentation/web-api/guides/rate-limits/
Profile, profile, profile. But the bottleneck is likely Spotify's API. Whilst you can parallelise to speed up fetching, they won't thank you much for it, and you will likely find yourself rate limited if you do it too much. So profile and see what is taking the time, but be prepared to cut back on your dataset.
Ask yourself what you can do to speed up the algorithm:
can you just fetch the top N hits?
do you really need all that data?
is any data duplicated?
Even if data isn't duplicated, create a local cache, indexed by the track_id, and store every request in that. Rather than requesting from the spotify endpoint, look it up in the cache (store the data in another sqlite database, or another table in the same db). If nothing is returned, fetch, save the data to the cache, and then return it. That way:
if you are doing redundant lookups, it will be faster.
even if you aren't, you can re-run your code blazingly fast (at least as regards your current speed) if you change something and need to run the lot again.
So cache, profile, and look at your algorithm.
You're calling spotify.audio_features(track_id) for every single track, even if you've already fetched its data. Each Friday's results should only introduce a few new songs, but you're re-fetching information on all 200. Don't do that. Make another database table for song info. After you've fetched a track_id's info, write it to the database. Before fetching a track_id's info, see if you've already stored it in the database. Then you'll only be making the bare minimum of necessary API calls, rather than 200 * num_weeks * num_countries.
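A minimal sketch of such a lookup table, using sqlite3 and JSON from the standard library, might look like the following. The table name audio_features, the helpers open_cache and get_features, and the fake fetcher are all assumptions made for illustration; the real fetcher would wrap the Spotify call.

```python
import json
import sqlite3


def open_cache(path=":memory:"):
    """Open (or create) a cache table keyed by track_id."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS audio_features ("
        "track_id TEXT PRIMARY KEY, payload TEXT NOT NULL)"
    )
    return conn


def get_features(conn, track_id, fetch_one):
    """Return cached features, calling fetch_one only on a cache miss."""
    row = conn.execute(
        "SELECT payload FROM audio_features WHERE track_id = ?", (track_id,)
    ).fetchone()
    if row is not None:
        return json.loads(row[0])
    features = fetch_one(track_id)
    conn.execute(
        "INSERT OR REPLACE INTO audio_features (track_id, payload) VALUES (?, ?)",
        (track_id, json.dumps(features)),
    )
    conn.commit()
    return features


# Demo with a fake fetcher that counts how often it is called.
api_calls = []

def fake_fetch_one(track_id):
    api_calls.append(track_id)
    return {"id": track_id, "danceability": 0.708}

cache = open_cache()
first = get_features(cache, "77lqbary6vt1DSc1MBN6sx", fake_fetch_one)
second = get_features(cache, "77lqbary6vt1DSc1MBN6sx", fake_fetch_one)
print(len(api_calls), first == second)
```

Passing a file path instead of ":memory:" would keep the cache between runs, which is what makes re-running the script cheap.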

How to match asset price data from a csv file to another csv file with relevant news by date

I am researching the impact of news article sentiment related to a financial instrument and its potential effect on the instrument's price. I have tried to take the timestamp of each news item, truncate it to the minute (i.e. remove the second and microsecond components), and get the base share price of the instrument at that time and at several intervals after that time, in our case t+2. However, the program adds the twoM column to the data but does not return any calculated price changes.
Previously, I used Reuters Eikon and its functions to conduct the research, described in the article below.
https://developers.refinitiv.com/article/introduction-news-sentiment-analysis-eikon-data-apis-python-example
However, instead of using data available from Eikon, I would like to use my own csv news file with my own price data from another csv file. I am trying to match the
excel_file = 'C:\\Users\\Artur\\PycharmProjects\\JRA\\sentimenteikonexcel.xlsx'
df = pd.read_excel(excel_file)
sentiment = df.Sentiment
print(sentiment)
start = df['GMT'].min().replace(hour=0,minute=0,second=0,microsecond=0).strftime('%Y/%m/%d')
end = df['GMT'].max().replace(hour=0,minute=0,second=0,microsecond=0).strftime('%Y/%m/%d')
spot_data = 'C:\\Users\\Artur\\Desktop\\stocksss.csv'
spot_price_10 = pd.read_csv(spot_data)
print(spot_price_10)
df['twoM'] = np.nan
for idx, newsDate in enumerate(df['GMT'].values):
    sTime = df['GMT'][idx]
    sTime = sTime.replace(second=0, microsecond=0)
    try:
        t0 = spot_price_10.iloc[spot_price_10.index.get_loc(sTime),2]
        df['twoM'][idx] = ((spot_price_10.iloc[spot_price_10.index.get_loc((sTime + datetime.timedelta(minutes=10))),3]/(t0)-1)*100)
    except:
        pass
print(df)
However, the program is not able to return the twoM price change values.
I assume that you got a warning because you are trying to write through chained indexing. With two sets of [] (one for the column, one for the row), the first lookup may return a temporary copy, so the assignment can be lost. You must use a single loc or iloc call to write a value:
...
    try:
        t0 = spot_price_10.iloc[spot_price_10.index.get_loc(sTime),2]
        df.loc[idx,'twoM'] = ((spot_price_10.iloc[spot_price_10.index.get_loc((sTime + datetime.timedelta(minutes=10))),3]/(t0)-1)*100)
    except:
        pass
...
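Independently of pandas, the matching logic itself can be checked on a small example. The sketch below assumes the prices are available as a mapping from minute-resolution timestamps to a single price, which simplifies the two price columns used above; pct_change_after and the sample values are hypothetical.

```python
from datetime import datetime, timedelta


def pct_change_after(news_time, prices, minutes=10):
    """Percentage price change from the news minute to `minutes` later.

    Returns None when either timestamp is missing from `prices`, instead of
    hiding the problem behind a bare `except`.
    """
    t0 = news_time.replace(second=0, microsecond=0)  # truncate to the minute
    t1 = t0 + timedelta(minutes=minutes)
    if t0 not in prices or t1 not in prices:
        return None
    return (prices[t1] / prices[t0] - 1) * 100


# Made-up prices keyed by minute.
prices = {
    datetime(2020, 1, 2, 9, 30): 100.0,
    datetime(2020, 1, 2, 9, 40): 101.0,
}

change = pct_change_after(datetime(2020, 1, 2, 9, 30, 42, 500), prices)
missing = pct_change_after(datetime(2020, 1, 2, 11, 0, 5), prices)
print(change, missing)
```

Returning None for missing timestamps makes it visible when the news times and the price index never line up, which a silent except: pass would conceal.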

Contact Tracing in Python - working with timeseries

Let's say I have timeseries data (time on the x-axis, coordinates on the y-z plane).
Given a seed set of infected users, I want to fetch all users that are within distance d from the points in the seed set within t time. This is basically just contact tracing.
What is a smart way of accomplishing this?
The naive approach is something like this:
points_at_end_of_iteration = []
for p in seed_set:
    other_ps = find_points_t_time_away(t)
    points_at_end_of_iteration += find_points_d_distance_away_from_set(other_ps)
What is a smarter way of doing this - preferably keeping all data in RAM (though I'm not sure if this is feasible). Is Pandas a good option? I've been thinking about Bandicoot as well, but it doesn't seem to be able to do that for me.
Please let me know if I can improve the question - perhaps it's too broad.
Edit:
I think the algorithm I presented above is flawed.
Is this better:
for user,time,pos in infected_set:
    info = get_next_info(user, time)  # info will be a tuple: (t, pos)
    intersecting_users = find_intersecting_users(user, time, delta_t, pos, delta_pos)  # intersect if close enough to the user's pos/time
    infected_set.add(intersecting_users)
    update_infected_set(user, info)  # change last_time and last_pos (described below)
infected_set I think should actually be a hashmap {user_id: {last_time: ..., last_pos: ...}, user_id2: ...}
One potential problem is that the users are treated independently, so the next timestamp for user2 may be hours or days after user1.
Contact tracing may be easier if I interpolate so that every user has information for every time point (say every hour), though that would increase the amount of data enormously.
Data Format/Sample
user_id = 123
timestamp = 2015-05-01 05:22:25
position = 12.111,-12.111 # lat,long
There is one csv file with all the records:
uid1,timestamp1,position1
uid1,timestamp2,position2
uid2,timestamp3,position3
There is also a directory of files (same format) where each file corresponds to a user.
records/uid1.csv
records/uid2.csv
First solution, with interpolation:
# I would use a shelf (a persistent, dictionary-like object
# included with Python).
import shelve
# hashmap of clean users indexed by timestamp
# { timestamp1: {uid1: (lat11, long11), uid2: (lat12, long12), ...},
#   timestamp2: {uid1: (lat21, long21), uid2: (lat22, long22), ...},
#   ...
# }
#
clean_users = shelve.open("clean_users.dat")
# load data into clean_users from the csv (shelve uses the same syntax as a
# hashmap). You will interpolate the data (only data at a given timestamp
# will be in memory at the same time). Note: the timestamp must be a string
# hashmap of infected users indexed by timestamp (same format as clean_users)
infected_users = shelve.open("infected_users.dat")
# for each iteration (N, the number of time steps, must be defined by you)
for iteration in range(1, N):
    # compute the current timestamp; because we interpolate, each user has a location
    current_timestamp = timestamp_from_iteration(iteration)
    # get clean users for this iteration (in memory)
    current_clean_users = clean_users[current_timestamp]
    # get infected users for this iteration (in memory)
    current_infected_users = infected_users[current_timestamp]
    # new infected users for this iteration
    new_infected_users = dict()
    # compute new infected users for this iteration from current_clean_users and
    # current_infected_users, then store the result in new_infected_users
    # remove users in new_infected_users from clean_users
    # add users in new_infected_users to infected_users
# close the shelves
infected_users.close()
clean_users.close()
Second solution, without interpolation:
# I would use a shelf (a persistent, dictionary-like object
# included with Python).
import shelve
# hashmap of clean users indexed by timestamp
# { timestamp1: {uid1: (lat11, long11), uid2: (lat12, long12), ...},
#   timestamp2: {uid1: (lat21, long21), uid2: (lat22, long22), ...},
#   ...
# }
#
clean_users = shelve.open("clean_users.dat")
# load data into clean_users from the csv (shelve uses the same syntax as a
# hashmap). Note: the timestamp must be a string
# hashmap of infected users indexed by timestamp (same format as clean_users)
infected_users = shelve.open("infected_users.dat")
# for each iteration (not time related, unlike the previous version)
# could also stop when an iteration produces no new infected users
for iteration in range(1, N):
    # new infected users for this iteration
    new_infected_users = dict()
    # get timestamps from infected_users
    for an_infected_timestamp in infected_users.keys():
        # get infected users for this timestamp
        current_infected_users = infected_users[an_infected_timestamp]
        # get relevant timestamps from clean users
        for a_clean_timestamp in clean_users.keys():
            if time_stamp_in_delta(an_infected_timestamp, a_clean_timestamp):
                # get clean users for this clean timestamp
                current_clean_users = clean_users[a_clean_timestamp]
                # compute infected users from current_clean_users and
                # current_infected_users, then append the result to
                # new_infected_users
    # remove users in new_infected_users from clean_users
    # add users in new_infected_users to infected_users
# close the shelves
infected_users.close()
clean_users.close()
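For data that fits in RAM, a single chronological pass with a sliding time window is one possible in-memory alternative to the shelve outlines above. The sketch below is a simplification: it only flags a user whose record falls within max_dt after, and within max_dist_m of, a record left by an already-infected user. The names trace_contacts and haversine_m and all sample records are invented for illustration.

```python
import math
from collections import deque
from datetime import datetime, timedelta


def haversine_m(a, b):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371000 * math.asin(math.sqrt(h))


def trace_contacts(records, seeds, max_dist_m, max_dt):
    """records: iterable of (user_id, timestamp, (lat, lon)). Returns infected user ids."""
    infected = set(seeds)
    window = deque()  # (timestamp, position) of recent records by infected users
    for user, when, pos in sorted(records, key=lambda r: r[1]):
        # Drop infected records that are too old to matter for this record.
        while window and when - window[0][0] > max_dt:
            window.popleft()
        if user not in infected and any(
            haversine_m(pos, p) <= max_dist_m for _, p in window
        ):
            infected.add(user)
        if user in infected:
            window.append((when, pos))
    return infected


t0 = datetime(2015, 5, 1, 5, 0, 0)
records = [
    ("A", t0, (12.1110, -12.111)),                          # seed
    ("B", t0 + timedelta(minutes=10), (12.1111, -12.111)),  # close to A in time and space
    ("D", t0 + timedelta(minutes=30), (13.0000, -12.000)),  # far away
    ("E", t0 + timedelta(minutes=65), (12.1112, -12.111)),  # close to B only
    ("C", t0 + timedelta(hours=3), (12.1111, -12.111)),     # same place, too late
]
result = trace_contacts(records, {"A"}, max_dist_m=50, max_dt=timedelta(hours=1))
print(sorted(result))
```

The distance check is still linear in the number of records in the window, so for dense data a spatial index or grid bucketing would be a natural next step.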
