Contact Tracing in Python - working with timeseries - python

Let's say I have timeseries data (time on the x-axis, coordinates on the y-z plane).
Given a seed set of infected users, I want to fetch all users that are within distance d from the points in the seed set within t time. This is basically just contact tracing.
What is a smart way of accomplishing this?
The naive approach is something like this:
points_at_end_of_iteration = []
for p in seed_set:
    other_ps = find_points_t_time_away(t)
    points_at_end_of_iteration += find_points_d_distance_away_from_set(other_ps)
What is a smarter way of doing this - preferably keeping all data in RAM (though I'm not sure if this is feasible). Is Pandas a good option? I've been thinking about Bandicoot as well, but it doesn't seem to be able to do that for me.
Please let me know if I can improve the question - perhaps it's too broad.
Edit:
I think the algorithm I presented above is flawed.
Is this better:
for user, time, pos in infected_set:
    info = get_next_info(user, time)  # info will be a tuple: (t, pos)
    intersecting_users = find_intersecting_users(user, time, delta_t, pos, delta_pos)  # intersect if close enough to the user's pos/time
    infected_set.add(intersecting_users)
    update_infected_set(user, info)  # change last_time and last_pos (described below)
I think infected_set should actually be a hashmap: {user_id: {last_time: ..., last_pos: ...}, user_id2: ...}
One potential problem is that the users are treated independently, so the next timestamp for user2 may be hours or days after user1.
Contact tracing may be easier if I interpolate so that every user has information for every time point (say every hour), though that would increase the amount of data by a huge amount (see the sketch after the data sample below).
Data Format/Sample
user_id = 123
timestamp = 2015-05-01 05:22:25
position = 12.111,-12.111 # lat,long
There is one csv file with all the records:
uid1,timestamp1,position1
uid1,timestamp2,position2
uid2,timestamp3,position3
There is also a directory of files (same format) where each file corresponds to a user.
records/uid1.csv
records/uid2.csv
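Given records in that format, a minimal sketch of the hourly-interpolation idea mentioned above might look like the following with pandas. The column names and the "records.csv" path are assumptions for illustration, and the position field is assumed to have been split into separate lat/long columns:
import pandas as pd

# Load the records; names here are illustrative, not the actual schema.
df = pd.read_csv("records.csv", names=["user_id", "timestamp", "lat", "long"],
                 parse_dates=["timestamp"])

# Resample each user's track onto an hourly grid and linearly interpolate
# the coordinates between recorded points.
hourly = (
    df.set_index("timestamp")
      .groupby("user_id")[["lat", "long"]]
      .apply(lambda g: g.resample("1H").mean().interpolate("linear"))
      .reset_index()
)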

First solution, with interpolation:
# I would use a shelf (a persistent, dictionary-like object,
# included with Python).
import shelve
# hashmap of clean users indexed by timestamp:
# { timestamp1: {uid1: (lat11, long11), uid2: (lat12, long12), ...},
#   timestamp2: {uid1: (lat21, long21), uid2: (lat22, long22), ...},
#   ...
# }
clean_users = shelve.open("clean_users.dat")
# Load data into clean_users from the csv (a shelf uses the same syntax as a
# dict/hashmap). You will interpolate the data (only data at a given timestamp
# will be in memory at the same time). Note: the timestamp must be a string.
# hashmap of infected users indexed by timestamp (same format as clean_users)
infected_users = shelve.open("infected_users.dat")
# for each iteration
for iteration in range(1, N):
    # compute the current timestamp; because we interpolate, each user has a location
    current_timestamp = timestamp_from_iteration(iteration)
    # get clean users for this iteration (in memory)
    current_clean_users = clean_users[current_timestamp]
    # get infected users for this iteration (in memory)
    current_infected_users = infected_users[current_timestamp]
    # new infected users for this iteration
    new_infected_users = dict()
    # compute new_infected_users for this iteration from current_clean_users and
    # current_infected_users, then store the result in new_infected_users
    # remove users in new_infected_users from clean_users
    # add users in new_infected_users to infected_users
# close the shelves
infected_users.close()
clean_users.close()
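The "compute new infected users" step above is only a comment; a hedged sketch of one way to fill it in, assuming scipy is available and treating lat/long as planar coordinates (only a rough approximation of real distances, with d expressed in the same units as the coordinates), could be:
import numpy as np
from scipy.spatial import cKDTree

def newly_infected(current_clean_users, current_infected_users, d):
    """Return {uid: (lat, long)} of clean users within distance d of any infected user."""
    if not current_clean_users or not current_infected_users:
        return {}
    clean_ids = list(current_clean_users)
    clean_pts = np.array([current_clean_users[uid] for uid in clean_ids])
    infected_pts = np.array(list(current_infected_users.values()))
    # KD-tree over the clean users, queried once per infected position
    tree = cKDTree(clean_pts)
    hit_indices = set()
    for neighbours in tree.query_ball_point(infected_pts, r=d):
        hit_indices.update(neighbours)
    return {clean_ids[i]: tuple(clean_pts[i]) for i in hit_indices}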
Second solution, without interpolation:
# I would use a shelf (a persistent, dictionary-like object,
# included with Python).
import shelve
# hashmap of clean users indexed by timestamp:
# { timestamp1: {uid1: (lat11, long11), uid2: (lat12, long12), ...},
#   timestamp2: {uid1: (lat21, long21), uid2: (lat22, long22), ...},
#   ...
# }
clean_users = shelve.open("clean_users.dat")
# Load data into clean_users from the csv (a shelf uses the same syntax as a
# dict/hashmap). Note: the timestamp must be a string.
# hashmap of infected users indexed by timestamp (same format as clean_users)
infected_users = shelve.open("infected_users.dat")
# for each iteration (not time related as in the previous version);
# could also stop when there are no new infected users in the iteration
for iteration in range(1, N):
    # new infected users for this iteration
    new_infected_users = dict()
    # get timestamps from infected_users
    for an_infected_timestamp in infected_users.keys():
        # get infected users for this timestamp
        current_infected_users = infected_users[an_infected_timestamp]
        # get relevant timestamps from clean users
        for a_clean_timestamp in clean_users.keys():
            if time_stamp_in_delta(an_infected_timestamp, a_clean_timestamp):
                # get clean users for this clean timestamp
                current_clean_users = clean_users[a_clean_timestamp]
                # compute infected users from current_clean_users and
                # current_infected_users, then append the result to
                # new_infected_users
    # remove users in new_infected_users from clean_users
    # add users in new_infected_users to infected_users
# close the shelves
infected_users.close()
clean_users.close()
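The time_stamp_in_delta helper is not defined above; a minimal sketch, assuming timestamps are stored as strings in the format of the data sample and an arbitrarily chosen one-hour window, could be:
from datetime import datetime, timedelta

def time_stamp_in_delta(infected_ts, clean_ts, delta=timedelta(hours=1)):
    """True if the two timestamp strings are within `delta` of each other."""
    fmt = "%Y-%m-%d %H:%M:%S"  # matches the sample: 2015-05-01 05:22:25
    return abs(datetime.strptime(infected_ts, fmt) - datetime.strptime(clean_ts, fmt)) <= delta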

Related

Python good practice with NetCDF4 shared dimensions across groups

This question is conceptual rather than about a direct error.
I am working with the Python netCDF4 API to translate and store binary datagram packets from multiple sensors packaged in a single file. My question is about the scope of dimensions and best-use practices.
According to the Netcdf4 convention and metadata docs, dimension scope is such that all child groups have access to a dimension defined in the parent group (http://cfconventions.org/Data/cf-conventions/cf-conventions-1.9/cf-conventions.html#_scope).
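As a small illustration of that scoping rule (file and variable names below are made up for this sketch), a variable in a child group can be created against a dimension defined at the root:
import netCDF4

ds = netCDF4.Dataset("scope_demo.nc", mode="w")
ds.createDimension("time", None)          # unlimited dimension defined at the root
grp = ds.createGroup("child_group")
# Dimension lookup walks up the group tree, so 'time' resolves to the root dimension.
var = grp.createVariable("sensor_data", "f8", dimensions=("time",))
var[:] = [1.0, 2.0, 3.0]                  # writing grows the shared unlimited dimension
ds.close()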
Context:
The data packets from the multiple sensors are written to a binary file. Timing adjustments are handled prior to writing the binary file, so we can trust the time stamp of a data packet. Time sampling rates are not synchronous: sensor 1 samples at, say, 1 Hz and sensor 2 at 100 Hz. Sensors 1 and 2 measure a number of different variables.
Questions:
Do I define a single, unlimited time dimension at the root level and create multiple variables using that dimension, or create individual time dimensions at the group level? Pseudo-code below.
In setting up the netcdf I would use the following code:
import netCDF4
data_set = netCDF4.Dataset(file_name, mode='w')
dim_time = data_set.createDimension(dimname='time', size=None)
grp_1 = data_set.createGroup('grp1')
var_time_1 = grp_1.createVariable(varname='sensor_1_time', datatype='f8', dimensions=('time',))
var_time_1[:] = sensor_1_time  # timestamp of data values from sensor_1
var_1 = grp_1.createVariable(varname='sensor_1_data', datatype='f8', dimensions=('time',))
var_1[:] = sensor_1_data  # data values from sensor 1
grp_2 = data_set.createGroup('grp2')
var_time_2 = grp_2.createVariable(varname='sensor_2_time', datatype='f8', dimensions=('time',))
var_time_2[:] = sensor_2_time
var_2 = grp_2.createVariable(varname='sensor_2_data', datatype='f8', dimensions=('time',))
var_2[:] = sensor_2_data  # data values from sensor 2
The group separation is not necessarily by sensor but by logical data grouping. In the case that data from two sensors falls into multiple groups, is it best to replicate the time array in each group, or is it acceptable to reference other groups using the scope mechanism?
import netCDF4
data_set = netCDF4.Dataset(file_name, mode='w')
dim_time = data_set.createDimension(dimname='time', size=None)
grp_env = data_set.createGroup('env_data')
sensor_time_1 = grp_env.createVariable(varname='sensor_1_time', datatype='f8', dimensions=('time',))
sensor_time_1[:] = sensor_1_time  # timestamp of data values from sensor_1
env_1 = grp_env.createVariable(varname='sensor_1_data', datatype='f8', dimensions=('time',))
env_1[:] = sensor_1_data  # data values from sensor 1
env_2 = grp_env.createVariable(varname='sensor_2_data', datatype='f8', dimensions=('time',))
env_2.coordinates = "/grp_platform/sensor_time_1"
grp_platform = data_set.createGroup('platform')
sensor_time_2 = grp_platform.createVariable(varname='sensor_2_time', datatype='f8', dimensions=('time',))
sensor_time_2[:] = sensor_2_time
plt_2 = grp_platform.createVariable(varname='sensor_2_data', datatype='f8', dimensions=('time',))
plt_2[:] = sensor_2_data  # data values from sensor 2
Most examples do not deal with this cross-group functionality, and I can't seem to find the best practices. I'd love some advice, or even a push in the right direction.

Given a table with events happening on a specific time; how do I divide the time into frequencies and count them

I've been given a table with car accidents and the time at which they occurred (format HH:MM).
I want to create a new column in the DF where the time is an interval of 15 minutes (ex. 10:30 - 10:45).
With that information, I want to create a count plot that counts the number of accidents for each time interval.
How can I do this?
First you have to use the datetime library:
import datetime
To get your time in the right format just use the split() function
hour = your_hour_from_table.split(':')
And you will get a list where hour[0] is HH and hour[1] is MM.
Then create a list with all your time slots :
time_slots = ["10:00 - 10:15", "10:15 - 10:30",...]
And also a dictionary with the time slots as keys (use the same string as in time_slots list) and 0 as initial value:
hour_dictionary = {"10:00 - 10:15":0, "10:15 - 10:30":0,...}
Use a for-loop to verify in which time slot your accident falls (assuming the accident time is stored in the accident_hour variable):
def verify_time_slot(accident_hour):
    for i in time_slots:
        # datetime.time() needs integers, hence the int() conversions
        slot_first_hour = int(i.split(' - ')[0].split(':')[0])
        slot_first_minute = int(i.split(' - ')[0].split(':')[1])
        slot_second_hour = int(i.split(' - ')[1].split(':')[0])
        slot_second_minute = int(i.split(' - ')[1].split(':')[1])
        hour_accident = int(accident_hour.split(':')[0])
        minute_accident = int(accident_hour.split(':')[1])
        if datetime.time(slot_first_hour, slot_first_minute, 0) <= datetime.time(hour_accident, minute_accident, 0) <= datetime.time(slot_second_hour, slot_second_minute, 0):
            hour_dictionary[i] += 1
So with this function you can check the time slots and count the accidents that fall into each one. Then use another for-loop to do this for every accident_hour you have.
I'll let you finish the exercise yourself and put all this information into a table; I think the main algorithm is already in your hands.
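Since the question mentions a DataFrame, a more pandas-centric sketch is to floor each time to the start of its 15-minute slot and let a count plot do the counting. The df and "time" names and the use of seaborn's countplot are assumptions for illustration:
import pandas as pd
import seaborn as sns

# Parse the HH:MM strings and floor each one to the start of its 15-minute slot.
times = pd.to_datetime(df["time"], format="%H:%M")
slot_start = times.dt.floor("15min")
df["interval"] = (
    slot_start.dt.strftime("%H:%M")
    + " - "
    + (slot_start + pd.Timedelta(minutes=15)).dt.strftime("%H:%M")
)

# One bar per 15-minute interval, height = number of accidents in it.
sns.countplot(x="interval", data=df, order=sorted(df["interval"].unique()))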

How to speed up Spotipy API calls for millions of records?

I'm attempting to get the audio feature data for about 4.5 years' worth of Spotify Top 200 Charts. It's for 68 countries + global ranking, so about 20 million records in all. I'm querying a SQLite database with all of that data.
This is prep for a data analysis project and I've currently limited my scope to just pulling the 3rd Friday of every month because the fastest time I could get pulling an entire day's worth of audio features for the charts is 15.8 minutes. That's 18.5 days of straight processing to get all 1701 days.
Does anyone see any way I could make this faster? I'm currently calling the spotipy.audio_features() function for each track id. The function is limited to 100 ids and I'm not so sure that would be much faster anyway.
Here's an example entry before processing:
column_names = ['title', 'rank', 'date', 'artist', 'url', 'region', 'chart', 'trend', 'streams']
('You Were Right', 179, '2017-01-20', 'RÜFÜS DU SOL', 'https://open.spotify.com/track/77lqbary6vt1DSc1MBN6sx', 'Australia', 'top200', 'NEW_ENTRY', 14781)
And after processing:
column_names = ['title', 'rank', 'date', 'artist', 'url', 'region', 'chart', 'trend', 'streams', 'track_id', 'danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo', 'duration_ms', 'time_signature']
('You Were Right', 179, '2017-01-20', 'RÜFÜS DU SOL', 'https://open.spotify.com/track/77lqbary6vt1DSc1MBN6sx', 'Australia', 'top200', 'NEW_ENTRY', 14781, '77lqbary6vt1DSc1MBN6sx', 0.708, 0.793, 5, -5.426, 0, 0.0342, 0.0136, 0.00221, 0.118, 0.734, 122.006, 239418, 4)
Full Script:
import sqlite3
import os
import spotipy
import numpy as np
import pandas as pd
from spotipy.oauth2 import SpotifyClientCredentials
from requests.exceptions import ReadTimeout
from datetime import datetime

"""Gets the third Friday of each month and checks that the date exists in the database."""
def date_range_checker(cursor, start_date, end_date):
    # Put in the range for that year. It's till 2021.
    date_range = pd.date_range(start_date, end_date, freq='WOM-3FRI')
    cursor.execute("""SELECT DISTINCT Date(date) FROM charts""")
    sql_date_fetch = cursor.fetchall()
    sql_dates = [r[0] for r in sql_date_fetch]
    validated_dates = []
    for date in date_range:
        # print(str(date)[0:-9])
        if str(date)[0:-9] in sql_dates:
            validated_dates.append(str(date)[0:-9])
    return validated_dates

"""Connects to the database. For each date in validated_dates, it queries all the records with that date.
Then splits the track IDs from the Spotify link into a new list of tuples. Then for each tuple in that list, it calls the Spotify API with the track ID.
Finally it creates a numpy array for the entire list so the csv converter can be used."""
def main():
    now_start = datetime.now()
    start_time = now_start.strftime("%H:%M:%S")
    print(f'Main Function - start time: {start_time}')
    print("working on it...")
    dbname = 'charts.db'
    if os.path.exists(dbname):
        db = sqlite3.connect(dbname, isolation_level=None)
        cursor = db.cursor()
        """Selects 3rd Friday of the month because it takes about 15.8 minutes per day. That's 14.2 hours total to get one Friday a month for all 4.5 years.
        Or 18.6 full days of processing for every single day for all 1701 days.
        Fridays are a preferable release day in the industry. Cite this later."""
        # Date range list created and checked in this function
        validated_dates = date_range_checker(cursor, '2017-02-01', '2017-12-31')  # change year here
        column_names = ['title', 'rank', 'date', 'artist', 'url', 'region', 'chart', 'trend', 'streams', 'track_id', 'danceability',
                        'energy', 'key', 'loudness', 'mode', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo',
                        'duration_ms', 'time_signature']
        for date_chosen in validated_dates:
            cursor.execute("""SELECT * FROM charts WHERE Date("date") = ?""", (date_chosen,))
            db_result = cursor.fetchall()
            data_with_track_ids = []
            final_data = []
            # Splits ID from Spotify link.
            for entry in db_result:
                track_id = entry[4].split('/')[-1]
                entry += (track_id,)
                data_with_track_ids.append(entry)
            print("I've got all the track IDs. Will start calls to Spotify API now.")
            # Calls to Spotify with the newly extracted track_id
            for entry in data_with_track_ids:
                track_id = entry[-1]
                try:
                    audio_features = spotify.audio_features(track_id)
                except ReadTimeout:
                    print('Spotify timed out... trying again...')
                    audio_features = spotify.audio_features(track_id)
                entry += (audio_features[0]['danceability'], audio_features[0]['energy'], audio_features[0]['key'],
                          audio_features[0]['loudness'], audio_features[0]['mode'], audio_features[0]['speechiness'], audio_features[0]['acousticness'],
                          audio_features[0]['instrumentalness'], audio_features[0]['liveness'],
                          audio_features[0]['valence'], audio_features[0]['tempo'], audio_features[0]['duration_ms'], audio_features[0]['time_signature'])
                final_data.append(entry)
            np_data = np.array(final_data)
            my_dataframe = pd.DataFrame(np_data, columns=column_names)
            my_dataframe.to_csv(f'spotify_csv_data/spotify_top_200 {date_chosen}.csv')
            now_end = datetime.now()
            end_time = now_end.strftime("%H:%M:%S")
            print(f'Main Function - Start time: {start_time}. End time: {end_time}.')
            print(f'The date {date_chosen} took {now_end - now_start} to run.')
        db.close()

if __name__ == "__main__":
    now_start = datetime.now()
    start_time = now_start.strftime("%H:%M:%S")
    print(f'Script - start time: {start_time}')
    os.environ['SPOTIPY_CLIENT_ID'] = 'ENTER YOUR CLIENT_ID'
    os.environ['SPOTIPY_CLIENT_SECRET'] = 'ENTER YOUR CLIENT_SECRET'
    # Allows for retries. Seems to be enough that it doesn't crash.
    spotify = spotipy.Spotify(client_credentials_manager=SpotifyClientCredentials(), requests_timeout=10, retries=10)
    """Leave above set."""
    main()
    now_end = datetime.now()
    end_time = now_end.strftime("%H:%M:%S")
    print(f'Script - Start time: {start_time}. End time: {end_time}.')
    print(f'This script took {now_end - now_start} to run.\n')
A few ideas to improve performance:
Use parallel processing
Since you are using plain Python, the code runs single-threaded in a single process.
Using Python's multiprocessing library, you could (for example) run 4 instances of the same code with evenly divided start/end dates. This can make your data processing roughly 4x faster. You would just need to write the data in such a way that there is no overlap.
Note: If you are rate limited by the Spotify API (you most likely will be), you can use a different API key for each instance. (Make multiple accounts or borrow a friend's API key.)
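A hedged sketch of that idea with multiprocessing, where process_dates stands in for the per-date work currently done inside main() (the function body and the date values below are placeholders, not the original code):
from multiprocessing import Pool

def process_dates(dates):
    # Placeholder: for each date, query the db, call the Spotify API and write a CSV,
    # exactly as the inner loop of main() does today.
    for date_chosen in dates:
        print(f"processing {date_chosen}")

if __name__ == "__main__":
    validated_dates = ["2017-02-17", "2017-03-17", "2017-04-21"]  # illustrative only
    chunks = [validated_dates[i::4] for i in range(4)]            # 4 roughly even slices
    with Pool(processes=4) as pool:
        pool.map(process_dates, chunks)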
SQL query optimizations
It's worth looking into your queries to see what is slowing things down. I'm personally not that familiar with SQL, just giving you ideas.
Profile your program to understand more.
See How can you profile a Python script?
Use some sort of caching technique to avoid redundant API calls and to avoid populating duplicate data. (See a potential solution below, in the last block of code using ids_seen.)
# Splits ID from Spotify link.
for entry in db_result:
    track_id = entry[4].split('/')[-1]
    entry += (track_id,)
    data_with_track_ids.append(entry)
In this code, what type is entry? How big is db_result?
Another thing worth mentioning regarding your following code:
# Calls to Spotify with the newly extracted track_id
for entry in data_with_track_ids:
    track_id = entry[-1]
    try:
        audio_features = spotify.audio_features(track_id)
    except ReadTimeout:
        print('Spotify timed out... trying again...')
        audio_features = spotify.audio_features(track_id)
    entry += (audio_features[0]['danceability'], audio_features[0]['energy'], audio_features[0]['key'],
              audio_features[0]['loudness'], audio_features[0]['mode'], audio_features[0]['speechiness'], audio_features[0]['acousticness'],
              audio_features[0]['instrumentalness'], audio_features[0]['liveness'],
              audio_features[0]['valence'], audio_features[0]['tempo'], audio_features[0]['duration_ms'], audio_features[0]['time_signature'])
    final_data.append(entry)
In the try-except block, you are making a request for every entry in data_with_track_ids. How many elements are in the data_with_track_ids data structure? Expect to be throttled and timed out by Spotify's servers if you brute-force API calls.
You should add a short wait period after timing out to reduce the chances of getting rate-limited or IP banned. Oh wait, it looks like when you initialize the spotify variable, the retries are set and handled automatically under the hood in the spotipy source code.
EDIT
Here is a way you can avoid making redundant requests by using Python's set data structure. This can serve as your "cache":
from spotipy.exceptions import SpotifyException  # needed for the except clause below

# Calls to Spotify with the newly extracted track_id
ids_seen = set()
for entry in data_with_track_ids:
    track_id = entry[-1]
    if track_id not in ids_seen:
        try:
            # retries are already built-in and defined in your __main__(), spotify variable
            audio_features = spotify.audio_features(track_id)
        except SpotifyException as se:
            print('Spotify timed out... Maximum retries exceeded... moving on to next track_id...')
            print("TRACK ID IS: {}".format(track_id))
            print("Error details: {}".format(se))
            ids_seen.add(track_id)
            continue
        # on success, add track id to ids seen
        ids_seen.add(track_id)
    else:
        print("We have seen this ID before... ID = {}".format(track_id))
        continue  # skips the rest of the loop body and starts again at the top, next iteration
    entry += (audio_features[0]['danceability'], audio_features[0]['energy'], audio_features[0]['key'],
              audio_features[0]['loudness'], audio_features[0]['mode'], audio_features[0]['speechiness'], audio_features[0]['acousticness'],
              audio_features[0]['instrumentalness'], audio_features[0]['liveness'],
              audio_features[0]['valence'], audio_features[0]['tempo'], audio_features[0]['duration_ms'], audio_features[0]['time_signature'])
    final_data.append(entry)
If you are limited to 1000 requests/day, then simply sleep the program for 24 hours or stop the program (saving the current iteration and data context), and run it again once you are allowed more requests.
See https://developer.spotify.com/documentation/web-api/guides/rate-limits/
Profile, profile, profile. But the bottleneck is likely Spotify's API. Whilst you can parallelise to speed up fetching, they won't thank you much for it and you will likely find yourself rate-limited if you do it too much. So profile and see what is taking the time, but be prepared to cut back on your dataset.
Ask yourself what you can do to speed up the algorithm:
can you just fetch the top N hits?
do you really need all that data?
is any data duplicated?
Even if data isn't duplicated, create a local cache indexed by track_id, and store every request's result in it. Rather than requesting from the Spotify endpoint, look it up in the cache first (store the data in another SQLite database, or another table in the same db). If nothing is returned, fetch, save the data to the cache, and then return it. That way:
if you are doing redundant lookups, it will be faster.
even if you aren't, you can re-run your code blazingly fast (at least as regards your current speed) if you change something and need to run the lot again.
So cache, profile, and look at your algorithm.
You're calling spotify.audio_features(track_id) for every single track, even if you've already fetched its data. Each Friday's results should only introduce a few new songs, but you're re-fetching information on all 200. Don't do that. Make another database table for song info. After you've fetched a track_id's info, write it to the database. Before fetching a track_id's info, see if you've already stored it in the database. Then you'll only be making the bare minimum of necessary API calls, rather than 200 * num_weeks * num_countries.
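A minimal sketch of that check-the-database-first pattern, assuming a new cache table (the table name, the JSON storage and the helper below are illustrative, not part of the original schema):
import json
import sqlite3
import spotipy

def get_audio_features(db: sqlite3.Connection, spotify: spotipy.Spotify, track_id: str) -> dict:
    """Fetch audio features for track_id, calling the Spotify API only on a cache miss."""
    db.execute("CREATE TABLE IF NOT EXISTS audio_features_cache "
               "(track_id TEXT PRIMARY KEY, features_json TEXT)")
    row = db.execute("SELECT features_json FROM audio_features_cache WHERE track_id = ?",
                     (track_id,)).fetchone()
    if row is not None:
        return json.loads(row[0])                      # cache hit: no API call
    features = spotify.audio_features(track_id)[0]     # cache miss: one API call
    db.execute("INSERT INTO audio_features_cache VALUES (?, ?)",
               (track_id, json.dumps(features)))
    return features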

Is there a way I can get the new column value based on previous rows without using loop in Python?

I have a data table which has two columns time in and time out as shown below.
TimeIn TimeOut
01:23AM 01:45AM
01:34AM 01:53AM
01:43AM 01:59AM
02:01AM 02:09AM
02:34AM 03:11AM
02:39AM 02:48AM
02:56AM 03:12AM
I need to create a third column named 'Counter' which updates in such a way that when the TimeIn of the ith occurrence is more than the TimeOut of the (i-1)th, that counter remains the same, else it increases by 1. Consider it as people assigned to a task: if a person is free after his/her time out then he/she can take up the next job. Also, if at a particular instant more than one counter is free, I need to take the first one that became free, so the above table would look like this:
TimeIn TimeOut Counter
01:23AM 01:45AM 1
01:34AM 01:53AM 2
01:43AM 01:59AM 3
02:01AM 02:09AM 1 (in this case 1,2,3 all are also free but 1 became free first)
02:34AM 03:11AM 2 (in this case 1,2,3 all are also free but 2 became free first)
02:39AM 02:48AM 3 (in this case 1 is also free but 3 became free first)
02:56AM 03:12AM 1 (in this case 3 is also free but 1 became free first)
I was hoping there could be a way to do this in pandas without a loop, since my dataset could be large, but please let me know even if there is a way to achieve it efficiently using a loop.
Many thanks in advance.
I couldn't figure out an efficient way with native pandas methods. But if I'm not completely mistaken, a heap queue seems to be an adequate tool for the problem.
With
df =
TimeIn TimeOut
0 01:23AM 01:45AM
1 01:34AM 01:53AM
2 01:43AM 01:59AM
3 02:01AM 02:09AM
4 02:34AM 03:11AM
5 02:39AM 02:48AM
6 02:56AM 03:12AM
and
for col in ("TimeIn", "TimeOut"):
df[col] = pd.to_datetime(df[col])
this
from heapq import heappush, heappop

w_count = 1
counter = [1]
heap = []
w_time_out, w = df.TimeOut[0], 1
for time_in, time_out in zip(
    df.TimeIn.tolist()[1:], df.TimeOut.tolist()[1:]
):
    if time_in > w_time_out:
        heappush(heap, (time_out, w))
        counter.append(w)
        w_time_out, w = heappop(heap)
    else:
        w_count += 1
        counter.append(w_count)
        if time_out > w_time_out:
            heappush(heap, (time_out, w_count))
        else:
            heappush(heap, (w_time_out, w))
            w_time_out, w = time_out, w_count
produces the counter-list
[1, 2, 3, 1, 2, 3, 1]
Regarding your input data: you don't have complete timestamps, so pd.to_datetime uses the current day as the date part. So if the range of your times isn't contained within one day, you'll run into trouble.
EDIT: Fixed a mistake in the last else-branch.
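If useful, that list can be attached directly as the requested column (a small follow-up, not part of the answer's original code):
df["Counter"] = counter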
For the sake of completeness, I'm including a pandas/numpy based solution. Performance is roughly 3x better (I saw 12s vs 34s for 10 million records) than the heapq-based one, but the implementation is significantly harder to follow. Unless you really need the performance, I'd recommend @Timus' solution.
The idea here is:
We identify sessions where we have to increment the counter. We can immediately assign counter values to these sessions.
For the remaining sessions, we create a sequence of sessions that the same worker handles. We can then map any session to a "root session" where the worker was created.
To accomplish step (2):
We get two lists of session IDs, one sorted by start time and the other by end time.
Pair each session start with the least recent session end. This corresponds to the earliest available worker taking on the next incoming request.
Work up the tree to map any given session to the first session handled by that worker.
from io import StringIO

import numpy as np
import pandas as pd

# setup
text = StringIO(
    """
TimeIn TimeOut
01:23AM 01:45AM
01:34AM 01:53AM
01:43AM 01:59AM
02:01AM 02:09AM
02:34AM 03:11AM
02:39AM 02:48AM
02:56AM 03:12AM
""".strip()
)
sessions = pd.read_csv(text, sep=" ", parse_dates=["TimeIn", "TimeOut"])

# transform the data from wide format to long format
# event_log has the following columns:
# - Session: corresponding to the index of the input data
# - EventType: either TimeIn or TimeOut
# - EventTime: the event's time value
event_log = pd.melt(
    sessions.rename_axis(index="Session").reset_index(),
    id_vars=["Session"],
    value_vars=["TimeIn", "TimeOut"],
    var_name="EventType",
    value_name="EventTime",
)
# sort the entire log by time
event_log.sort_values("EventTime", inplace=True, kind="mergesort")

# concurrency is the number of active workers at the time of that log entry
concurrency = event_log["EventType"].replace({"TimeIn": 1, "TimeOut": -1}).cumsum()
# new workers occur when the running maximum concurrency increases
new_worker = concurrency.cummax().diff().astype(bool)
new_worker_sessions = event_log.loc[new_worker, "Session"]
root_session = np.empty_like(sessions.index)
root_session[new_worker_sessions] = new_worker_sessions

# we could use the `sessions` DataFrame to avoid searching, but we'd need to sort on TimeOut
new_session = event_log.query("~@new_worker & (EventType == 'TimeIn')")["Session"]
old_session = event_log.query("~@new_worker & (EventType == 'TimeOut')")["Session"]
# Pair each session start with the session that ended least recently
root_session[new_session] = old_session[: new_session.shape[0]]

# Find the root session
# maybe something can be optimized here?
while not np.array_equal((_root_session := root_session.take(root_session)), root_session):
    root_session = _root_session

counter = np.empty_like(root_session)
counter[new_worker_sessions] = np.arange(start=1, stop=new_worker_sessions.shape[0] + 1)
sessions["Counter"] = counter.take(root_session)
Quick bit of code to generate more fake data:
N = 10 ** 6
start = pd.Timestamp("2021-08-12T01:23:00")
_base = pd.date_range(start=start, periods=N, freq=pd.Timedelta(1, "seconds"))
time_in = (
    _base.values
    + np.random.exponential(1000, size=N).astype("timedelta64[ms]")
    + np.random.exponential(10000, size=N).astype("timedelta64[ns]")
    + np.timedelta64(1, "ms")
)
time_out = (
    time_in
    + np.random.exponential(10, size=N).astype("timedelta64[s]")
    + np.random.exponential(1000, size=N).astype("timedelta64[ms]")
    + np.random.exponential(10000, size=N).astype("timedelta64[ns]")
    + np.timedelta64(1, "s")
)
sessions = (
    pd.DataFrame({"TimeIn": time_in, "TimeOut": time_out})
    .sort_values("TimeIn")
    .reset_index(drop=True)
)

How to process this data to create a stacked bar chart?

So I am writing a program that monitors and records your usage time of foreground applications and saves it in a SQL database. I then want to retrieve the data from previous days and compile it all together into a stacked bar chart. Here, the x-axis will have the different days over which usage was recorded, and the various stacks in each bar will represent each app that was used.
In my program, I created 2 tables, one to record each day's app usage (with each new day's data having a different primary key id), and another table to record the primary key for each day.
Table 1:
_id  Application    usage_time
0    Google Chrome  245.283942928347
1    PyCharm        450.3939754962921
1    SQLiteStudio   140.2376308441162
1    Google Chrome  5.008131980896
Table 2:
Date                        daily_id
2021-07-18 07:25:25.376734  0
2021-07-18 07:27:57.419574  1
Within my stacked bar chart program, I have come up with this code to refine the data to put into the stacked bar chart:
conn = sqlite3.connect('daily_usage_monitor.sqlite', detect_types=sqlite3.PARSE_DECLTYPES)
all_app_data = conn.execute('SELECT all_usage_information.date, monitor.application, monitor.usage_time FROM all_usage_information INNER JOIN monitor ON all_usage_information.daily_id = monitor._id ORDER BY all_usage_information.date, monitor.usage_time ASC').fetchall()
for date, app, usage_time in all_app_data:
    print(f'{date} - {app}: {usage_time}')
conn.close()

daily_data = {}
# Create nested dictionary - key = each date, value = dictionary of different apps & their time usage durations
for date, app, time in all_app_data:
    conditions = [date not in daily_data, app != 'loginwindow']
    if all(conditions):
        daily_data[date] = {app: time}
    elif not conditions[0] and conditions[1]:
        daily_data[date].update({app: time})
print(daily_data)  # TODO: REMOVE AFTER TESTING

total_time = 0
# Club any applications that account for <5% of total time into 1 category called 'Other'
for date, app_usages in daily_data.items():
    total_time = sum(time for app, time in app_usages.items())
    refined_data = {}
    for key, value in app_usages.items():
        if value / total_time < 0.05:
            refined_data['Others'] = refined_data.setdefault('Others', 0) + value
        else:
            refined_data[key] = value
    daily_data[date] = refined_data
print(daily_data)  # TODO: REMOVE AFTER TESTING

# Takes the nested dictionary and breaks it into a labels list and a dictionary with apps & time usages for each day
# Sorts data so it can be used to create composite bar chart
final_data = {}
labels = []
for date, app_usages in daily_data.items():
    labels.append(date)
    for app, time in app_usages.items():
        if app not in final_data:
            final_data[app] = [time]
        else:
            final_data[app].append(time)
This is the kind of output I am currently getting:
{'Google Chrome': [245.283942928347, 190.20031905174255], 'SQLiteStudio': [145.24058270454407], 'PyCharm': [1166.0021023750305]}
The problem here is that for days where an application had 0 usage time, it is not recorded in the list at all. Therefore the insertion order for the stacked bar chart will not be correct, and it will show the wrong apps for the wrong dates. How can I fix this?
This is one method I tried, but of course it's not working because you cannot index into a dictionary:
for app, usage in final_data.items():
    for date, app_usages in daily_data.items():
        if app not in app_usages:
            usage.insert(app_usages.index(app), 0)
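A minimal sketch of one way around this, reusing the question's own daily_data structure: iterate over the dates in label order and fill 0 for apps with no usage that day, so every app's list has exactly one entry per date:
labels = list(daily_data.keys())
all_apps = sorted({app for usages in daily_data.values() for app in usages})
final_data = {
    app: [daily_data[date].get(app, 0) for date in labels]
    for app in all_apps
}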
