I am a beginner in Python and I am writing a script to record detection times for data received from an algorithm.
In the script I have an algorithm which accepts a few parameters and returns the ids of the data it detected. Below is some of its output:
S.No Data Id Time
1 0 2018-11-16 15:00:00
2 0, 1 2018-11-16 15:00:02
3 0, 1 2018-11-16 15:00:03
4 0, 1, 2 2018-11-16 15:00:05
5 0, 1, 2 2018-11-16 15:00:06
6 0, 2 2018-11-16 15:00:08
From the above output, we can see that on the 1st attempt it detected the data with id 0. On the 2nd attempt it detected the data with id 1, so in total ids 0 and 1 were detected. On the 4th attempt, it detected id 2. This keeps going, since it is running inside while True. From the above we can say that the time period for id 0 is 8 sec, for id 1 it is 4 sec, and for id 2 it is 3 sec. I need to calculate these time periods. For this I have written the code below:
import time

data_dict = {}      # To store value of time for each data id
data_dict_sec = {}  # To store value of seconds for each data id
run_once = True     # (initialization implied by the description below)

data = amc_data_retention()  # data contains the data id
for dataID in data.items():
    if run_once:
        run_once = False
        data_dict[dataID] = time.time()
        data_dict_sec[dataID] = 0

for dataID in data.items():
    if dataID in data_dict:
        sec = time.time() - data_dict[dataID]
        data_dict_sec[dataID] += sec
        data_dict[dataID] = time.time()
    else:
        print("New data detected")
The first for loop runs once and saves the time value of dataID in the dict. In the next for loop, that time is subtracted from the current time and the total seconds are saved in data_dict_sec. In the first iteration the total seconds will be 0, but from the next iteration onward it will save the correct seconds. This works fine only if there is 1 data id. As soon as a 2nd data id arrives, it does not record time for it.
Can anyone please suggest a good way of writing this? The main objective is to keep saving the time period values for each data id. Please help. Thanks.
The only time this adds data_ID keys to data_dict is on the first run, but it should add each new data_ID it sees, so I don't see why the first for-loop is needed. It may do what you need if you move the dictionary key initialization into the second for-loop, where it checks whether the data_ID is in data_dict: if it isn't, initialize it.
Perhaps this would do what you need:
data_dict = {}      # To store value of time for each data id
data_dict_sec = {}  # To store value of seconds for each data id

data = amc_data_retention()  # data contains the data id
for dataID in data.items():
    if dataID in data_dict:
        sec = time.time() - data_dict[dataID]
        data_dict_sec[dataID] += sec
        data_dict[dataID] = time.time()
    else:
        print("New data detected")
        data_dict[dataID] = time.time()
        data_dict_sec[dataID] = 0
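For completeness, a minimal sketch of how this could sit inside the while True loop from the question. amc_data_retention is the asker's function; the stub below, the assumption that it returns an iterable of data ids, and the polling interval are all assumptions of this sketch:

import random
import time

def amc_data_retention():
    """Stand-in for the asker's detector; assumed to return an iterable of data ids."""
    return random.sample(range(3), k=random.randint(1, 3))

data_dict = {}      # last time each data id was seen
data_dict_sec = {}  # accumulated seconds per data id

while True:
    for dataID in amc_data_retention():
        if dataID in data_dict:
            data_dict_sec[dataID] += time.time() - data_dict[dataID]
        else:
            print("New data detected")
            data_dict_sec[dataID] = 0
        data_dict[dataID] = time.time()
    time.sleep(1)  # pace the polling; the interval is an assumption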
I have a data table which has two columns time in and time out as shown below.
TimeIn TimeOut
01:23AM 01:45AM
01:34AM 01:53AM
01:43AM 01:59AM
02:01AM 02:09AM
02:34AM 03:11AM
02:39AM 02:48AM
02:56AM 03:12AM
I need to create a third column named 'Counter': when the TimeIn of the ith row is greater than the TimeOut of an earlier row, that row's counter is reused; otherwise the counter increases by 1. Think of the counters as people assigned to tasks: once a person's TimeOut has passed, he/she can take up the next job. Also, if at a particular instant more than one counter is free, I need to take the one that got free first, so the above table would look like this.
TimeIn TimeOut Counter
01:23AM 01:45AM 1
01:34AM 01:53AM 2
01:43AM 01:59AM 3
02:01AM 02:09AM 1 (in this case 1,2,3 all are also free but 1 became free first)
02:34AM 03:11AM 2 (in this case 1,2,3 all are also free but 2 became free first)
02:39AM 02:48AM 3 (in this case 1 is also free but 3 became free first)
02:56AM 03:12AM 1 (in this case 3 is also free but 1 became free first)
I was hoping there could be a way to do it in pandas without a loop, since my dataset could be large, but please let me know even if there is a way to achieve it efficiently using a loop.
Many thanks in advance.
I couldn't figure out an efficient way with native pandas methods. But if I'm not completely mistaken, a heap queue seems to be an adequate tool for the problem.
With
df =
TimeIn TimeOut
0 01:23AM 01:45AM
1 01:34AM 01:53AM
2 01:43AM 01:59AM
3 02:01AM 02:09AM
4 02:34AM 03:11AM
5 02:39AM 02:48AM
6 02:56AM 03:12AM
and
for col in ("TimeIn", "TimeOut"):
df[col] = pd.to_datetime(df[col])
this
from heapq import heappush, heappop

w_count = 1
counter = [1]
heap = []
w_time_out, w = df.TimeOut[0], 1
for time_in, time_out in zip(
    df.TimeIn.tolist()[1:], df.TimeOut.tolist()[1:]
):
    if time_in > w_time_out:
        heappush(heap, (time_out, w))
        counter.append(w)
        w_time_out, w = heappop(heap)
    else:
        w_count += 1
        counter.append(w_count)
        if time_out > w_time_out:
            heappush(heap, (time_out, w_count))
        else:
            heappush(heap, (w_time_out, w))
            w_time_out, w = time_out, w_count
produces the counter-list
[1, 2, 3, 1, 2, 3, 1]
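To attach the result to the frame as the desired column:

df["Counter"] = counter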
Regarding your input data: You don't have complete timestamps, so pd.to_datetime uses the current day as date part. So if the range of your times isn't contained in one day you'll run into trouble.
EDIT: Fixed a mistake in the last else-branch.
For the sake of completeness, I'm including a pandas/numpy based solution. Performance is roughly 3x better (I saw 12s vs 34s for 10 million records) than the heapq-based one, but the implementation is significantly harder to follow. Unless you really need the performance, I'd recommend @Timus' solution.
The idea here is:
We identify sessions where we have to increment the counter. We can immediately assign counter values to these sessions.
For the remaining sessions, we create a sequence of sessions that the same worker handles. We can then map any session to a "root session" where the worker was created.
To accomplish step (2):
We get two lists of session IDs, one sorted by start time and the other end time.
Pair each session start with the least recent session end. This corresponds to the earliest available worker taking on the next incoming request.
Work up the tree to map any given session to the first session handled by that worker.
# setup
from io import StringIO

import numpy as np
import pandas as pd

text = StringIO(
    """
TimeIn TimeOut
01:23AM 01:45AM
01:34AM 01:53AM
01:43AM 01:59AM
02:01AM 02:09AM
02:34AM 03:11AM
02:39AM 02:48AM
02:56AM 03:12AM
""".strip()
)
sessions = pd.read_csv(text, sep=" ", parse_dates=["TimeIn", "TimeOut"])
# transform the data from wide format to long format
# event_log has the following columns:
# - Session: corresponding to the index of the input data
# - EventType: either TimeIn or TimeOut
# - EventTime: the event's time value
event_log = pd.melt(
    sessions.rename_axis(index="Session").reset_index(),
    id_vars=["Session"],
    value_vars=["TimeIn", "TimeOut"],
    var_name="EventType",
    value_name="EventTime",
)
# sort the entire log by time
event_log.sort_values("EventTime", inplace=True, kind="mergesort")
# concurrency is the number of active workers at the time of that log entry
concurrency = event_log["EventType"].replace({"TimeIn": 1, "TimeOut": -1}).cumsum()
# new workers occur when the running maximum concurrency increases
new_worker = concurrency.cummax().diff().astype(bool)
new_worker_sessions = event_log.loc[new_worker, "Session"]
root_session = np.empty_like(sessions.index)
root_session[new_worker_sessions] = new_worker_sessions
# we could use the `sessions` DataFrame to avoid searching, but we'd need to sort on TimeOut
new_session = event_log.query("~@new_worker & (EventType == 'TimeIn')")["Session"]
old_session = event_log.query("~@new_worker & (EventType == 'TimeOut')")["Session"]
# Pair each session start with the session that ended least recently
root_session[new_session] = old_session[: new_session.shape[0]]
# Find the root session
# maybe something can be optimized here?
while not np.array_equal((_root_session := root_session.take(root_session)), root_session):
    root_session = _root_session
counter = np.empty_like(root_session)
counter[new_worker_sessions] = np.arange(start=1, stop=new_worker_sessions.shape[0] + 1)
sessions["Counter"] = counter.take(root_session)
Quick bit of code to generate more fake data:
N = 10 ** 6
start = pd.Timestamp("2021-08-12T01:23:00")
_base = pd.date_range(start=start, periods=N, freq=pd.Timedelta(1, "seconds"))
time_in = (
    _base.values
    + np.random.exponential(1000, size=N).astype("timedelta64[ms]")
    + np.random.exponential(10000, size=N).astype("timedelta64[ns]")
    + np.timedelta64(1, "ms")
)
time_out = (
    time_in
    + np.random.exponential(10, size=N).astype("timedelta64[s]")
    + np.random.exponential(1000, size=N).astype("timedelta64[ms]")
    + np.random.exponential(10000, size=N).astype("timedelta64[ns]")
    + np.timedelta64(1, "s")
)
sessions = (
    pd.DataFrame({"TimeIn": time_in, "TimeOut": time_out})
    .sort_values("TimeIn")
    .reset_index(drop=True)
)
I am new to Python and pandas. I am trying to assign new session IDs for around 2270 users, based on the time difference between the timestamps. If the time difference exceeds 4 hours, I want a new session ID. Otherwise, it has to remain the same. In the end, I want a modified data frame with the new session ID column. Here is what I have so far:
Eh_2016["NewSessionID"] = 1 #Initialize 'NewSessionID' column in df with 1
Eh_2016['elapsed'] = datetime.time(0,0,0,0) #Create an empty elapsed to calculate Time diff later
users = Eh_2016['Username'].unique() #find the number of unique Usernames
for user in users: #start of the loop
idx = Eh_2016.index[Eh_2016.Username == user] #Slice the original df
temp = Eh_2016[Eh_2016.Username == user] #Create a temp placeholder for the slice
counter = 1 # Initialize counter for NewSessionIDs
for i,t in enumerate(temp['Timestamp']): #Looping for each timestamp value
if i != 0 :
temp['elapsed'].iloc[i] = (t - temp['Timestamp'].iloc[i-1]) #Calculate diff btwn timestamps
if temp['elapsed'].iloc[i] > datetime.timedelta(hours = 4): #If time diff>4
counter +=1 #Increase counter value
temp['NewSessionID'].iloc[i]=counter #Assign new counter value as NewSessionID
else:
temp['NewSessionID'].iloc[i] = counter #Retain previous sessionID
Eh_2016.loc[idx,:]= temp #Replace original df with the updated slice
Any help on how to make this faster would be greatly appreciated! Let me know if you need more details. Thanks in advance.
Edit: Sample DF
Username Timestamp NewSessionID Elapsed
126842 1095513 2016-06-30 20:58:30.477 1 00:00:00
126843 1095513 2016-07-16 07:54:47.986 2 15 days 10:56:17.509000
126844 1095513 2016-07-16 07:54:47.986 2 0 days 00:00:00
126845 1095513 2016-07-16 07:55:10.986 2 0 days 00:00:23
126846 1095513 2016-07-16 07:55:13.456 2 0 days 00:00:02.470000
... ... ... ...
146920 8641894 2016-08-11 22:26:14.051 31 0 days 04:50:23.415000
146921 8641894 2016-08-11 22:26:14.488 31 0 days 00:00:00.437000
146922 8641894 2016-08-12 20:01:02.419 32 0 days 21:34:47.931000
146923 8641894 2016-08-23 10:19:05.973 33 10 days 14:18:03.554000
146924 8641894 2016-09-25 11:30:35.540 34 33 days 01:11:29.567000
Filtering the whole dataframe for each user is O(users*sessions), and it's not needed since you need to iterate over the whole thing anyway.
A more efficient approach would be to instead iterate over the dataframe in one pass, and store the temporary variables (counter, location of previous row, etc) in a separate dataframe indexed by users.
Eh_2016["NewSessionID"] = 1 #Initialize 'NewSessionID' column in df with 1
Eh_2016['elapsed'] = datetime.time(0,0,0,0) #Create an empty elapsed to calculate Time diff later
# create new dataframe of unique users
users = pd.DataFrame({'Username': Eh_2016['Username'].unique()}).set_index('Username')
# one column for the previous session looked at for each user
users['Previous'] = -1
# one column for the counter variable
users['Counter'] = 0
# iterate over each row
for index, row in Eh_2016.iterrows(): #start of the loop
user = row['Username']
previous = users[user, 'Previous']
if previous >= 0: # if this is not the first row for this user
Eh_2016.loc[index, 'elapsed'] = (row['Timestamp'] - Eh_2016.loc[previous, 'Timestamp']) #Calculate diff btwn timestamps
if Eh_2016.loc[index, 'elapsed'] > datetime.timedelta(hours = 4): #If time diff>4
users[user,'Counter'] += 1 #Increase counter value
Eh_2016.loc[index, 'NewSessionID'] = users[user,'Counter'] # Assign new counter value as NewSessionID
users[user, 'Previous'] = index # remember this row as the latest row for this user
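For completeness, a fully vectorized sketch that avoids the Python-level loop entirely, assuming the timestamps are sorted in ascending order within each user (as in the sample):

# A new session starts wherever the per-user gap exceeds 4 hours;
# the cumulative count of such gaps (plus 1) is the session ID.
gap = Eh_2016.groupby('Username')['Timestamp'].diff() > pd.Timedelta(hours=4)
Eh_2016['NewSessionID'] = gap.groupby(Eh_2016['Username']).cumsum() + 1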
I am trying to build my dataset of network recordings. For that, I want to count all connections of a specific source in a specific minute and display the result in this form:
Time                 Source       Number of Connections
2018-03-18 21:17:00  192.168.0.2  5
2018-03-18 21:18:00  192.168.0.2  1
2018-03-18 21:17:00  192.168.0.3  2
.....
I am using this code to do the count:

connection_count = {}  # dictionary that stores count of connections per minute
for s in source:
    for x in edited_time:
        if x in connection_count:
            value = connection_count[x]
            value = value + 1
            connection_count[x] = value
        else:
            connection_count[x] = 1

new_count_df  # count  # date  # source
but I don't know how to display it the way I want. In fact I used this:

for s in source:
    for x in edited_time:
        new_count_df = (_time, source, connection_count)
but it doesn't display it the way I want.
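For reference, a hedged pandas sketch of the technique. It assumes the raw records are in a DataFrame df with a datetime column 'Time' and a column 'Source' (both names are assumptions from the desired output):

import pandas as pd

# Group by minute and source, then count rows per group.
counts = (
    df.groupby([pd.Grouper(key='Time', freq='min'), 'Source'])
      .size()
      .reset_index(name='Number of Connections')
)
print(counts)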
I have a pandas dataframe df that looks like the following:
df
Out[16]:
Start End Value Start Realtime End Realtime Duration
0 0 2999 1 736051 736051 59.98
1 3000 104999 0 736051 736051 5639.98
For each row, I would need to check the Start Realtime and End Realtime columns, and if they span more than one day (e.g. Start Realtime[0] = 29-05-2016 22:30:00 and End Realtime[0] = 30-05-2016 01:00:00) I should split the row in 2:
one from Start Realtime = 29-05-2016 22:30:00 until End Realtime = 29-05-2016 23:59:59
and
one from Start Realtime = 30-05-2016 00:00:00 until End Realtime = 30-05-2016 01:00:00
keeping the same value in the Value column and recalculating the duration (in seconds) and start and end columns (in samples)
It would be nice if I can keep the cut-off time (in this example midnight) flexible.
Just take it row by row for starters. The idea is if you have a row you need to split, then return a dataframe with two rows; otherwise return a dataframe with one. And then append it on to the new dataframe you are creating.
expanded_df = pd.DataFrame()
for i, row in df.iterrows():
    expanded_df = expanded_df.append(applyFunc(row), ignore_index=True)
For each row, create a cutoff time datetime object that is the closest to the start_time but after it. Then just see whether it falls between the start_time and end_time. Finally if it requires a split, create two new pandas series to return with the changed values.
def applyFunc(row):
    start_time = datetime.datetime.fromtimestamp(row["Start Realtime"])
    end_time = ...  # Similar to above
    custom_hour = 11
    # custom_minute = ...
    cutoff_time = ...  # Start with datetime.datetime(start_time.year, start_time.month, start_time.day, custom_hour, 0, 0) and see how you need to adjust with datetime.timedelta(...)
    if start_time < cutoff_time < end_time:
        before_cutoff = ...  # Logic for the before_cutoff series; you will probably find row.set_value("key", value) useful
        after_cutoff = ...  # Logic for the after_cutoff series
        return pd.DataFrame([before_cutoff, after_cutoff])
    else:
        return row
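For concreteness, here is one hedged way that skeleton could be completed. It assumes the Realtime columns hold Unix timestamps and that a row crosses at most one cutoff; recalculating the Start/End sample columns is left out since the sampling rate isn't given. (DataFrame.append was removed in pandas 2.0, so the pieces are collected and concatenated once at the end.)

import datetime
import pandas as pd

def apply_func(row, custom_hour=0):
    # Assumption: the Realtime columns hold Unix timestamps.
    start_time = datetime.datetime.fromtimestamp(row["Start Realtime"])
    end_time = datetime.datetime.fromtimestamp(row["End Realtime"])
    # First cutoff strictly after start_time (midnight by default, kept flexible).
    cutoff = datetime.datetime(start_time.year, start_time.month, start_time.day,
                               custom_hour, 0, 0)
    if cutoff <= start_time:
        cutoff += datetime.timedelta(days=1)
    if start_time < cutoff < end_time:
        before, after = row.copy(), row.copy()
        before["End Realtime"] = cutoff.timestamp()
        after["Start Realtime"] = cutoff.timestamp()
        for part in (before, after):
            # Recalculate the duration (in seconds) from the new bounds.
            part["Duration"] = part["End Realtime"] - part["Start Realtime"]
        return pd.DataFrame([before, after])
    return row.to_frame().T  # unchanged rows come back as one-row frames

expanded_df = pd.concat([apply_func(row) for _, row in df.iterrows()],
                        ignore_index=True)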
This is my first post so please be gentle. I have searched across the world wide web looking for a solution, but I am yet to find one. The problem I'm trying to solve is as follows:
I have a dataset comprised of 500,000+ samples, with 6 features per sample.
I have put this dataset in a multiindexed pandas DataFrame.
The first level of my DataFrame is the time series index, the second level is the ID. It looks as follows:
Time id
2017-03-07 10:06:49.963241984 122.0 -7.024347
136.0 -11.664985
243.0 1.716150
2017-03-07 10:06:50.003462400 122.0 -7.025922
136.0 -11.671526
At every timestamp, a number of objects can be seen, marked by the label 'id'. For my application, I want to add a temporal dependency by including information
from 5 seconds ago, i.e. in this example from timestamp 10:06:45.
But, importantly, I only want to add this information if the object already existed at that timestamp (so if the id is equal).
I wanted to use the function dataframe.shift, as mentioned here, and I want to do it per level, as indicated by user Unutbu in How do you shift Pandas DataFrame with a multiindex?
My question is as follows:
How do I append extra columns to the original dataframe X with information on what those objects were 5 s ago? I would expect something like the following:
X['x_location_shifted'] = X.groupby(level=1)['x_location'].shift(5*rate)
with the rate being 25 Hz, i.e. we shift 125 "DateTimeIndices", but only if an object with id='...' exists at that timestamp.
EDIT:
The timestamps are not synchronized 100%, so the time gap is not always exactly equal to 0.04. Previously, I used np.argmin(np.abs(time - index)) to find the closest index to the stamp.
For example, in my set, at timestamp 2017-03-07 10:36:03.605008640 there is an object with id == 175 and location_x = 54.323.
id = 175
X.ix['2017-03-07 10:36:03.605008640', id] = 54.323
At timestamp 2017-03-07 10:36:08.604962560 ..... this object with id=175 has a location_x = 67.165955
id = 175
old_time = pd.to_datetime('2017-03-07 10:36:03.605008640')
new_time = old_time + pd.Timedelta('5 seconds')
# Finding the new value of location
X.ix[np.argmin(np.abs(new_time - X.index.get_level_values(0))), id]
So, finally, at timestep 10:36:08 I want to add the information of timestamp 10:36:03 IF the object already existed at that timestamp.
EDIT2:
After trying Maarten Fabré's solution, I came up with my own implementation, which you can find below. If anyone can show me a more pythonic way to do this, please let me know.
for current_time in X.index.get_level_values(0)[125:]:
    # only do this if there are objects at the current time
    if len(X.ix[current_time].index):
        # Calculate the past time
        past_time = current_time - pd.Timedelta('5 seconds')
        # Find the index in X.index that is closest to this past time
        past_time_index = np.argmin(np.abs(past_time - X.index.get_level_values(0)))
        # translate the index back to a label
        past_time = X.index[past_time_index][0]
        # in that timestep, cycle through the objects
        for obj_id in X.ix[current_time].index:
            # Try looking up the values of object obj_id 5 s ago
            try:
                X.ix[(current_time, obj_id), 'box_center.x.shifted'] = X.ix[(past_time, obj_id), 'box_center.x']
                X.ix[(current_time, obj_id), 'box_center.y.shifted'] = X.ix[(past_time, obj_id), 'box_center.y']
                X.ix[(current_time, obj_id), 'relative_velocity.x.shifted'] = X.ix[(past_time, obj_id), 'relative_velocity.x']
                X.ix[(current_time, obj_id), 'relative_velocity.y.shifted'] = X.ix[(past_time, obj_id), 'relative_velocity.y']
            # If the key doesn't exist, the object didn't exist 5 s ago, so the shifted field should be np.nan
            except KeyError:
                X.ix[(current_time, obj_id), 'box_center.x.shifted'] = np.nan
    print('Timestep {}'.format(current_time))
If this is not enough information, please say so and I can add it :)
Cheers and thanks!
Assuming that you have no gaps in the timestamps, one possible solution might be the following, which creates a new index with shifted timestamps and uses that to get the 5 seconds-ago values for each ID.
offset = 5 * rate
# Create a shallow copy of the multiindex levels for modification
modified_levels = list(X.index.levels)
# Shift them
modified_times = pd.Series(modified_levels[0]).shift(offset)
# Fill NaNs with dummy values to avoid duplicates in the new index
modified_times[modified_times.isnull()] = range(sum(modified_times.isnull()))
modified_levels[0] = modified_times
new_index = X.index.set_levels(modified_levels, inplace=False)
X['x_location_shifted'] = X.loc[new_index, 'x_location'].values
If the timestamps are not 100% regular, then you'll either have to round them to the nearest 1/x second, or use a loop.
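For the rounding option, a small sketch, assuming the first index level holds the timestamps and the rate is 25 Hz (i.e. a 40 ms grid):

# Snap the time level to the 25 Hz grid before shifting.
times = X.index.get_level_values(0).round('40ms')
X.index = pd.MultiIndex.from_arrays(
    [times, X.index.get_level_values(1)], names=X.index.names
)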
You could use the following as a loop-based solution.
Data definition
import pandas as pd
import numpy as np
from io import StringIO

df_str = """
timestamp id location
10:00:00.005 1 a
10:00:00.005 2 b
10:00:00.005 3 c
10:00:05.006 2 a
10:00:05.006 3 b
10:00:05.006 4 c"""

# pd.DataFrame.from_csv has since been removed from pandas; pd.read_csv is the replacement
df = pd.read_csv(StringIO(df_str), sep='\t', parse_dates=['timestamp'])

delta = pd.to_timedelta(5, unit='s')
margin = pd.to_timedelta(1 / 50, unit='s')
df['location_shifted'] = np.nan
Loop over the different ids
for label_id in set(df['id']):
    df_id = df[df['id'] == label_id].copy()  # copy to make sure we don't overwrite the original data; might not be necessary
    df_id['time_shift'] = df_id['timestamp'] + delta
    for row in df_id.itertuples():
        idx = row.Index
        time_dif = abs(df['timestamp'] - row.time_shift)
        shifted_locs = df_id[time_dif < margin]
        l = len(shifted_locs)
        if l:
            print(shifted_locs)
            if l == 1:
                idx_shift = shifted_locs.index[0]
            else:
                idx_shift = shifted_locs['time_shift'].idxmin()
            df.loc[idx_shift, 'location_shifted'] = df_id.loc[idx, 'location']
Results
timestamp id location location_shifted
0 2017-05-09 10:00:00.005 1 a
1 2017-05-09 10:00:00.005 2 b
2 2017-05-09 10:00:00.005 3 c
3 2017-05-09 10:00:05.006 2 a b
4 2017-05-09 10:00:05.006 3 b c
5 2017-05-09 10:00:05.006 4 c
For any of you arriving here with the same question: I managed to solve it in a (minimally) vectorized way, but it required me to return to a 3D panel.
3 Steps:
- make into 3D panel
- Add new columns
- Fill those columns
From a multi-index 2D frame it's possible to change it to a pandas.Panel, where you convert the 2nd index to one of the axes in the panel.
After this I have a 3D panel with axes [time, objects, parameters]. Then, transpose the panel to have the PARAMETERS as items, in order to add columns to the data panel. So: transpose the panel, add the columns, transpose back.
dp_new = dp.transpose(2, 0, 1)
dp_new['shifted_box_center_x'] = np.nan
dp_new['shifted_box_center_y'] = np.nan
dp_new['shifted_relative_velocity_x'] = np.nan
dp_new['shifted_relative_velocity_y'] = np.nan

# transpose them back to their original form
dp_new = dp_new.transpose(1, 2, 0)
Now that we have added the new fields, we can get their names by
new_fields = dp_new.minor_axis[-4:]
The objective is to add information from 5 s ago, if that object existed then. Therefore, we cycle through the time series starting from the moment that is 5 s in. In my case, at a rate of 25 Hz, this is element 5*rate = 125.
Let's first set the time to start from 5 s into the data panel:
time = dp_new.items[125:]
Then we iterate over an enumerated version of the time. The enumeration starts at 0, which is the index of the data panel at timestep 0; the first timestep, however, is the timestep at time 0 + 5 seconds.
time = dp_new.items[125:]
for iloc, ts in enumerate(time):
    # Print progress
    print('{} out of {}'.format(ts, dp.items[-1]), end="\r", flush=True)
    # Generate new INDEX field, by taking the field id and dropping the NaN values
    ids = dp_new.loc[ts].id.dropna().values
    # Drop the NaN rows from the frame
    dp_new[ts].dropna(thresh=5, inplace=True)
    # save the original indices
    original_index = {'index': dp_new.loc[ts].index, 'id': dp_new.loc[ts].id.values}
    # set the index to field id
    dp_new[ts].set_index(['id'], inplace=True)
    # Check that the vector ids does NOT contain ALL zeros
    if np.any(ids):
        df_past = dp_new.iloc[iloc].copy()  # snapshot at ts - 5 s (at ts = 5 s, iloc = 0)
        df_past.dropna(thresh=5, inplace=True)  # drop the NaN rows
        df_past.set_index(['id'], inplace=True)  # set the index to field id
        # `fields` is assumed to hold the four original column names matching new_fields,
        # i.e. box_center.x, box_center.y, relative_velocity.x, relative_velocity.y
        dp_new[ts].loc[original_index['id'], new_fields] = df_past[fields].values
This will only fill in the fields of rows whose id is in ids.
This code was able to run on a 300 000 element file in about 5 minutes.
Note: I spent quite some time on this, mainly because of how one indexes a panel. At first, I thought calling the 3 dimensions would work, as stated in the pandas help, but it seems that this is not the case.
dp_new[ts, ids, new_fields] = values does NOT work.