Calculate time difference between all dates in a column - python

I have a data frame that looks like this:

group  date                 value
g_1    1/2/2019 11:03:00    3
g_1    1/2/2019 11:04:00    5
g_1    1/2/2019 10:03:32    100
g_2    4/3/2019 09:11:09    46
I want to calculate the time difference between occurrences (in seconds) per group.
Example output:
groups_time_diff = {'g_1': [23, 5666, 7878], 'g_2': [0.2, 56, 2343], ...}
This is my code:
from collections import defaultdict
from tqdm import tqdm

groups_time_diff = defaultdict(list)
for group in tqdm(groups):
    group_df = unit_df[unit_df['group'] == group]
    dates = list(group_df['time'])
    while len(dates) != 0:
        min_date = min(dates)
        dates.remove(min_date)
        if len(dates) > 0:
            second_min_date = min(dates)
            date_diff = second_min_date - min_date
            groups_time_diff[group].append(date_diff.seconds)
This takes forever to run, and I am looking for a more time-efficient way to get the desired output.
Any ideas?

Try this:
sorted_group_df = group_df.sort_values(by='time', ascending=True)
dates = sorted_group_df['time']
one = dates[1:].reset_index(drop=True)
two = dates[:-1].reset_index(drop=True)
date_difference = one - two
date_difference_in_seconds = date_difference.dt.seconds

First sort your dates, then subtract the shifted series:
dates = dates.sort_values().reset_index(drop=True)
date_diffs = dates[1:].reset_index(drop=True).sub(dates[:-1])
You are calling the min function twice in each iteration, which is not efficient.
Hope this helps.
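If you want to do the whole thing inside pandas, the dictionary can also be built with one sort and a per-group diff. A rough sketch, assuming the datetime column is called 'time' as in your code (total_seconds() is used instead of .seconds so gaps longer than a day are not truncated):
import pandas as pd

unit_df = unit_df.sort_values(['group', 'time'])
# difference to the previous row within each group, in seconds
unit_df['diff_s'] = unit_df.groupby('group')['time'].diff().dt.total_seconds()
# the first row of each group has no predecessor, so drop its NaN
groups_time_diff = {g: s.dropna().tolist()
                    for g, s in unit_df.groupby('group')['diff_s']}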

Related

Speeding up a Python datetime comparison to generate values

I want to do this in a Pythonic way without using 1) a nested if statement and 2) iterrows.
I have these columns:

Date In | Date Out | 1/22 | 2/22 | ... | 12/22
1/1/19  | 5/5/22   |      |      |     |
5/5/22  | 7/7/22   |      |      |     |
For columns like '1/22', I want to insert a calculated value, which would be one of the following:
Not Created Yet
Closed
Open
For the first row, column 1/22 would read "Open" because the item was open in Jan/22. This would continue until column 5/22, at which point it would be labeled "Closed."
For the second row, column 1/22 would read "Not Created Yet" until 5/22 which would read "Open" until 7/22 which would have the value "Closed."
I don't need the full table necessarily, but I want to get a count of how many are closed/open/not created yet for every month.
Here is the code I'm using which works, but just takes longer than I think it could:
table = {}
for i in mcLogsClose.iterrows():
    table[i[0]] = {}
    for month in pd.date_range(start='9/2021', end='9/2022', freq='M'):
        if i[1]['Notif Date'] <= month:
            if i[1]['Completion Date'] <= month:
                table[i[0]][month] = "Closed"
            else:
                table[i[0]][month] = "Open"
        else:
            table[i[0]][month] = "Not Yet Created"
I then want to run table['1/22'].value_counts()
Thanks for your attention!
1. Using a loop
# The date range you are calculating for
min_date = pd.Period("2022-01")
max_date = pd.Period("2022-12")
span = (max_date - min_date).n + 1

# Strip the "Date In" and "Date Out" columns down to the month
date_in = pd.to_datetime(df["Date In"]).dt.to_period("M")
date_out = pd.to_datetime(df["Date Out"]).dt.to_period("M")

data = []
for d_in, d_out in zip(date_in, date_out):
    if d_in > max_date:
        # If date in is after max date, the whole span is under "Not Created" status
        data.append((span, 0, 0))
    elif d_out < min_date:
        # If date out is before min date, the whole span is under "Closed" status
        data.append((0, span, 0))
    else:
        # Now that we have some overlap between (d_in, d_out) and (min_date,
        # max_date), we need to calculate time spent in each status
        closed = (max_date - min(d_out, max_date)).n
        not_created = (max(d_in, min_date) - min_date).n
        open_ = span - closed - not_created
        data.append((not_created, closed, open_))

cols = ["Not Created Yet", "Closed", "Open"]
df[cols] = pd.DataFrame(data, columns=cols, index=df.index)
2. Using numpy
def to_n(arr: np.ndarray) -> np.ndarray:
    """Convert an array of pd.Period to an array of integers"""
    return np.array([i.n for i in arr])

# The date range you are calculating for. Since we intend to use vectorized
# code, we need to turn them into numpy arrays
min_date = np.repeat(pd.Period("2022-01"), len(df))
max_date = np.repeat(pd.Period("2022-12"), len(df))
span = to_n(max_date - min_date) + 1

date_in = pd.to_datetime(df["Date In"]).dt.to_period("M")
date_out = pd.to_datetime(df["Date Out"]).dt.to_period("M")

df["Not Created Yet"] = np.where(
    date_in > max_date,
    span,
    to_n(np.max([date_in, min_date], axis=0) - min_date),
)
df["Closed"] = np.where(
    date_out < min_date,
    span,
    to_n(max_date - np.min([date_out, max_date], axis=0)),
)
df["Open"] = span - df["Not Created Yet"] - df["Closed"]
Result (with some rows added for my testing):
  Date In  Date Out  Not Created Yet  Closed  Open
0  1/1/19    5/5/22                0       7     5
1  5/5/22    7/7/22                4       5     3
2  1/1/20  12/12/20                0      12     0
3  1/1/23    6/6/23               12       0     0
4  6/6/21    6/6/23                0       0    12
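If what you ultimately need is just the per-month counts mentioned in the question rather than the per-row spans, here is a rough sketch that reuses the date_in/date_out Period columns from above (the status boundaries follow the same convention as the loop version):
months = pd.period_range("2022-01", "2022-12", freq="M")
monthly_counts = pd.DataFrame({
    str(m): {
        "Not Created Yet": int((date_in > m).sum()),
        "Open": int(((date_in <= m) & (date_out >= m)).sum()),
        "Closed": int((date_out < m).sum()),
    }
    for m in months
}).T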

Finding closest timestamp between dataframe columns

I have two dataframes
import numpy as np
import pandas as pd
test1 = pd.date_range(start='1/1/2018', end='1/10/2018')
test1 = pd.DataFrame(test1)
test1.rename(columns = {list(test1)[0]: 'time'}, inplace = True)
test2 = pd.date_range(start='1/5/2018', end='1/20/2018')
test2 = pd.DataFrame(test2)
test2.rename(columns = {list(test2)[0]: 'time'}, inplace = True)
Now in the first dataframe I create a column:
test1['values'] = np.zeros(10)
I want to fill this column so that next to each date there is the index of the closest date from the second dataframe. I want it to look like this:
0 2018-01-01 0
1 2018-01-02 0
2 2018-01-03 0
3 2018-01-04 0
4 2018-01-05 0
5 2018-01-06 1
6 2018-01-07 2
7 2018-01-08 3
Of course my real data is not evenly spaced and has minutes and seconds, but the idea is the same. I use the following code:
def nearest(items, pivot):
    return min(items, key=lambda x: abs(x - pivot))

for k in range(10):
    a = nearest(test2['time'], test1['time'][k])     # find the nearest timestamp in the second dataframe
    b = test2.index[test2['time'] == a].tolist()[0]  # identify the index of this timestamp
    test1['values'][k] = b                           # assign this value to the cell
This code is very slow on large datasets, how can I make it more efficient?
P.S. timestamps in my real data are sorted and increasing just like in these artificial examples.
You could do this in one line, using numpy's argmin:
test1['values'] = test1['time'].apply(lambda t: np.argmin(np.absolute(test2['time'] - t)))
Note that applying a lambda function is essentially also a loop. Check if that satisfies your requirements performance-wise.
You might also be able to leverage the fact that your timestamps are sorted and the timedelta between each timestamp is constant (if I got that correctly). Calculate the offset in days and derive the index vector, e.g. as follows:
offset = (test1['time'] - test2['time']).iloc[0].days
if offset < 0:
    # test1 time starts before test2 time, prepend zeros:
    offset = abs(offset)
    idx = np.append(np.zeros(offset), np.arange(len(test1['time']) - offset)).astype(int)
else:
    # test1 time starts after or with test2 time, use arange right away:
    idx = np.arange(offset, offset + len(test1['time']))
test1['values'] = idx
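Since both 'time' columns are sorted, pd.merge_asof with direction='nearest' is another option worth a try. A minimal sketch (nearest_idx is just a helper column holding test2's original integer index):
right = test2.reset_index().rename(columns={'index': 'nearest_idx'})
test1['values'] = pd.merge_asof(test1, right, on='time',
                                direction='nearest')['nearest_idx'].values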

What is the best way to compute a rolling (lag and lead) difference in sales?

I'm looking to add a field or two to my data set that represent the difference in sales from the previous week to the current week and from the current week to the next week.
My dataset is about 4.5 million rows, so I'm looking for an efficient way of doing this. Currently I'm getting into a lot of iteration and for loops, and I'm quite sure I'm going about this the wrong way. I'm also trying to write code that will be reusable on other datasets, and there are situations where you might have nulls or no change in sales week to week (therefore there is no record).
The dataset looks like the following:
Store  Item  WeekID  WeeklySales
1      1567  34      100.00
2      2765  34      86.00
3      1163  34      200.00
1      1567  35      160.00
...
I have each week as its own dictionary, and then each store's sales for that week in a dictionary within it. So I can use the week as a key and then, within the week, access the store's dictionary of item sales.
weekly_sales_dict = {}
for i in df['WeekID'].unique():
    store_items_dict = {}
    subset = df[df['WeekID'] == i]
    subset = subset.groupby(['Store', 'Item']).agg({'WeeklySales': 'sum'}).reset_index()
    for j in subset['Store'].unique():
        storeset = subset[subset['Store'] == j]
        store_items_dict.update({str(j): storeset})
    weekly_sales_dict.update({str(i): store_items_dict})
Then I iterate through each week in weekly_sales_dict and compare each store/item within it to the week behind it (I planned to do the same for the next week as well). The lag_list I create can be indexed by week, store, and item, so I was going to iterate through and add the values to my df as a new lag column, but I feel I am way overthinking this.
count = 0
key_list = list(df['WeekID'].unique())
lag_list = []
for k, v in weekly_sales_dict.items():
    if count != 0 and count != len(df['WeekID'].unique()) - 1:
        prev_wk = weekly_sales_dict[str(key_list[count - 1])]
        current_wk = weekly_sales_dict[str(key_list[count])]
        for i in df['Store'].unique():
            prev_df = prev_wk[str(i)]
            current_df = current_wk[str(i)]
            for j in df['Item'].unique():
                print('in j')
                if j in list(current_df['Item'].unique()) and j in list(prev_df['Item'].unique()):
                    item_lag = (current_df[current_df['Item'] == int(j)]['WeeklySales'].values
                                - prev_df[prev_df['Item'] == int(j)]['WeeklySales'].values)
                    df[df['Item'] == j][df['Store'] == i][df['WeekID'] == key_list[count]]['lag'] = item_lag[0]
                    lag_list.append((str(i), str(j), item_lag[0]))
                elif j in list(current_df['Item'].unique()):
                    item_lag = current_df[current_df['Item'] == int(j)]['WeeklySales'].values
                    lag_list.append((str(i), str(j), item_lag[0]))
                else:
                    pass
        count += 1
    else:
        count += 1
Using pd.diff(), the problem was solved. I sorted all rows by week, then created a subset with a multi-index by grouping on store, item, and week. Finally I used pd.diff() with a period of 1, and I ended up with the sales difference from the current week to the week prior.
df = df.sort_values(by='WeekID')
subset = df.groupby(['Store', 'Item', 'WeekID']).agg({'WeeklySales': 'sum'})
subset['lag'] = subset[['WeeklySales']].diff(1)
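If you need both the lag and the lead difference in one pass, a rough sketch using groupby + shift keeps the differences from ever crossing a Store/Item boundary (lag_diff and lead_diff are just illustrative column names):
df = df.sort_values(['Store', 'Item', 'WeekID'])
g = df.groupby(['Store', 'Item'])['WeeklySales']
df['lag_diff'] = df['WeeklySales'] - g.shift(1)    # change since the previous week
df['lead_diff'] = g.shift(-1) - df['WeeklySales']  # change into the next week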

Group by date range in pandas dataframe

I have a time series data in pandas, and I would like to group by a certain time window in each year and calculate its min and max.
For example:
times = pd.date_range(start = '1/1/2011', end = '1/1/2016', freq = 'D')
df = pd.DataFrame(np.random.rand(len(times)), index=times, columns=["value"])
How can I group by a time window, e.g. 'Jan-10':'Mar-21', for each year and calculate the min and max of the value column?
You can use the resample method.
df.resample('5d').agg(['min','max'])
I'm not sure if there's a direct way to do it without first creating a flag for the days required. The following function is used to create a flag required:
# Function for flagging the days required
def flag(x):
    if x.month == 1 and x.day >= 10:
        return True
    elif x.month in [2, 3, 4]:
        return True
    elif x.month == 5 and x.day <= 21:
        return True
    else:
        return False
Since you require this for each year, it would be a good idea to have the year as a column.
Then the min and max for each year for given periods can be obtained with the code below:
times = pd.date_range(start = '1/1/2011', end = '1/1/2016', freq = 'D')
df = pd.DataFrame(np.random.rand(len(times)), index=times, columns=["value"])
df['Year'] = df.index.year
pd.pivot_table(df[list(pd.Series(df.index).apply(flag))], values=['value'], index = ['Year'], aggfunc=[min,max])
The output will look as follows:
Sample Output
Hope that answers your question... :)
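A rough alternative that sticks to the question's 'Jan-10':'Mar-21' window and skips the helper function is to build the mask with vectorized month/day comparisons on the index and then group by year (a sketch, using the df from the question):
mask = (((df.index.month == 1) & (df.index.day >= 10))
        | (df.index.month == 2)
        | ((df.index.month == 3) & (df.index.day <= 21)))
window = df[mask]
window.groupby(window.index.year)['value'].agg(['min', 'max'])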
You can define the bin edges, then throw out the bins you don't need (every other one) with .loc[::2, :]. Here I'll define two functions just to check we're getting the date ranges we want within groups (note: since left edges are open, we need to subtract 1 day):
import pandas as pd

edges = pd.to_datetime([x for year in df.index.year.unique()
                        for x in [f'{year}-02-09', f'{year}-03-21']])

def min_idx(x):
    return x.index.min()

def max_idx(x):
    return x.index.max()

df.groupby(pd.cut(df.index, bins=edges)).agg([min_idx, max_idx, min, max]).loc[::2, :]
Output:
                             value
                           min_idx     max_idx       min       max
(2011-02-09, 2011-03-21]  2011-02-10  2011-03-21  0.009343  0.990564
(2012-02-09, 2012-03-21]  2012-02-10  2012-03-21  0.026369  0.978470
(2013-02-09, 2013-03-21]  2013-02-10  2013-03-21  0.039491  0.946481
(2014-02-09, 2014-03-21]  2014-02-10  2014-03-21  0.029161  0.967490
(2015-02-09, 2015-03-21]  2015-02-10  2015-03-21  0.006877  0.969296
(2016-02-09, 2016-03-21]         NaT         NaT       NaN       NaN

Python Pandas: Count quarterly occurrence from start and end date range

I have a dataframe of jobs for different people, with a start and end time for each job. I'd like to count, every four months, how many jobs each person is responsible for. I figured out a way to do it, but I'm sure it's tremendously inefficient (I'm new to pandas). It takes quite a while to compute when I run the code on my complete dataset (hundreds of people and jobs).
Here is what I have so far.
# create a data frame
import pandas as pd
import numpy as np

df = pd.DataFrame({'job': pd.Categorical(['job1', 'job2', 'job3', 'job4']),
                   'person': pd.Categorical(['p1', 'p1', 'p2', 'p2']),
                   'start': ['2015-01-01', '2015-06-01', '2015-01-01', '2016-01-01'],
                   'end': ['2015-07-01', '2015-12-31', '2016-03-01', '2016-12-31']})
df['start'] = pd.to_datetime(df['start'])
df['end'] = pd.to_datetime(df['end'])
Which gives me my starting dataframe.
I then create a new dataset with:
bdate = min(df['start'])
edate = max(df['end'])
dates = pd.date_range(bdate, edate, freq='4MS')
people = sorted(set(list(df['person'])))
df2 = pd.DataFrame(np.zeros((len(dates), len(people))), index=dates, columns=people)

for d in pd.date_range(bdate, edate, freq='MS'):
    for p in people:
        contagem = df[(df['person'] == p) &
                      (df['start'] <= d) &
                      (df['end'] >= d)]
        pos = np.argmin(np.abs(dates - d))
        df2.iloc[pos][p] = len(contagem.index)
df2
And I get the counts I'm after.
I'm sure there must be a better way of doing this without having to loop through all dates and persons. But how?
This answer assumes that each job-person combination is unique. It creates a series for every row, with the value equal to the job and an index that expands the dates. Then it resamples every 4th month (which is not quarterly, but it is what your solution describes) and counts the unique non-NA occurrences.
def make_date_range(x):
    return pd.Series(index=pd.date_range(x.start.values[0], x.end.values[0], freq='M'),
                     data=x.job.values[0])

# Iterate through each job-person combo and make an entry for each month with the job as the value
df1 = df.groupby(['job', 'person']).apply(make_date_range).unstack('person')

# Remove the outer level from the index
df1.index = df1.index.droplevel('job')

# Resample every 4 months, counting only unique values
df1.resample('4MS').agg(lambda x: len(x[x.notnull()].unique()))
Output
person p1 p2
2015-01-01 1 1
2015-05-01 2 1
2015-09-01 1 1
2016-01-01 0 2
2016-05-01 0 1
2016-09-01 0 1
And here is a long one-line solution that iterates over every row, creates a new dataframe for each, stacks them all together via pd.concat, and then resamples.
pd.concat([pd.DataFrame(index=pd.date_range(tup.start, tup.end, freq='4MS'),
                        data=tup.job,
                        columns=[tup.person]) for tup in df.itertuples()])\
  .resample('4MS').count()
And another one that is faster
df1 = pd.melt(df, id_vars=['job', 'person'], value_name='date').set_index('date')
g = df1.groupby([pd.Grouper(freq='4MS'), 'person'])['job']
g.agg('nunique').unstack('person', fill_value=0)
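Note that the melt-based version only places a job in the 4-month buckets that contain its start or end date. If you also need the buckets a long job merely passes through (as in the first output above), a rough sketch that expands each job to one row per month first:
rows = [
    {'month': m, 'person': r.person, 'job': r.job}
    for r in df.itertuples()
    for m in pd.date_range(r.start, r.end, freq='MS')
]
monthly = pd.DataFrame(rows).set_index('month')
counts = (monthly.groupby([pd.Grouper(freq='4MS'), 'person'])['job']
                 .nunique()
                 .unstack('person', fill_value=0))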
