I have a Python-related question about datetimes in a dataframe. I imported the following df via pd.read_csv():
datetime label d_time
0 2017-01-03 23:52:00
1 2017-01-03 23:53:00 A
2 2017-01-03 23:54:00 A
3 2017-01-03 23:55:00 A
4 2017-01-04 00:01:00
5 2017-01-04 00:02:00 B
6 2017-01-04 00:06:00 B
7 2017-01-04 00:09:00 B
8 2017-01-04 00:11:00 B
9 2017-01-04 00:12:00
10 2017-01-04 00:14:00
11 2017-01-04 00:16:00
12 2017-01-04 00:18:00 C
13 2017-01-04 00:20:00 C
14 2017-01-04 00:22:00
I would like to know the time difference over the rows that are labeled with A, B, C as in the following:
datetime label d_time
0 2017-01-03 23:52:00
1 2017-01-03 23:53:00 A 0:02
2 2017-01-03 23:54:00 A
3 2017-01-03 23:55:00 A
4 2017-01-04 00:01:00
5 2017-01-04 00:02:00 B 0:09
6 2017-01-04 00:06:00 B
7 2017-01-04 00:09:00 B
8 2017-01-04 00:11:00 B
9 2017-01-04 00:12:00
10 2017-01-04 00:14:00
11 2017-01-04 00:16:00
12 2017-01-04 00:18:00 C 0:02
13 2017-01-04 00:20:00 C
14 2017-01-04 00:22:00
So d_time should be the total time difference over each run of labeled rows. There are approx. 100 different labels, and a run can be anywhere from 1 to x rows long. This calculation has to be done for over 1 million rows, so a loop will probably be too slow. Does anybody know how to do this? Thanks in advance.
Assuming the consecutive labels are all the same and separated by a single NaN,
you can do something like this:
# indices of the NaN separator rows
idx = pd.Series(df[pd.isnull(df['label'])].index)
idx_begin = idx.iloc[:-1] + 1  # first row of each labeled run
idx_end = idx.iloc[1:] - 1     # last row of each labeled run
d_time = (df.loc[idx_end, 'datetime'].reset_index(drop=True)
          - df.loc[idx_begin, 'datetime'].reset_index(drop=True))
d_time.index = idx_begin
df.loc[idx_begin, 'd_time'] = d_time
If your dataset looks different, you might look into different ways to get to idx_begin and idx_end, but this works for the dataset you posted
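For example, if the runs are not reliably separated by single NaNs, a groupby-based sketch may generalize better (this assumes df['datetime'] is already datetime dtype; run_id, span, and first_of_run are illustrative names):
import pandas as pd
# id each contiguous run of identical labels: the counter ticks up whenever
# the label changes (and NaN != NaN, so NaN rows split runs too)
run_id = (df['label'] != df['label'].shift()).cumsum()
# duration of each run, broadcast back to every row of that run
span = df.groupby(run_id)['datetime'].transform(lambda s: s.iloc[-1] - s.iloc[0])
# write the duration only on the first row of each labeled run
first_of_run = df['label'].notnull() & (run_id != run_id.shift())
df.loc[first_of_run, 'd_time'] = span[first_of_run]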
Multiple consecutive NaNs
If there are multiple consecutive NaN values, you can solve this by adding this to the end:
df.loc[df[pd.isnull(df['label'])].index, 'd_time'] = None
Consecutive different labels
# run boundaries: rows where the label changes
idx = df[(df['label'] != df['label'].shift(1))
         & (pd.notnull(df['label']) | pd.notnull(df['label'].shift(1)))].index
idx_begin = idx[:-1]    # first row of each run
idx_end = idx[1:] - 1   # last row of each run
This marks different labels as different starts and ends. To make this work, you will need the df.loc[df[pd.isnull(df['label'])].index, 'd_time'] = None added to the end.
The & (pd.notnull(df['label']) | pd.notnull(df['label'].shift(1))) part is needed because NaN != NaN, so consecutive NaN rows would otherwise each register as a label change.
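A quick demonstration of why the extra null checks matter:
import numpy as np
np.nan == np.nan   # False: NaN never compares equal, even to itself
np.nan != np.nan   # True: without the null checks, every NaN row would
                   # look like a label change and become a run boundary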
Result
datetime label d_time
0 2017-01-03 23:52:00 NaN NaN
1 2017-01-03 23:53:00 A NaN
2 2017-01-03 23:54:00 A NaN
3 2017-01-03 23:52:00 NaN NaN
4 2017-01-03 23:53:00 B NaN
5 2017-01-03 23:54:00 B NaN
6 2017-01-03 23:55:00 NaN NaN
7 2017-01-03 23:56:00 NaN NaN
8 2017-01-03 23:57:00 NaN NaN
9 2017-01-04 00:02:00 A NaN
10 2017-01-04 00:06:00 A NaN
11 2017-01-04 00:09:00 A NaN
12 2017-01-04 00:02:00 B NaN
13 2017-01-04 00:06:00 B NaN
14 2017-01-04 00:09:00 B NaN
15 2017-01-04 00:11:00 NaN NaN
yields
datetime label d_time
0 2017-01-03 23:52:00 NaN NaT
1 2017-01-03 23:53:00 A 00:01:00
2 2017-01-03 23:54:00 A NaT
3 2017-01-03 23:52:00 NaN NaT
4 2017-01-03 23:53:00 B 00:01:00
5 2017-01-03 23:54:00 B NaT
6 2017-01-03 23:55:00 NaN NaT
7 2017-01-03 23:56:00 NaN NaT
8 2017-01-03 23:57:00 NaN NaT
9 2017-01-04 00:02:00 A 00:07:00
10 2017-01-04 00:06:00 A NaT
11 2017-01-04 00:09:00 A NaT
12 2017-01-04 00:02:00 B 00:07:00
13 2017-01-04 00:06:00 B NaT
14 2017-01-04 00:09:00 B NaT
15 2017-01-04 00:11:00 NaN NaT
Last Series
If the last row doesn't have a changed label compared to the one before it, the last series will not register.
You can prevent this by adding the following after the first line:
if idx[-1] != df.index[-1]:
    idx = idx.append(df.index[[-1]] + 1)
If the datetimes are datetime objects (or pandas.Timestamp) you can use this for-loop:
a_rows = []
for row in df.itertuples():
    if row.label == 'A':
        a_rows.append(row)
    elif a_rows:
        # a run of 'A' rows just ended: write its duration on its first row
        d_time = a_rows[-1].datetime - a_rows[0].datetime
        df.loc[a_rows[0].Index, 'd_time'] = d_time
        a_rows = []
with this result
datetime label d_time
0 2017-01-03 23:52:00
1 2017-01-03 23:53:00 A 0 days 00:02:00
2 2017-01-03 23:54:00 A
3 2017-01-03 23:55:00 A
4 2017-01-04 00:01:00
5 2017-01-04 00:02:00 A 0 days 00:07:00
6 2017-01-04 00:06:00 A
7 2017-01-04 00:09:00 A
8 2017-01-04 00:11:00
You can later format the timedelta object if you want.
If the datetime column contains strings, you can easily convert them with df['datetime'] = pd.to_datetime(df['datetime'])
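As for formatting, here is a minimal sketch of one way to render the timedeltas as the H:MM strings shown in the question (the exact format is an assumption):
# format each timedelta as "H:MM", e.g. 0:02; NaT becomes an empty string
df['d_time'] = df['d_time'].map(
    lambda td: '' if pd.isnull(td)
    else f"{int(td.total_seconds() // 3600)}:{int(td.total_seconds() % 3600 // 60):02d}")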
Related
My dataframe looks like this:
1 2019-04-22 00:01:00
2 2019-04-22 00:01:00
3 2019-04-22 00:01:00
4 2019-04-22 00:01:00
5 2019-04-22 00:02:00
6 2019-04-22 00:02:00
7 2019-04-22 00:02:00
8 2019-04-22 00:02:00
9 2019-04-22 00:03:00
10 2019-04-22 00:03:00
11 2019-04-22 00:03:00
12 2019-04-22 00:03:00
As you can see there are four rows for each minute; what I would need is to add 15 seconds to each successive row within a minute, so that it looks like this:
1 2019-04-22 00:01:00
2 2019-04-22 00:01:15
3 2019-04-22 00:01:30
4 2019-04-22 00:01:45
5 2019-04-22 00:02:00
6 2019-04-22 00:02:15
7 2019-04-22 00:02:30
8 2019-04-22 00:02:45
9 2019-04-22 00:03:00
10 2019-04-22 00:03:15
11 2019-04-22 00:03:30
12 2019-04-22 00:03:45
Any idea on how to proceed? I am not really good with datetime objects, so I am a bit stuck on this one... thank you in advance!
You can add timedeltas to the datetime column:
df['date'] += pd.to_timedelta(df.groupby('date').cumcount() * 15, unit='s')
print (df)
date
1 2019-04-22 00:01:00
2 2019-04-22 00:01:15
3 2019-04-22 00:01:30
4 2019-04-22 00:01:45
5 2019-04-22 00:02:00
6 2019-04-22 00:02:15
7 2019-04-22 00:02:30
8 2019-04-22 00:02:45
9 2019-04-22 00:03:00
10 2019-04-22 00:03:15
11 2019-04-22 00:03:30
12 2019-04-22 00:03:45
Detail:
First create a counter Series with GroupBy.cumcount:
print (df.groupby('date').cumcount())
1 0
2 1
3 2
4 3
5 0
6 1
7 2
8 3
9 0
10 1
11 2
12 3
dtype: int64
Multiply by 15 and convert to second-based timedeltas with to_timedelta:
print (pd.to_timedelta(df.groupby('date').cumcount() * 15, unit='s'))
1 00:00:00
2 00:00:15
3 00:00:30
4 00:00:45
5 00:00:00
6 00:00:15
7 00:00:30
8 00:00:45
9 00:00:00
10 00:00:15
11 00:00:30
12 00:00:45
dtype: timedelta64[ns]
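For reference, a self-contained version of the whole recipe (the column name date and the string timestamps are assumptions matching the question):
import pandas as pd
# four rows per minute, as in the question
df = pd.DataFrame({'date': pd.to_datetime(['2019-04-22 00:01:00'] * 4
                                          + ['2019-04-22 00:02:00'] * 4
                                          + ['2019-04-22 00:03:00'] * 4)})
# shift each row by 15 s times its position within its minute group
df['date'] += pd.to_timedelta(df.groupby('date').cumcount() * 15, unit='s')
print(df)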
import pandas as pd
import datetime
import numpy as np
from datetime import timedelta
def diff_func(row):
    return (row['Timestamp'] - row['previous_end'])
dfMockLog = [ (1, ("2017-01-01 09:00:00"), "htt://x.org/page1.html"),
(1, ("2017-01-01 09:01:00"), "htt://x.org/page2.html"),
(1, ("2017-01-01 09:02:00"), "htt://x.org/page3.html"),
(1, ("2017-01-01 09:05:00"), "htt://x.org/page3.html"),
(1, ("2017-01-01 09:30:00"), "htt://x.org/page2.html"),
(1, ("2017-01-01 09:33:00"), "htt://x.org/page1.html"),
(1, ("2017-01-01 09:37:00"), "htt://x.org/page2.html"),
(1, ("2017-01-01 09:41:00"), "htt://x.org/page3.html"),
(1, ("2017-01-01 10:00:00"), "htt://x.org/page1.html"),
(1, ("2017-01-01 11:00:00"), "htt://x.org/page2.html"),
(2, ("2017-01-01 09:41:00"), "htt://x.org/page3.html"),
(2, ("2017-01-01 09:42:00"), "htt://x.org/page1.html"),
(2, ("2017-01-01 09:43:00"), "htt://x.org/page2.html")]
dfMockLog = pd.DataFrame(dfMockLog, columns = ['user', 'Timestamp', 'url'])
dfMockLog['Timestamp'] = pd.to_datetime(dfMockLog['Timestamp'])
dfMockLog = dfMockLog.sort_values(['user','Timestamp'])
dfMockLog['previous_end'] = dfMockLog.groupby(['user'])['Timestamp'].shift(1)
dfMockLog['time_diff'] = dfMockLog.apply(diff_func, axis=1)
dfMockLog['cum_sum'] = dfMockLog['time_diff'].cumsum()
print(dfMockLog)
I need "timediff" column to be converted into seconds.And the "cum_sum" column should contain cumulative sum partitioned by "user". It will be great if one can share all possible format for timedelta.
You are close. The way I prefer is to make a new column containing time_diff in seconds via pd.Series.dt.seconds (note that .dt.seconds returns only the seconds component of each timedelta; if gaps can exceed a day, use .dt.total_seconds() instead). Then use groupby.transform to extract the cumsum by user:
dfMockLog['time_diff_secs'] = dfMockLog['time_diff'].dt.seconds
dfMockLog['cum_sum'] = dfMockLog.groupby('user')['time_diff_secs'].transform('cumsum')
print(dfMockLog)
user Timestamp url previous_end \
0 1 2017-01-01 09:00:00 htt://x.org/page1.html NaT
1 1 2017-01-01 09:01:00 htt://x.org/page2.html 2017-01-01 09:00:00
2 1 2017-01-01 09:02:00 htt://x.org/page3.html 2017-01-01 09:01:00
3 1 2017-01-01 09:05:00 htt://x.org/page3.html 2017-01-01 09:02:00
4 1 2017-01-01 09:30:00 htt://x.org/page2.html 2017-01-01 09:05:00
5 1 2017-01-01 09:33:00 htt://x.org/page1.html 2017-01-01 09:30:00
6 1 2017-01-01 09:37:00 htt://x.org/page2.html 2017-01-01 09:33:00
7 1 2017-01-01 09:41:00 htt://x.org/page3.html 2017-01-01 09:37:00
8 1 2017-01-01 10:00:00 htt://x.org/page1.html 2017-01-01 09:41:00
9 1 2017-01-01 11:00:00 htt://x.org/page2.html 2017-01-01 10:00:00
10 2 2017-01-01 09:41:00 htt://x.org/page3.html NaT
11 2 2017-01-01 09:42:00 htt://x.org/page1.html 2017-01-01 09:41:00
12 2 2017-01-01 09:43:00 htt://x.org/page2.html 2017-01-01 09:42:00
time_diff time_diff_secs cum_sum
0 NaT NaN NaN
1 00:01:00 60.0 60.0
2 00:01:00 60.0 120.0
3 00:03:00 180.0 300.0
4 00:25:00 1500.0 1800.0
5 00:03:00 180.0 1980.0
6 00:04:00 240.0 2220.0
7 00:04:00 240.0 2460.0
8 00:19:00 1140.0 3600.0
9 01:00:00 3600.0 7200.0
10 NaT NaN NaN
11 00:01:00 60.0 60.0
12 00:01:00 60.0 120.0
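As for "all possible formats for timedelta", here is a quick, non-exhaustive sketch of the common conversions (all standard pandas accessors):
td = dfMockLog['time_diff']
td.dt.total_seconds()          # float seconds, including any days part
td.dt.seconds                  # seconds component only (0-86399), days excluded
td.dt.days                     # whole-day component
td.dt.components               # DataFrame of days/hours/minutes/seconds/...
td / pd.Timedelta(minutes=1)   # float minutes; pick any unit this way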
I have a sparse dataframe including dates of when inventory is bought or sold like the following:
Date Inventory
2017-01-01 10
2017-01-05 -5
2017-01-07 15
2017-01-09 -20
The first step I would like to solve is to add in the other dates. I know you can use resample, but I'm just highlighting this part in case it has an impact on the next, more difficult part. As below:
Date Inventory
2017-01-01 10
2017-01-02 NaN
2017-01-03 NaN
2017-01-04 NaN
2017-01-05 -5
2017-01-06 NaN
2017-01-07 15
2017-01-08 NaN
2017-01-09 -20
The final step is to fill forward over the NaNs, except that whenever a new value is encountered it gets added to the running value carried forward from the rows above, so that the final dataframe looks like the following:
Date Inventory
2017-01-01 10
2017-01-02 10
2017-01-03 10
2017-01-04 10
2017-01-05 5
2017-01-06 5
2017-01-07 20
2017-01-08 20
2017-01-09 0
2017-01-10 0
I am trying to get a pythonic approach to this and not a loop-based approach, as that would be very slow.
The example should also work for a table with multiple columns as such:
Date InventoryA InventoryB
2017-01-01 10 NaN
2017-01-02 NaN NaN
2017-01-03 NaN 5
2017-01-04 NaN 5
2017-01-05 -5 NaN
2017-01-06 NaN -10
2017-01-07 15 NaN
2017-01-08 NaN NaN
2017-01-09 -20 NaN
would become:
Date InventoryA InventoryB
2017-01-01 10 0
2017-01-02 10 0
2017-01-03 10 5
2017-01-04 10 10
2017-01-05 5 10
2017-01-06 5 0
2017-01-07 20 0
2017-01-08 20 0
2017-01-09 0 0
2017-01-10 0 0
Hope that helps too. I think the current solution will have a problem with the NaNs. Thanks!
You can just fill the missing values with 0 after resampling (no inventory change on that day), and then use cumsum
df.fillna(0).cumsum()
You're simply doing the two steps in the wrong order :)
df['Inventory'].cumsum().resample('D').pad()
Edit: you might need to set the Date as index first.
df = df.set_index('Date')
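Putting both steps together, a minimal sketch (this should also cover the multi-column case, since fillna(0) reads NaN as "no change"):
import pandas as pd
df = df.set_index('Date')           # resample needs a DatetimeIndex
daily = df.resample('D').asfreq()   # insert the missing days as NaN
result = daily.fillna(0).cumsum()   # NaN -> zero change, then running total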
Part 1: Assuming df is your
Date Inventory
2017-01-01 10
2017-01-05 -5
2017-01-07 15
2017-01-09 -20
Then
import pandas as pd
import datetime
# build a row for every day between the min and max dates
df_new = pd.DataFrame([df.Date.min() + datetime.timedelta(days=day)
                       for day in range((df.Date.max() - df.Date.min()).days + 1)])
df_new = df_new.merge(df, left_on=0, right_on='Date', how="left").drop("Date", axis=1)
df_new.columns = df.columns
Gives you:
Date Inventory
0 2017-01-01 10.0
1 2017-01-02 NaN
2 2017-01-03 NaN
3 2017-01-04 NaN
4 2017-01-05 -5.0
5 2017-01-06 NaN
6 2017-01-07 15.0
7 2017-01-08 NaN
8 2017-01-09 -20.0
Part 2
From the fillna method description:
method : {'backfill', 'bfill', 'pad', 'ffill', None}, default None
    Method to use for filling holes in reindexed Series.
    pad / ffill: propagate last valid observation forward to next valid.
    backfill / bfill: use NEXT valid observation to fill gap.
df_new.Inventory = df_new.Inventory.fillna(method="ffill")
Gives you
Date Inventory
0 2017-01-01 10.0
1 2017-01-02 10.0
2 2017-01-03 10.0
3 2017-01-04 10.0
4 2017-01-05 -5.0
5 2017-01-06 -5.0
6 2017-01-07 15.0
7 2017-01-08 15.0
8 2017-01-09 -20.0
You should be able to generalise it for more than one column once you understood how it can be done with one.
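For instance, a sketch of the same ffill step applied to every inventory column at once (column names assumed):
# forward-fill all inventory columns in one call
value_cols = df_new.columns.drop('Date')
df_new[value_cols] = df_new[value_cols].ffill()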
I have a dataframe to analyse that has a column of dates as datetimes, and a column of hours as integers.
I would like to combine the two columns into a single timestamp field for some further analysis, but cannot find a way to do so quickly.
I have this code that works, but takes an inordinate amount of time due to the length of the dataframe (~1m entries):
for i in range(len(my_df)):
    my_df['gen_timestamp'][i] = datetime.datetime.combine(my_df['date'][i],
                                                          datetime.time(my_df['hour'][i]))
What I would like to do is to somehow convert the datetime type in my_df['date'] to an integer (say, a timestamp in seconds) and combine it with the integer type in my_df['hour'], so that they can be quickly summed without the need for a laborious loop.
Worst case I then convert that integer back to a datetime in one go or just use seconds as my data type going forwards.
Thanks for any help.
IIUC you can construct a TimedeltaIndex and add this to your datetimes:
In [112]:
# sample data
df = pd.DataFrame({'date':pd.date_range(dt.datetime(2017,1,1), periods=10), 'hour':np.arange(10)})
df
Out[112]:
date hour
0 2017-01-01 0
1 2017-01-02 1
2 2017-01-03 2
3 2017-01-04 3
4 2017-01-05 4
5 2017-01-06 5
6 2017-01-07 6
7 2017-01-08 7
8 2017-01-09 8
9 2017-01-10 9
In [113]:
df['timestamp'] = df['date'] + pd.TimedeltaIndex(df['hour'], unit='h')
df
Out[113]:
date hour timestamp
0 2017-01-01 0 2017-01-01 00:00:00
1 2017-01-02 1 2017-01-02 01:00:00
2 2017-01-03 2 2017-01-03 02:00:00
3 2017-01-04 3 2017-01-04 03:00:00
4 2017-01-05 4 2017-01-05 04:00:00
5 2017-01-06 5 2017-01-06 05:00:00
6 2017-01-07 6 2017-01-07 06:00:00
7 2017-01-08 7 2017-01-08 07:00:00
8 2017-01-09 8 2017-01-09 08:00:00
9 2017-01-10 9 2017-01-10 09:00:00
So in your case I expect the following to work:
my_df['gen_timestamp'] = my_df['date'] + pd.TimedeltaIndex(my_df['hour'], unit='h')
This assumes that my_df['date'] is already datetime dtype; if not, convert it first using my_df['date'] = pd.to_datetime(my_df['date'])
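An equivalent form using pd.to_timedelta, if you prefer working with a Series instead of constructing a TimedeltaIndex (same assumptions about the columns):
my_df['gen_timestamp'] = my_df['date'] + pd.to_timedelta(my_df['hour'], unit='h')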
I have two time series. I would like to merge them and asfreq(*, method='pad') the result, restricted to the time range they share in common.
To illustrate, suppose I define A and B like this:
import datetime as dt
import numpy as np
import pandas as pd
A = pd.Series(np.arange(4), index=pd.date_range(dt.datetime(2017,1,4,10,0,0),
periods=4, freq=dt.timedelta(seconds=10)))
B = pd.Series(np.arange(6), index=pd.date_range(dt.datetime(2017,1,4,10,0,7),
periods=6, freq=dt.timedelta(seconds=3)))
So they look like:
# A
2017-01-04 10:00:00 0
2017-01-04 10:00:10 1
2017-01-04 10:00:20 2
2017-01-04 10:00:30 3
# B
2017-01-04 10:00:07 0
2017-01-04 10:00:10 1
2017-01-04 10:00:13 2
2017-01-04 10:00:16 3
2017-01-04 10:00:19 4
2017-01-04 10:00:22 5
I would like to compute something like:
# combine_and_asfreq(A, B, dt.timedelta(seconds=5))
# timestamp A B
2017-01-04 10:00:07 0 0
2017-01-04 10:00:12 1 1
2017-01-04 10:00:17 1 3
2017-01-04 10:00:22 2 5
How can I do this?
I am not exactly sure what you are asking, but here is a somewhat convoluted method that first finds the overlapping time range and creates a single-column dataframe as the 'base' dataframe with the 5s timedelta.
Get started by setting up the dataframes properly:
start = max(A.index.min(), B.index.min())
end = min(A.index.max(), B.index.max())
df_time = pd.DataFrame({'time': pd.date_range(start,end,freq='5s')})
df_A = A.reset_index()
df_B = B.reset_index()
df_A.columns = ['time', 'value']
df_B.columns = ['time', 'value']
Now we have the following three dataframes.
df_A
time value
0 2017-01-04 10:00:00 0
1 2017-01-04 10:00:10 1
2 2017-01-04 10:00:20 2
3 2017-01-04 10:00:30 3
df_B
time value
0 2017-01-04 10:00:07 0
1 2017-01-04 10:00:10 1
2 2017-01-04 10:00:13 2
3 2017-01-04 10:00:16 3
4 2017-01-04 10:00:19 4
5 2017-01-04 10:00:22 5
df_time
time
0 2017-01-04 10:00:07
1 2017-01-04 10:00:12
2 2017-01-04 10:00:17
3 2017-01-04 10:00:22
Use merge_asof to join all three
pd.merge_asof(pd.merge_asof(df_time, df_A, on='time'), df_B, on='time', suffixes=('_A', '_B'))
time value_A value_B
0 2017-01-04 10:00:07 0 0
1 2017-01-04 10:00:12 1 1
2 2017-01-04 10:00:17 1 3
3 2017-01-04 10:00:22 2 5
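For what it's worth, here is an index-based sketch of the same idea that skips merge_asof: build the 5-second grid, reindex onto the union of all timestamps, pad forward, then keep only the grid points (assumes A and B as defined in the question):
start = max(A.index.min(), B.index.min())
end = min(A.index.max(), B.index.max())
grid = pd.date_range(start, end, freq='5s')
combined = pd.DataFrame({'A': A, 'B': B})   # aligns A and B on a union index
out = combined.reindex(combined.index.union(grid)).ffill().loc[grid]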