calculate datetime difference over multiple rows in dataframe - python

I have a Python-related question about datetimes in a dataframe. I imported the following df via pd.read_csv():
datetime label d_time
0 2017-01-03 23:52:00
1 2017-01-03 23:53:00 A
2 2017-01-03 23:54:00 A
3 2017-01-03 23:55:00 A
4 2017-01-04 00:01:00
5 2017-01-04 00:02:00 B
6 2017-01-04 00:06:00 B
7 2017-01-04 00:09:00 B
8 2017-01-04 00:11:00 B
9 2017-01-04 00:12:00
10 2017-01-04 00:14:00
11 2017-01-04 00:16:00
12 2017-01-04 00:18:00 C
13 2017-01-04 00:20:00 C
14 2017-01-04 00:22:00
I would like to know the time difference over the rows that are labeled with A, B, C as in the following:
datetime label d_time
0 2017-01-03 23:52:00
1 2017-01-03 23:53:00 A 0:02
2 2017-01-03 23:54:00 A
3 2017-01-03 23:55:00 A
4 2017-01-04 00:01:00
5 2017-01-04 00:02:00 B 0:09
6 2017-01-04 00:06:00 B
7 2017-01-04 00:09:00 B
8 2017-01-04 00:11:00 B
9 2017-01-04 00:12:00
10 2017-01-04 00:14:00
11 2017-01-04 00:16:00
12 2017-01-04 00:18:00 C 0:02
13 2017-01-04 00:20:00 C
14 2017-01-04 00:22:00
So d_time should be the total time difference over the labeled rows. There are approx. 100 different labels, and a run can vary from 1 to x consecutive rows. This calculation has to be done for over 1 million rows, so a loop will probably not work. Does anybody know how to do this? Thanks in advance.

Assuming the consecutive labels are all the same and separated by exactly one NaN row, you can do something like this:
idx = pd.Series(df[pd.isnull(df['label'])].index)
idx_begin = idx.iloc[:-1] + 1
idx_end = idx.iloc[1:] - 1
d_time = (df.loc[idx_end, 'datetime'].reset_index(drop=True)
          - df.loc[idx_begin, 'datetime'].reset_index(drop=True))
d_time.index = idx_begin
df.loc[idx_begin, 'd_time'] = d_time
If your dataset looks different, you might look into different ways to get to idx_begin and idx_end, but this works for the dataset you posted.
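As a side note, a groupby-based alternative can handle the same case without locating the NaN separators explicitly. This is a sketch, not part of the approach above; it assumes df['datetime'] is already a datetime64 column:
import pandas as pd

# Group runs of identical consecutive labels into blocks (NaN rows break runs
# because NaN never compares equal), then subtract each block's first
# timestamp from its last and write the result on the block's first row.
block = (df['label'] != df['label'].shift()).cumsum()
labeled = df[df['label'].notnull()]
grp = labeled.groupby(block)
span = grp['datetime'].transform('max') - grp['datetime'].transform('min')
first_rows = grp.head(1).index
df.loc[first_rows, 'd_time'] = span.loc[first_rows]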
Multiple consecutive NaNs
If there are multiple consecutive NaN values, you can solve this by adding this at the end:
df.loc[df[pd.isnull(df['label'])].index, 'd_time'] = None
Consecutive different labels
idx = df[(df['label'] != df['label'].shift(1))
         & (pd.notnull(df['label']) | pd.notnull(df['label'].shift(1)))].index
idx_begin = idx[:-1]
idx_end = idx[1:] - 1
This marks each label change as the end of one series and the start of the next. To make this work, you will need the df.loc[df[pd.isnull(df['label'])].index, 'd_time'] = None line added at the end.
The & (pd.notnull(df['label']) | pd.notnull(df['label'].shift(1))) part is needed because NaN != NaN.
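A quick demonstration of why that matters:
import numpy as np

# NaN never compares equal to itself, so df['label'] != df['label'].shift(1)
# is True on every NaN row; the notnull terms filter out those spurious hits.
print(np.nan == np.nan)  # False
print(np.nan != np.nan)  # True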
Result
datetime label d_time
0 2017-01-03 23:52:00 NaN NaN
1 2017-01-03 23:53:00 A NaN
2 2017-01-03 23:54:00 A NaN
3 2017-01-03 23:52:00 NaN NaN
4 2017-01-03 23:53:00 B NaN
5 2017-01-03 23:54:00 B NaN
6 2017-01-03 23:55:00 NaN NaN
7 2017-01-03 23:56:00 NaN NaN
8 2017-01-03 23:57:00 NaN NaN
9 2017-01-04 00:02:00 A NaN
10 2017-01-04 00:06:00 A NaN
11 2017-01-04 00:09:00 A NaN
12 2017-01-04 00:02:00 B NaN
13 2017-01-04 00:06:00 B NaN
14 2017-01-04 00:09:00 B NaN
15 2017-01-04 00:11:00 NaN NaN
yields
datetime label d_time
0 2017-01-03 23:52:00 NaN NaT
1 2017-01-03 23:53:00 A 00:01:00
2 2017-01-03 23:54:00 A NaT
3 2017-01-03 23:52:00 NaN NaT
4 2017-01-03 23:53:00 B 00:01:00
5 2017-01-03 23:54:00 B NaT
6 2017-01-03 23:55:00 NaN NaT
7 2017-01-03 23:56:00 NaN NaT
8 2017-01-03 23:57:00 NaN NaT
9 2017-01-04 00:02:00 A 00:07:00
10 2017-01-04 00:06:00 A NaT
11 2017-01-04 00:09:00 A NaT
12 2017-01-04 00:02:00 B 00:07:00
13 2017-01-04 00:06:00 B NaT
14 2017-01-04 00:09:00 B NaT
15 2017-01-04 00:11:00 NaN NaT
Last Series
If the last row doesn't have a changed label compared to the one before it, the last series will not register.
You can prevent this by including this after the first line
if idx[-1] != df.index[-1]:
    idx = idx.append(df.index[[-1]] + 1)
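Put together, a self-contained sketch of this variant (just an assembly of the snippets above, assuming df has a datetime64 'datetime' column and a 'label' column) would be:
import pandas as pd

idx = df[(df['label'] != df['label'].shift(1))
         & (pd.notnull(df['label']) | pd.notnull(df['label'].shift(1)))].index
if idx[-1] != df.index[-1]:
    idx = idx.append(df.index[[-1]] + 1)   # make sure a trailing series registers
idx_begin = idx[:-1]
idx_end = idx[1:] - 1
d_time = (df.loc[idx_end, 'datetime'].reset_index(drop=True)
          - df.loc[idx_begin, 'datetime'].reset_index(drop=True))
d_time.index = idx_begin
df.loc[idx_begin, 'd_time'] = d_time
df.loc[df[pd.isnull(df['label'])].index, 'd_time'] = None   # blank out NaN rows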

If the datetimes are datetime objects (or pandas.Timestamp) you can use this for-loop:
a_rows = []
for row in df.itertuples():
    if row.label == 'A':
        a_rows.append(row)
    elif a_rows:
        d_time = a_rows[-1].datetime - a_rows[0].datetime
        df.loc[a_rows[0].Index, 'd_time'] = d_time
        a_rows = []
with this result
datetime label d_time
0 2017-01-03 23:52:00
1 2017-01-03 23:53:00 A 0 days 00:02:00
2 2017-01-03 23:54:00 A
3 2017-01-03 23:55:00 A
4 2017-01-04 00:01:00
5 2017-01-04 00:02:00 A 0 days 00:07:00
6 2017-01-04 00:06:00 A
7 2017-01-04 00:09:00 A
8 2017-01-04 00:11:00
If the datetime column holds strings, you can easily convert them with df['datetime'] = pd.to_datetime(df['datetime']).
You can also format the timedelta objects afterwards if you want.
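For example, to get the compact H:MM form from the question, a small formatter could be applied afterwards (a sketch; fmt_td is a hypothetical helper, and pandas is assumed to be imported as pd):
def fmt_td(td):
    # Render a Timedelta as H:MM, leaving missing values blank
    if pd.isnull(td):
        return ''
    total = int(td.total_seconds())
    return '{}:{:02d}'.format(total // 3600, (total % 3600) // 60)

df['d_time'] = df['d_time'].apply(fmt_td)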

Related

Pandas: minute based column, need to add 15 seconds at each row

My dataframe looks like this:
1 2019-04-22 00:01:00
2 2019-04-22 00:01:00
3 2019-04-22 00:01:00
4 2019-04-22 00:01:00
5 2019-04-22 00:02:00
6 2019-04-22 00:02:00
7 2019-04-22 00:02:00
8 2019-04-22 00:02:00
9 2019-04-22 00:03:00
10 2019-04-22 00:03:00
11 2019-04-22 00:03:00
12 2019-04-22 00:03:00
As you can see there are four rows for each minute; what I would need is to offset each row within a minute by a further 15 seconds, so that it looks like this:
1 2019-04-22 00:01:00
2 2019-04-22 00:01:15
3 2019-04-22 00:01:30
4 2019-04-22 00:01:45
5 2019-04-22 00:02:00
6 2019-04-22 00:02:15
7 2019-04-22 00:02:30
8 2019-04-22 00:02:45
9 2019-04-22 00:03:00
10 2019-04-22 00:03:15
11 2019-04-22 00:03:30
12 2019-04-22 00:03:45
Any idea on how to proceed? I am not really good with datetime objects, so I am a bit stuck on that one... thank you in advance!
You can add timedeltas to the datetime column:
df['date'] += pd.to_timedelta(df.groupby('date').cumcount() * 15, unit='s')
print (df)
date
1 2019-04-22 00:01:00
2 2019-04-22 00:01:15
3 2019-04-22 00:01:30
4 2019-04-22 00:01:45
5 2019-04-22 00:02:00
6 2019-04-22 00:02:15
7 2019-04-22 00:02:30
8 2019-04-22 00:02:45
9 2019-04-22 00:03:00
10 2019-04-22 00:03:15
11 2019-04-22 00:03:30
12 2019-04-22 00:03:45
Detail:
First create a counter Series with GroupBy.cumcount:
print (df.groupby('date').cumcount())
1 0
2 1
3 2
4 3
5 0
6 1
7 2
8 3
9 0
10 1
11 2
12 3
dtype: int64
Multiply by 15 and convert to second-based timedeltas with to_timedelta:
print (pd.to_timedelta(df.groupby('date').cumcount() * 15, unit='s'))
1 00:00:00
2 00:00:15
3 00:00:30
4 00:00:45
5 00:00:00
6 00:00:15
7 00:00:30
8 00:00:45
9 00:00:00
10 00:00:15
11 00:00:30
12 00:00:45
dtype: timedelta64[ns]
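For completeness, a self-contained reproduction of the whole answer (the column name 'date' is taken from the example above):
import pandas as pd

df = pd.DataFrame({'date': pd.to_datetime(['2019-04-22 00:01:00'] * 4
                                          + ['2019-04-22 00:02:00'] * 4
                                          + ['2019-04-22 00:03:00'] * 4)},
                  index=range(1, 13))
# Offset each row by 15 seconds per position within its minute
df['date'] += pd.to_timedelta(df.groupby('date').cumcount() * 15, unit='s')
print(df)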

How to change format of timedelta in pandas

import pandas as pd
import datetime
import numpy as np
from datetime import timedelta
def diff_func(row):
    return row['Timestamp'] - row['previous_end']
dfMockLog = [ (1, ("2017-01-01 09:00:00"), "htt://x.org/page1.html"),
(1, ("2017-01-01 09:01:00"), "htt://x.org/page2.html"),
(1, ("2017-01-01 09:02:00"), "htt://x.org/page3.html"),
(1, ("2017-01-01 09:05:00"), "htt://x.org/page3.html"),
(1, ("2017-01-01 09:30:00"), "htt://x.org/page2.html"),
(1, ("2017-01-01 09:33:00"), "htt://x.org/page1.html"),
(1, ("2017-01-01 09:37:00"), "htt://x.org/page2.html"),
(1, ("2017-01-01 09:41:00"), "htt://x.org/page3.html"),
(1, ("2017-01-01 10:00:00"), "htt://x.org/page1.html"),
(1, ("2017-01-01 11:00:00"), "htt://x.org/page2.html"),
(2, ("2017-01-01 09:41:00"), "htt://x.org/page3.html"),
(2, ("2017-01-01 09:42:00"), "htt://x.org/page1.html"),
(2, ("2017-01-01 09:43:00"), "htt://x.org/page2.html")]
dfMockLog = pd.DataFrame(dfMockLog, columns = ['user', 'Timestamp', 'url'])
dfMockLog['Timestamp'] = pd.to_datetime(dfMockLog['Timestamp'])
dfMockLog = dfMockLog.sort_values(['user','Timestamp'])
dfMockLog['previous_end'] = dfMockLog.groupby(['user'])['Timestamp'].shift(1)
dfMockLog['time_diff'] = dfMockLog.apply(diff_func, axis=1)
dfMockLog['cum_sum'] = dfMockLog['time_diff'].cumsum()
print(dfMockLog)
I need "timediff" column to be converted into seconds.And the "cum_sum" column should contain cumulative sum partitioned by "user". It will be great if one can share all possible format for timedelta.
You are close. The way I prefer is to make a new column containing time_diff in seconds via pd.Series.dt.seconds. Then use groupby.transform to extract the cumsum by user:
dfMockLog['time_diff_secs'] = dfMockLog['time_diff'].dt.seconds
dfMockLog['cum_sum'] = dfMockLog.groupby('user')['time_diff_secs'].transform('cumsum')
print(dfMockLog)
user Timestamp url previous_end \
0 1 2017-01-01 09:00:00 htt://x.org/page1.html NaT
1 1 2017-01-01 09:01:00 htt://x.org/page2.html 2017-01-01 09:00:00
2 1 2017-01-01 09:02:00 htt://x.org/page3.html 2017-01-01 09:01:00
3 1 2017-01-01 09:05:00 htt://x.org/page3.html 2017-01-01 09:02:00
4 1 2017-01-01 09:30:00 htt://x.org/page2.html 2017-01-01 09:05:00
5 1 2017-01-01 09:33:00 htt://x.org/page1.html 2017-01-01 09:30:00
6 1 2017-01-01 09:37:00 htt://x.org/page2.html 2017-01-01 09:33:00
7 1 2017-01-01 09:41:00 htt://x.org/page3.html 2017-01-01 09:37:00
8 1 2017-01-01 10:00:00 htt://x.org/page1.html 2017-01-01 09:41:00
9 1 2017-01-01 11:00:00 htt://x.org/page2.html 2017-01-01 10:00:00
10 2 2017-01-01 09:41:00 htt://x.org/page3.html NaT
11 2 2017-01-01 09:42:00 htt://x.org/page1.html 2017-01-01 09:41:00
12 2 2017-01-01 09:43:00 htt://x.org/page2.html 2017-01-01 09:42:00
time_diff time_diff_secs cum_sum
0 NaT NaN NaN
1 00:01:00 60.0 60.0
2 00:01:00 60.0 120.0
3 00:03:00 180.0 300.0
4 00:25:00 1500.0 1800.0
5 00:03:00 180.0 1980.0
6 00:04:00 240.0 2220.0
7 00:04:00 240.0 2460.0
8 00:19:00 1140.0 3600.0
9 01:00:00 3600.0 7200.0
10 NaT NaN NaN
11 00:01:00 60.0 60.0
12 00:01:00 60.0 120.0
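One caveat: Series.dt.seconds returns only the seconds component of each timedelta and wraps around past one day, so if gaps can exceed 24 hours, dt.total_seconds() is the safer choice. The dt.components accessor also breaks each timedelta into its units, which covers most display formats:
# Safer when time_diff can exceed one day
dfMockLog['time_diff_secs'] = dfMockLog['time_diff'].dt.total_seconds()
# One column per unit: days, hours, minutes, seconds, ...
print(dfMockLog['time_diff'].dt.components)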

Pandas fill forward and sum as you go

I have a sparse dataframe including dates of when inventory is bought or sold like the following:
Date Inventory
2017-01-01 10
2017-01-05 -5
2017-01-07 15
2017-01-09 -20
First step I would like to solve is to add in the other dates. I know you can use resample, but I am just highlighting this part in case it has an impact on the next, more difficult part. As below:
Date Inventory
2017-01-01 10
2017-01-02 NaN
2017-01-03 NaN
2017-01-04 NaN
2017-01-05 -5
2017-01-06 NaN
2017-01-07 15
2017-01-08 NaN
2017-01-09 -20
The final step is to fill forward over the NaNs, except that when a new value is encountered it gets added to the running value from the row above, so that the final dataframe looks like the following:
Date Inventory
2017-01-01 10
2017-01-02 10
2017-01-03 10
2017-01-04 10
2017-01-05 5
2017-01-06 5
2017-01-07 20
2017-01-08 20
2017-01-09 0
2017-01-10 0
I am trying to get a pythonic approach to this and not a loop based approach as that will be very slow.
The example should also work for a table with multiple columns as such:
Date InventoryA InventoryB
2017-01-01 10 NaN
2017-01-02 NaN NaN
2017-01-03 NaN 5
2017-01-04 NaN 5
2017-01-05 -5 NaN
2017-01-06 NaN -10
2017-01-07 15 NaN
2017-01-08 NaN NaN
2017-01-09 -20 NaN
would become:
Date InventoryA InventoryB
2017-01-01 10 0
2017-01-02 10 0
2017-01-03 10 5
2017-01-04 10 10
2017-01-05 5 10
2017-01-06 5 0
2017-01-07 20 0
2017-01-08 20 0
2017-01-09 0 0
2017-01-10 0 0
Hope that helps too. I think the current solution will have a problem with the NaNs as it stands.
Thanks
You can just fill the missing values with 0 after resampling (no inventory change on that day), and then use cumsum:
df.fillna(0).cumsum()
You're simply doing the two steps in the wrong order :)
df['Inventory'].cumsum().resample('D').pad()
Edit: you might need to set the Date as index first.
df = df.set_index('Date')
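Combined into one runnable sketch (assuming 'Date' holds parseable dates, as in the question):
import pandas as pd

df = pd.DataFrame({'Date': pd.to_datetime(['2017-01-01', '2017-01-05',
                                           '2017-01-07', '2017-01-09']),
                   'Inventory': [10, -5, 15, -20]})
# Insert the missing days as NaN, treat them as zero change, then accumulate
result = df.set_index('Date').resample('D').asfreq().fillna(0).cumsum()
print(result)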
Part 1: Assuming df is your dataframe:
Date Inventory
2017-01-01 10
2017-01-05 -5
2017-01-07 15
2017-01-09 -20
Then
import pandas as pd
import datetime
df_new = pd.DataFrame([df.Date.min() + datetime.timedelta(days=day)
                       for day in range((df.Date.max() - df.Date.min()).days + 1)])
df_new = df_new.merge(df, left_on=0, right_on='Date',how="left").drop("Date",axis=1)
df_new.columns = df.columns
This gives you:
Date Inventory
0 2017-01-01 10.0
1 2017-01-02 NaN
2 2017-01-03 NaN
3 2017-01-04 NaN
4 2017-01-05 -5.0
5 2017-01-06 NaN
6 2017-01-07 15.0
7 2017-01-08 NaN
8 2017-01-09 -20.0
Part 2
From the fillna method description:
method : {'backfill', 'bfill', 'pad', 'ffill', None}, default None
    Method to use for filling holes in reindexed Series.
    pad / ffill: propagate last valid observation forward to next valid.
    backfill / bfill: use NEXT valid observation to fill gap.
df_new.Inventory = df_new.Inventory.fillna(method="ffill")
Gives you
Date Inventory
0 2017-01-01 10.0
1 2017-01-02 10.0
2 2017-01-03 10.0
3 2017-01-04 10.0
4 2017-01-05 -5.0
5 2017-01-06 -5.0
6 2017-01-07 15.0
7 2017-01-08 15.0
8 2017-01-09 -20.0
You should be able to generalise it to more than one column once you understand how it is done with one.
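For instance, the same forward fill applied across several columns at once might look like this (a sketch; the column names are taken from the multi-column example in the question):
# Forward-fill each inventory column, then zero out any leading NaNs
cols = ['InventoryA', 'InventoryB']
df_new[cols] = df_new[cols].fillna(method='ffill').fillna(0)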

Efficiently serialising timestamps

I have a dataframe to analyse that has a column of dates as datetimes, and a column of hours as integers.
I would like to combine the two columns into a single timestamp field for some further analysis, but cannot find a way to do so quickly.
I have this code that works, but it takes an inordinate amount of time due to the length of the dataframe (~1m entries):
for i in range(len(my_df)):
    my_df['gen_timestamp'][i] = datetime.datetime.combine(my_df['date'][i],
                                                          datetime.time(my_df['hour'][i]))
What I would like to do is to somehow convert the datetime type in my_df['date'] to an integer (say a timestamp in seconds) and the integer type in my_df['hour'], so that they can be quickly summed without the need for a laborious loop.
Worst case I then convert that integer back to a datetime in one go or just use seconds as my data type going forwards.
Thanks for any help.
IIUC you can construct a TimedeltaIndex and add this to your datetimes:
In [112]:
# sample data
df = pd.DataFrame({'date':pd.date_range(dt.datetime(2017,1,1), periods=10), 'hour':np.arange(10)})
df
Out[112]:
date hour
0 2017-01-01 0
1 2017-01-02 1
2 2017-01-03 2
3 2017-01-04 3
4 2017-01-05 4
5 2017-01-06 5
6 2017-01-07 6
7 2017-01-08 7
8 2017-01-09 8
9 2017-01-10 9
In [113]:
df['timestamp'] = df['date'] + pd.TimedeltaIndex(df['hour'], unit='h')
df
Out[113]:
date hour timestamp
0 2017-01-01 0 2017-01-01 00:00:00
1 2017-01-02 1 2017-01-02 01:00:00
2 2017-01-03 2 2017-01-03 02:00:00
3 2017-01-04 3 2017-01-04 03:00:00
4 2017-01-05 4 2017-01-05 04:00:00
5 2017-01-06 5 2017-01-06 05:00:00
6 2017-01-07 6 2017-01-07 06:00:00
7 2017-01-08 7 2017-01-08 07:00:00
8 2017-01-09 8 2017-01-09 08:00:00
9 2017-01-10 9 2017-01-10 09:00:00
So in your case I expect the following to work:
my_df['gen_timestamp'] = my_df['date'] + pd.TimedeltaIndex(my_df['hour'], unit='h')
This assumes that my_df['date'] is already datetime dtype; if not, convert it first using my_df['date'] = pd.to_datetime(my_df['date']).
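An equivalent spelling uses pd.to_timedelta, which returns an index-aligned Series of timedeltas rather than an index:
my_df['gen_timestamp'] = my_df['date'] + pd.to_timedelta(my_df['hour'], unit='h')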

merge and sample two pandas time series

I have two time series. I would like to merge them and asfreq(*, method='pad') the result, restricted to the time range they share in common.
To illustrate, suppose I define A and B like this:
import datetime as dt
import numpy as np
import pandas as pd
A = pd.Series(np.arange(4), index=pd.date_range(dt.datetime(2017,1,4,10,0,0),
periods=4, freq=dt.timedelta(seconds=10)))
B = pd.Series(np.arange(6), index=pd.date_range(dt.datetime(2017,1,4,10,0,7),
periods=6, freq=dt.timedelta(seconds=3)))
So they look like:
# A
2017-01-04 10:00:00 0
2017-01-04 10:00:10 1
2017-01-04 10:00:20 2
2017-01-04 10:00:30 3
# B
2017-01-04 10:00:07 0
2017-01-04 10:00:10 1
2017-01-04 10:00:13 2
2017-01-04 10:00:16 3
2017-01-04 10:00:19 4
2017-01-04 10:00:22 5
I would like to compute something like:
# combine_and_asfreq(A, B, dt.timedelta(seconds=5))
# timestamp A B
2017-01-04 10:00:07 0 0
2017-01-04 10:00:12 1 1
2017-01-04 10:00:17 1 3
2017-01-04 10:00:22 2 5
How can I do this?
I am not exactly sure what you are asking, but here is a somewhat convoluted method that first finds the overlapping time range and creates a single-column dataframe as the 'base' dataframe with the 5s timedelta.
Get started by setting up the dataframes properly:
start = max(A.index.min(), B.index.min())
end = min(A.index.max(), B.index.max())
df_time = pd.DataFrame({'time': pd.date_range(start,end,freq='5s')})
df_A = A.reset_index()
df_B = B.reset_index()
df_A.columns = ['time', 'value']
df_B.columns = ['time', 'value']
Now we have the following three dataframes.
df_A
time value
0 2017-01-04 10:00:00 0
1 2017-01-04 10:00:10 1
2 2017-01-04 10:00:20 2
3 2017-01-04 10:00:30 3
df_B
time value
0 2017-01-04 10:00:07 0
1 2017-01-04 10:00:10 1
2 2017-01-04 10:00:13 2
3 2017-01-04 10:00:16 3
4 2017-01-04 10:00:19 4
5 2017-01-04 10:00:22 5
df_time
time
0 2017-01-04 10:00:07
1 2017-01-04 10:00:12
2 2017-01-04 10:00:17
3 2017-01-04 10:00:22
Use merge_asof to join all three:
pd.merge_asof(pd.merge_asof(df_time, df_A, on='time'),
              df_B, on='time', suffixes=('_A', '_B'))
time value_A value_B
0 2017-01-04 10:00:07 0 0
1 2017-01-04 10:00:12 1 1
2 2017-01-04 10:00:17 1 3
3 2017-01-04 10:00:22 2 5
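Note that merge_asof matches backwards by default and requires both frames to be sorted on the key column. The same output can also be obtained without leaving Series land, using reindex with forward fill (a sketch, reusing start and end from above):
# Pad each series onto the shared 5-second grid
rng = pd.date_range(start, end, freq='5s')
out = pd.DataFrame({'A': A.reindex(rng, method='pad'),
                    'B': B.reindex(rng, method='pad')})
print(out)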
