I have a dataframe that contains one event per row, with a Start and End datetime.
import pandas as pd
import datetime
df = pd.DataFrame({'Value': [1., 2., 3.],
                   'Start': [datetime.datetime(2017, 1, 1, 0, 0, 0),
                             datetime.datetime(2017, 1, 1, 0, 1, 0),
                             datetime.datetime(2017, 1, 1, 0, 4, 0)],
                   'End': [datetime.datetime(2017, 1, 1, 0, 0, 59),
                           datetime.datetime(2017, 1, 1, 0, 5, 0),
                           datetime.datetime(2017, 1, 1, 0, 6, 0)]},
                  index=[0, 1, 2])
df
Out[7]:
End Start Value
0 2017-01-01 00:00:59 2017-01-01 00:00:00 1.0
1 2017-01-01 00:05:00 2017-01-01 00:01:00 2.0
2 2017-01-01 00:06:00 2017-01-01 00:04:00 3.0
I would like to group consecutive rows where the difference between the End of one row and the Start of the next is smaller than a given timedelta.
E.g. here for a timedelta of 5 seconds I would like to group rows 0 and 1, and with a timedelta of 2 minutes it should yield rows 0, 1 and 2.
A solution would be to compare consecutive rows with their shifted version using .shift(); however, I would need to iterate the comparison multiple times if groups of more than 2 rows need to be merged.
As my df is very large, this is not an option.
threshold = datetime.timedelta(seconds=5)  # 5 seconds groups rows 0 and 1; minutes=2 groups all three
# gap between each row's Start and the previous row's End
gap = (df['Start'] - df['End'].shift()).abs()
# a new group starts whenever the gap exceeds the threshold
df['group'] = (gap > threshold).cumsum()
groups = df.groupby('group')
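Each group can then be collapsed to a single row. A minimal sketch, assuming the merged event should span the earliest Start to the latest End (how Value should be combined is not specified in the question, so sum is only a placeholder):
merged = groups.agg({'Start': 'min', 'End': 'max', 'Value': 'sum'})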
I assume you are trying to aggregate based on the time difference.
marker = 60
df = df.assign(diff=df.apply(lambda row: (row.End - row.Start).total_seconds() <= marker, axis=1))
for g in df.groupby('diff'):
    print(g[1])
End Start Value diff
1 2017-01-01 00:05:00 2017-01-01 00:01:00 2.0 False
2 2017-01-01 00:06:00 2017-01-01 00:04:00 3.0 False
End Start Value diff
0 2017-01-01 00:00:59 2017-01-01 00:00:00 1.0 True
Related
I have time series data at 1-minute frequency. I would like to resample the data every 5 minutes, and the resampled data should include the first, middle and last time steps of each window.
I have tried the following, but I am not getting what I expect:
import numpy as np
import pandas as pd

def my_fun(array):
    return array[0], array[-1]

df = pd.DataFrame(np.arange(60), index=pd.date_range('2017-01-01 00:00', '2017-01-01 00:59', freq='1T'))
df.resample('5T').apply(my_fun)
If I understood you correctly then you want the data for minutes 0,2,4,5,7,9,10,... in a new dataframe. A faster way than using resample may be:
df=pd.DataFrame(np.arange(60),index=pd.date_range('2017-01-01 00:00','2017-01-01 00:59', freq='1T'))
l = len(df)
idx = (df.index[range(0, l, 5)]
       .union(df.index[range(2, l, 5)])
       .union(df.index[range(4, l, 5)]))
df.loc[idx]
Output:
0
2017-01-01 00:00:00 0
2017-01-01 00:02:00 2
2017-01-01 00:04:00 4
2017-01-01 00:05:00 5
2017-01-01 00:07:00 7
2017-01-01 00:09:00 9
2017-01-01 00:10:00 10
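An equivalent, arguably simpler positional selection (a sketch of the same idea, keeping positions 0, 2 and 4 of every block of five rows):
df.iloc[[i for i in range(l) if i % 5 in (0, 2, 4)]]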
If you just wanted a combined list of your selected data in one row then you were almost there:
def my_fun(array):
    return (array.iloc[0], array.iloc[2], array.iloc[4])

df = pd.DataFrame({'0': np.arange(60)}, index=pd.date_range('2017-01-01 00:00', '2017-01-01 00:59', freq='1T'))
df.resample('5T').apply(my_fun)
Output:
0
2017-01-01 00:00:00 (0, 2, 4)
2017-01-01 00:05:00 (5, 7, 9)
2017-01-01 00:10:00 (10, 12, 14)
I have a large pandas dataframe (40 million rows) with the following format:
ID DATETIME TIMESTAMP
81215545953683710540 2017-01-01 17:39:57 1483243205
74994612102903447699 2017-01-01 19:14:12 1483243261
48126186377367976994 2017-01-01 17:19:29 1483243263
23522333658893375671 2017-01-01 12:50:46 1483243266
16194691060240380504 2017-01-01 15:59:23 1483243288
I am trying to assign a value to each row depending on the timestamp, so that I have groups of rows with the same value if they are in the same time interval.
Let's say I have t0 = 1483243205 and I want the value to change when TIMESTAMP reaches t0+10. So here my time interval would be 10 seconds.
I would like something like that:
ID DATETIME TIMESTAMP VALUE
81215545953683710540 2017-01-01 17:39:57 1483243205 0
74994612102903447699 2017-01-01 19:14:12 1483243261 5
48126186377367976994 2017-01-01 17:19:29 1483243263 5
23522333658893375671 2017-01-01 12:50:46 1483243266 6
16194691060240380504 2017-01-01 15:59:23 1483243288 8
Here is my code:
df['VALUE'] = ''
t = 1483243205
j = 0
for i in range(0, len(df['TIMESTAMP'])):
    while (df.iloc[i][2]) < (t + 10):
        df['VALUE'][i] = j
        i += 1
    t += 10
    j += 1
I get a warning when executing my code (SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame) and I get the following result:
ID DATETIME TIMESTAMP VALUE
81215545953683710540 2017-01-01 17:39:57 1483243205 0
74994612102903447699 2017-01-01 19:14:12 1483243261
48126186377367976994 2017-01-01 17:19:29 1483243263
23522333658893375671 2017-01-01 12:50:46 1483243266
16194691060240380504 2017-01-01 15:59:23 1483243288
It is not the first time I have encountered the warning and I have always overcome it, but I am confused by the fact that I only got a value for the first row.
Does anyone know what I am missing?
Thanks
I would suggest using pandas' cut method to achieve this, preventing the need to explicitly loop through your DataFrame.
tmin, tmax = df['TIMESTAMP'].min(), df['TIMESTAMP'].max()
bins = list(range(tmin, tmax + 10, 10))
labels = list(range(len(bins) - 1))
df['VALUE'] = pd.cut(df['TIMESTAMP'], bins=bins, labels=labels, include_lowest=True)
ID DATETIME TIMESTAMP VALUE
0 81215545953683710540 2017-01-01 17:39:57 1483243205 0
1 74994612102903447699 2017-01-01 19:14:12 1483243261 5
2 48126186377367976994 2017-01-01 17:19:29 1483243263 5
3 23522333658893375671 2017-01-01 12:50:46 1483243266 6
4 16194691060240380504 2017-01-01 15:59:23 1483243288 8
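At 40 million rows, plain integer arithmetic avoids building bins altogether. A sketch of the same binning, assuming VALUE is simply the number of whole 10-second steps since the minimum timestamp:
df['VALUE'] = (df['TIMESTAMP'] - df['TIMESTAMP'].min()) // 10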
I'm trying to aggregate from the end of a date range instead of from the beginning. I would have thought that adding closed='right' to the grouper would solve the issue, but it doesn't. Please let me know how I can achieve my desired output shown at the bottom, thanks.
import pandas as pd
df = pd.DataFrame(columns=['date','number'])
df['date'] = pd.date_range('1/1/2000', periods=8, freq='T')
df['number'] = pd.Series(range(8))
df
date number
0 2000-01-01 00:00:00 0
1 2000-01-01 00:01:00 1
2 2000-01-01 00:02:00 2
3 2000-01-01 00:03:00 3
4 2000-01-01 00:04:00 4
5 2000-01-01 00:05:00 5
6 2000-01-01 00:06:00 6
7 2000-01-01 00:07:00 7
With the groupby and aggregation of the date I get the following. Since I have 8 dates and I'm grouping by periods of 3, it must choose whether to truncate the earliest date group or the latest date group, and it chooses the latest (the latest date group has a count of 2):
df.groupby(pd.Grouper(key='date', freq='3T')).agg('count')
date number
2000-01-01 00:00:00 3
2000-01-01 00:03:00 3
2000-01-01 00:06:00 2
My desired output is to instead truncate the earliest date group:
date number
2000-01-01 00:00:00 2
2000-01-01 00:02:00 3
2000-01-01 00:05:00 3
Please let me know how this can be achieved; I'm hopeful there's just a parameter that can be set that I've overlooked. Note that this is similar to this question, but my question is specific to the date truncation.
EDIT: To reframe the question (thanks Alexdor): the default behavior in pandas is to bin by period [0, 3), [3, 6), [6, 9), but instead I'd like to bin by (-1, 2], (2, 5], (5, 8]
It seems like the grouper function builds up the bins starting from the oldest time in the series that you pass to it. I couldn't see a way to make it build up the bins from the newest time, but it's fairly easy to construct the bins from scratch.
freq = '3min'
minTime = df.date.min()
maxTime = df.date.max()
deltaT = pd.Timedelta(freq)
minTime -= deltaT - (maxTime - minTime) % deltaT # adjust min time to start of first bin
r = pd.date_range(start=minTime, end=maxTime, freq=freq)
df.groupby(pd.cut(df["date"], r)).agg('count')
Gives
date date number
(1999-12-31 23:58:00, 2000-01-01 00:01:00] 2 2
(2000-01-01 00:01:00, 2000-01-01 00:04:00] 3 3
(2000-01-01 00:04:00, 2000-01-01 00:07:00] 3 3
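On recent pandas (1.3 or later, where resample accepts origin='end'), it should also be possible to build the bins backwards from the newest time directly; a hedged sketch:
df.resample('3T', on='date', origin='end', closed='right', label='right')['number'].count()
This should give the same 2/3/3 counts, labelled by each bin's right edge rather than by the first date of each group.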
This is one hack, which lets you group by a constant group size, counting bottom up.
from itertools import chain

def grouper(x, k=3):
    n = len(x.index)
    lead = n % k or k  # size of the (possibly short) leading group
    return list(chain.from_iterable([[0] * lead] + [[i] * k for i in range(1, (n - lead) // k + 1)]))

df['grouper'] = grouper(df, 3)
res = (df.groupby('grouper', as_index=False)
         .agg({'date': 'first', 'number': 'count'})
         .drop(columns='grouper'))
# date number
# 0 2000-01-01 00:00:00 2
# 1 2000-01-01 00:02:00 3
# 2 2000-01-01 00:05:00 3
I'm trying to collapse the below dataframe into rows containing continuous time periods by id. Continuous means that, within the same id, the start_date is either less than, equal to, or at most one day greater than the end_date of the previous row (the data is already sorted by id, start_date and end_date). All rows that are continuous should be output as a single row, with start_date being the minimum start_date (i.e. the start_date of the first row in the continuous set) and end_date being the maximum end_date from the continuous set of rows.
Please see the desired output at the bottom.
The only way I can think of approaching this is by parsing the dataframe row by row, but I was wondering if there is a more pythonic and elegant way to do it.
id start_date end_date
1 2017-01-01 2017-01-15
1 2017-01-12 2017-01-24
1 2017-01-25 2017-02-03
1 2017-02-05 2017-02-14
1 2017-02-16 2017-02-28
2 2017-01-01 2017-01-19
2 2017-01-24 2017-02-07
2 2017-02-07 2017-02-20
Here is the code to generate the input dataframe:
import numpy as np
import pandas as pd
start_date = ['2017-01-01','2017-01-12','2017-01-25','2017-02-05','2017-02-16',
'2017-01-01','2017-01-24','2017-02-07']
end_date = ['2017-01-15','2017-01-24','2017-02-03','2017-02-14','2017-02-28',
'2017-01-19','2017-02-07','2017-02-20']
df = pd.DataFrame({'id': [1,1,1,1,1,2,2,2],
'start_date': pd.to_datetime(start_date, format='%Y-%m-%d'),
'end_date': pd.to_datetime(end_date, format='%Y-%m-%d')})
Desired output:
id start_date end_date
1 2017-01-01 2017-02-03
1 2017-02-05 2017-02-14
1 2017-02-16 2017-02-28
2 2017-01-01 2017-01-19
2 2017-01-24 2017-02-20
def f(grp):
    # a list to collect valid start/end ranges as {start_date: end_date} dicts
    d = []
    (
        # append a new range if the start date is at least 2 days greater than the
        # previous row's end date; otherwise update the last range's end date
        # with the current row's end date
        grp.reset_index(drop=True)
           .apply(lambda x: d.append({x.start_date: x.end_date})
                  if x.name == 0 or (x.start_date - pd.DateOffset(1)) > grp.iloc[x.name - 1].end_date
                  else d[-1].update({list(d[-1].keys())[0]: x.end_date}),
                  axis=1)
    )
    # reconstruct a df using only the valid start and end date pairs
    return pd.DataFrame([[list(e.keys())[0], list(e.values())[0]] for e in d],
                        columns=['start_date', 'end_date'])

df.groupby('id').apply(f).reset_index().drop(columns='level_1')
Out[467]:
id start_date end_date
0 1 2017-01-01 2017-02-03
1 1 2017-02-05 2017-02-14
2 1 2017-02-16 2017-02-28
3 2 2017-01-01 2017-01-19
4 2 2017-01-24 2017-02-20
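A vectorized alternative (a sketch, not part of the original answer): track the running maximum end_date within each id, start a new block whenever the next start_date is more than one day past it, then aggregate each block. This relies on the rows already being sorted by id, start_date and end_date, as stated in the question.
prev_end = df.groupby('id')['end_date'].cummax().groupby(df['id']).shift()
df['block'] = ((df['start_date'] - prev_end) > pd.Timedelta(days=1)).cumsum()
out = (df.groupby(['id', 'block'], as_index=False)
         .agg(start_date=('start_date', 'min'), end_date=('end_date', 'max'))
         .drop(columns='block'))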
(newbie to python and pandas)
I have a data set of 15 to 20 million rows. Each row is a time-indexed observation of a time a 'user' was seen, and I need to analyze the visit-per-day patterns of each user, normalized to their first visit. So, I'm hoping to plot with an X axis of "days after first visit" and a Y axis of "visits by this user on this day", i.e., I need to get a series indexed by a timedelta and with values of visits in the period ending with that delta, e.g. [0:1, 3:5, 4:2, 6:8]. But I'm stuck very early...
I start with something like this:
import pandas as pd

rng = pd.to_datetime(['2000-01-01 08:00', '2000-01-02 08:00',
                      '2000-01-01 08:15', '2000-01-02 18:00',
                      '2000-01-02 17:00', '2000-03-01 08:00',
                      '2000-03-01 08:20', '2000-01-02 18:00'])
uid = pd.Series(['u1', 'u2', 'u1', 'u2', 'u1', 'u2', 'u2', 'u3'])
misc = pd.Series(['', 'x1', 'A123', '1.23', '', '', '', 'u3'])
df = pd.DataFrame({'uid': uid, 'misc': misc, 'ts': rng})
df = df.set_index(df.ts)
grouped = df.groupby('uid')
firstseen = grouped.first()
The ts values are unique to each uid, but can be duplicated across uids (two uids can be seen at the same time, but any one uid is seen only once at any one timestamp).
The first step is (I think) to add a new column to the DataFrame, showing for each observation what the timedelta is back to the first observation for that user. But I'm stuck getting that column into the DataFrame. The simplest thing I tried gives me an obscure-to-newbie error message:
df['sinceseen'] = df.ts - firstseen.ts[df.uid]
...
ValueError: cannot reindex from a duplicate axis
So I tried a brute-force method:
def f(row):
    return row.ts - firstseen.ts[row.uid]

df['sinceseen'] = pd.Series([{idx: f(row)} for idx, row in df.iterrows()], dtype=timedelta)
In this attempt, df gets a sinceseen but it's all NaN and shows a type of float for type(df.sinceseen[0]) - though, if I just print the Series (in iPython) it generates a nice list of timedeltas.
I'm working back and forth through "Python for Data Analysis" and it seems like apply() should work, but
def fg(ugroup):
    ugroup['sinceseen'] = ugroup.index - ugroup.index.min()
    return ugroup

df = df.groupby('uid').apply(fg)
gives me a TypeError on the "ugroup.index - ugroup.index.min()" even though each of the two operands is a Timestamp.
So, I'm flailing; can someone point me at the "pandas" way to get to the data structure I need?
Does this help you get started?
>>> df = pd.DataFrame({'uid':uid,'misc':misc,'ts':rng})
>>> df = df.sort_values(["uid", "ts"])
>>> df["since_seen"] = df.groupby("uid")["ts"].apply(lambda x: x - x.iloc[0])
>>> df
misc ts uid since_seen
0 2000-01-01 08:00:00 u1 0 days, 00:00:00
2 A123 2000-01-01 08:15:00 u1 0 days, 00:15:00
4 2000-01-02 17:00:00 u1 1 days, 09:00:00
1 x1 2000-01-02 08:00:00 u2 0 days, 00:00:00
3 1.23 2000-01-02 18:00:00 u2 0 days, 10:00:00
5 2000-03-01 08:00:00 u2 59 days, 00:00:00
6 2000-03-01 08:20:00 u2 59 days, 00:20:00
7 u3 2000-01-02 18:00:00 u3 0 days, 00:00:00
[8 rows x 4 columns]
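From here, one way to get to "visits per day since first seen" might be (a sketch, not from the original answer; it buckets each observation by whole days since that user's first visit and counts them):
>>> df["days_since"] = df["since_seen"].dt.days
>>> df.groupby(["uid", "days_since"]).size()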