pandas resample timed events in DataFrame to precise time-bins - python

Maybe I just could not find it... anyhow, with pandas '0.19.2' I run into the following problem:
I have some timed events associated with groups, which can be generated by:
from numpy.random import randint, seed
import pandas as pd
seed(42) # reproducibility
samp_N = 1000
# create times within 3 hours, and 15 random groups
df = pd.DataFrame({'time': randint(0, 3*60*60, samp_N),
                   'group': randint(0, 15, samp_N)})
# make a resample-able index from the seconds time values
df.set_index(pd.TimedeltaIndex(df.time, 's'), inplace=True)
which looks like:
group time
02:01:10 10 7270
00:14:20 13 860
01:29:50 9 5390
01:26:31 13 5191
...
When I try to resample the events, I get something undesirable:
df.resample('5T').count()
group time
00:00:04 28 28
00:05:04 18 18
00:10:04 32 32
...
Unfortunately the resampling periods start at an arbitrary offset (the first value in the data).
It gets even more annoying when I group this (as ultimately required):
df.groupby('group').resample('5T').count()
then I get a new offset for each group
What I want is sampling windows that start precisely on the 5-minute grid:
00:00:00 5 ...
00:05:00 17 ...
00:10:00 11 ...
...
There was a suggestion in https://stackoverflow.com/a/23966229:
df.groupby(pd.TimeGrouper('5Min')).count()
but it does not work either, as it also ruins the grouping required above.
thanks for hints!

Unfortunately I didn't come up with a nice solution, only a workaround: I added a dummy row with time value zero and then grouped by time and group:
# prepend a dummy row at time 0 so the 5-minute bins are anchored at 00:00:00
df = pd.Series({'time': 0, 'group': -1}).to_frame().T \
       .set_index(pd.TimedeltaIndex([0], 's')).append(df)
df = df.groupby([pd.Grouper(freq='5Min'), 'group']).count().reset_index('group')
df = df.loc[df['group'] != -1]  # drop the dummy group again
df.head()
group time
0 days 0 2
0 days 1 4
0 days 2 3
0 days 3 1
0 days 4 2
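Another angle, in case it helps: since the complaint is only about where the bins start, you can compute the bin labels yourself by flooring the TimedeltaIndex (TimedeltaIndex.floor should be available in 0.19 as well) and group on that. A rough sketch, starting again from the df built in the question:
# floor each timestamp to the start of its 5-minute bin, so the labels fall
# on 00:00:00, 00:05:00, ... regardless of the first value in the data
bins = df.index.floor('5min')
df.groupby(['group', bins]).count()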

I am not sure this is the result you want:
result = df.groupby(['group', pd.Grouper(freq='5Min')]).count().reset_index(level=0)
result.head()
>>> group time
00:05:00 0 2
00:10:00 0 1
00:15:00 0 3
00:20:00 0 2
00:30:00 0 1
result.sort_index().head()
>>> group time
0 days 10 1
0 days 14 3
0 days 2 1
0 days 13 1
0 days 4 3

Related

Convert string hours to minute pd.eval

I want to convert all rows of my DataFrame that contain hours and minutes into minutes only.
I have a dataframe that looks like this:
df=
time
0 8h30
1 14h07
2 08h30
3 7h50
4 8h0
5 8h15
6 6h15
I'm using the following method to convert:
df['time'] = pd.eval(
    df['time'].replace(['h'], ['*60+'], regex=True))
Output
SyntaxError: invalid syntax
I think the error comes from the format of the hour; maybe pd.eval can't accept 08h30 or 8h0. How can I solve this problem?
Pandas can already handle such strings if the units are included in the string. While 14h07 can't be parsed (why assume 07 is minutes?), 14h07m can be converted to a Timedelta:
>>> pd.to_timedelta("14h07m")
Timedelta('0 days 14:07:00')
Given this dataframe :
d1 = pd.DataFrame(['8h30m', '14h07m', '08h30m', '8h0m'],
                  columns=['time'])
You can convert the time series into a Timedelta series with pd.to_timedelta :
>>> d1['tm'] = pd.to_timedelta(d1['time'])
>>> d1
time tm
0 8h30m 0 days 08:30:00
1 14h07m 0 days 14:07:00
2 08h30m 0 days 08:30:00
3 8h0m 0 days 08:00:00
To handle the missing minutes unit in the original data, just append m:
d1['tm'] = pd.to_timedelta(d1['time'] + 'm')
Once you have a Timedelta you can calculate hours and minutes.
The components of the values can be retrieved with Timedelta.components
>>> d1.tm.dt.components.hours
0 8
1 14
2 8
3 8
Name: hours, dtype: int64
To get the total number of minutes, cast the Timedelta to a minute frequency:
>>> d1.tm.astype('timedelta64[m]')
0 510.0
1 847.0
2 510.0
3 480.0
Name: tm, dtype: float64
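An equivalent way to get the total minutes, which sidesteps the astype-to-float behaviour (that behaviour changed in later pandas releases), is dt.total_seconds(); a small sketch:
# total minutes as a float, computed from total seconds
d1['total_minutes'] = d1['tm'].dt.total_seconds() / 60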
Bringing all the operations together :
>>> d1['tm'] = pd.to_timedelta(d1['time'])
>>> d2 = (d1.assign(h=d1.tm.dt.components.hours,
...                 m=d1.tm.dt.components.minutes,
...                 total_minutes=d1.tm.astype('timedelta64[m]')))
>>>
>>> d2
time tm h m total_minutes
0 8h30m 0 days 08:30:00 8 30 510.0
1 14h07m 0 days 14:07:00 14 7 847.0
2 08h30m 0 days 08:30:00 8 30 510.0
3 8h0m 0 days 08:00:00 8 0 480.0
To avoid having to trim leading zeros, an alternative approach:
df[['h', 'm']] = df['time'].str.split('h', expand=True).astype(int)
df['total_min'] = df['h']*60 + df['m']
Result:
time h m total_min
0 8h30 8 30 510
1 14h07 14 7 847
2 08h30 8 30 510
3 7h50 7 50 470
4 8h0 8 0 480
5 8h15 8 15 495
6 6h15 6 15 375
Just to give an alternative approach, with roughly the same elements as above, you could do:
df = pd.DataFrame(data=["8h30", "14h07", "08h30", "7h50", "8h0 ", "8h15", "6h15"],
columns=["time"])
First split your column on the "h":
hm = df["time"].str.split("h", expand=True)
Then combine the columns again, zero-padding hours and minutes to make valid time strings:
df2 = hm[0].str.strip().str.zfill(2) + hm[1].str.strip().str.zfill(2)
Then convert the string column with proper values to a date time column:
df3 = pd.to_datetime(df2, format="%H%M")
Finally, calculate the number of minutes by subtracting a zero time (to get timedeltas) and dividing by a one-minute timedelta:
zerotime = pd.to_datetime("0000", format="%H%M")
df['minutes'] = (df3 - zerotime) / pd.Timedelta(minutes=1)
The results look like:
time minutes
0 8h30 510.0
1 14h07 847.0
2 08h30 510.0
3 7h50 470.0
4 8h0 480.0
5 8h15 495.0
6 6h15 375.0

Group by list of different time ranges in Pandas

Edit: Changing example to use Timedelta indices.
I have a DataFrame of different time ranges that represent indices in my main DataFrame, e.g.:
import numpy as np
import pandas as pd
ranges = pd.DataFrame(data=np.array([[1, 10, 20], [3, 15, 30]]).T, columns=["Start", "Stop"])
ranges = ranges.apply(pd.to_timedelta, unit="s")
ranges
Start Stop
0 0 days 00:00:01 0 days 00:00:03
1 0 days 00:00:10 0 days 00:00:15
2 0 days 00:00:20 0 days 00:00:30
my_data= pd.DataFrame(data=list(range(0,40*5,5)), columns=["data"])
my_data.index = pd.to_timedelta(my_data.index, unit="s")
I want to calculate the averages of the data in my_data for each of the time ranges in ranges. How can I do this?
One option would be as follows:
ranges.apply(lambda row: my_data.loc[row["Start"]:row["Stop"]].iloc[:-1].mean(), axis=1)
data
0 7.5
1 60.0
2 122.5
But can we do this without apply?
Here is one way to approach it:
Generate the timedeltas and concatenate into a single block:
# note the use of closed='left' (`Stop` is not included in the build)
timedelta = [pd.timedelta_range(a, b, closed='left', freq='1s')
             for a, b in zip(ranges.Start, ranges.Stop)]
timedelta = timedelta[0].append(timedelta[1:])
Get the grouping which will be used for the groupby and aggregation:
counts = ranges.Stop.sub(ranges.Start).dt.seconds
counts = np.arange(counts.size).repeat(counts)
Group by and aggregate:
my_data.loc[timedelta].groupby(counts).mean()
data
0 7.5
1 60.0
2 122.5
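Another option, if the ranges never overlap, is to let an IntervalIndex do the lookup; a rough sketch (reusing ranges and my_data from the question, other names purely illustrative):
# build left-closed intervals from the ranges and locate each data timestamp
iv = pd.IntervalIndex.from_arrays(ranges.Start, ranges.Stop, closed='left')
pos = iv.get_indexer(my_data.index)   # -1 for rows outside every range
my_data[pos >= 0].groupby(pos[pos >= 0]).mean()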

Pandas : SQL SelfJoin With Date Criteria

One query I often do in SQL within a relational database is to join a table back to itself and summarize each row based on records for the same id either backwards or forward in time.
For example, assume table1 has columns 'ID', 'Date', 'Var1'.
In SQL I could sum var1 for the past 3 months for each record like this:
Select a.ID, a.Date, sum(b.Var1) as sum_var1
from table1 a
left outer join table1 b
on a.ID = b.ID
and months_between(a.date,b.date) <0
and months_between(a.date,b.date) > -3
Is there any way to do this in Pandas?
It seems you need GroupBy + rolling. Implementing the logic in precisely the same way it is written in SQL is likely to be expensive as it will involve repeated loops. Let's take an example dataframe:
Date ID Var1
0 2015-01-01 1 0
1 2015-02-01 1 1
2 2015-03-01 1 2
3 2015-04-01 1 3
4 2015-05-01 1 4
5 2015-01-01 2 5
6 2015-02-01 2 6
7 2015-03-01 2 7
8 2015-04-01 2 8
9 2015-05-01 2 9
You can add a column which, by group, looks back and sums a variable over a fixed period. First define a function utilizing pd.Series.rolling:
def lookbacker(x):
    """Sum over past 70 days"""
    return x.rolling('70D').sum().astype(int)
Then apply it on a GroupBy object and extract values for assignment:
df['Lookback_Sum'] = df.set_index('Date').groupby('ID')['Var1'].apply(lookbacker).values
print(df)
Date ID Var1 Lookback_Sum
0 2015-01-01 1 0 0
1 2015-02-01 1 1 1
2 2015-03-01 1 2 3
3 2015-04-01 1 3 6
4 2015-05-01 1 4 9
5 2015-01-01 2 5 5
6 2015-02-01 2 6 11
7 2015-03-01 2 7 18
8 2015-04-01 2 8 21
9 2015-05-01 2 9 24
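One caveat: rolling('70D') includes the current row in its window, whereas the SQL query's months_between(a.date, b.date) < 0 excludes it. If you need the strict SQL semantics, a minimal adjustment (the new column name is just for illustration) is to subtract the row's own value afterwards:
# exclude the current row from the lookback, mirroring months_between(...) < 0
df['Lookback_Sum_excl'] = df['Lookback_Sum'] - df['Var1']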
It appears pd.Series.rolling does not work with months, e.g. using '2M' (2 months) instead of '70D' (70 days) gives ValueError: <2 * MonthEnds> is a non-fixed frequency. This makes sense since a "month" is ambiguous given months have different numbers of days.
Another point worth mentioning is that you can use GroupBy + rolling directly, and possibly more efficiently, by bypassing apply, but this requires ensuring your index is monotonic. For example, via sort_index:
df['Lookback_Sum'] = df.set_index('Date').sort_index() \
                       .groupby('ID')['Var1'].rolling('70D').sum() \
                       .astype(int).values
I don't think pandas.DataFrame.rolling() supports rolling-window aggregation by some number of months; currently, you must specify a fixed number of days, or other fixed-length time period.
But as #jpp mentioned, you can use python loops to perform rolling aggregation over a window size specified in calendar months, where the number of days in each window will vary, depending on what part of the calendar you're rolling over.
The following approach builds on this SO answer as well as #jpp's:
# Build some example data:
# 3 unique IDs, each with 365 samples, one sample per day throughout 2015
df = pd.DataFrame({'Date': pd.date_range('2015-01-01', '2015-12-31', freq='D'),
                   'Var1': list(range(365))})
df = pd.concat([df] * 3)
df['ID'] = [1]*365 + [2]*365 + [3]*365
df.head()
Date Var1 ID
0 2015-01-01 0 1
1 2015-01-02 1 1
2 2015-01-03 2 1
3 2015-01-04 3 1
4 2015-01-05 4 1
# Define a lookback function that mimics rolling aggregation,
# but uses DateOffset() slicing, rather than a window of fixed size.
# Use .count() here as a sanity check; you will need .sum()
def lookbacker(ser):
    return pd.Series([ser.loc[d - pd.offsets.DateOffset(months=3):d].count()
                      for d in ser.index])
# By default, groupby.agg output is sorted by key. So make sure to
# sort df by (ID, Date) before inserting the flattened groupby result
# into a new column
df.sort_values(['ID', 'Date'], inplace=True)
df.set_index('Date', inplace=True)
df['window_size'] = df.groupby('ID')['Var1'].apply(lookbacker).values
# Manually check the resulting window sizes
df.head()
Var1 ID window_size
Date
2015-01-01 0 1 1
2015-01-02 1 1 2
2015-01-03 2 1 3
2015-01-04 3 1 4
2015-01-05 4 1 5
df.tail()
Var1 ID window_size
Date
2015-12-27 360 3 92
2015-12-28 361 3 92
2015-12-29 362 3 92
2015-12-30 363 3 92
2015-12-31 364 3 93
df[df.ID == 1].loc['2015-05-25':'2015-06-05']
Var1 ID window_size
Date
2015-05-25 144 1 90
2015-05-26 145 1 90
2015-05-27 146 1 90
2015-05-28 147 1 90
2015-05-29 148 1 91
2015-05-30 149 1 92
2015-05-31 150 1 93
2015-06-01 151 1 93
2015-06-02 152 1 93
2015-06-03 153 1 93
2015-06-04 154 1 93
2015-06-05 155 1 93
The last column gives the lookback window size in days, looking back from that date, including both the start and end dates.
Looking "3 months" before 2016-05-31 would land you at 2015-02-31, but February has only 28 days in 2015. As you can see in the sequence 90, 91, 92, 93 in the above sanity check, This DateOffset approach maps the last four days in May to the last day in February:
pd.to_datetime('2015-05-31') - pd.offsets.DateOffset(months=3)
Timestamp('2015-02-28 00:00:00')
pd.to_datetime('2015-05-30') - pd.offsets.DateOffset(months=3)
Timestamp('2015-02-28 00:00:00')
pd.to_datetime('2015-05-29') - pd.offsets.DateOffset(months=3)
Timestamp('2015-02-28 00:00:00')
pd.to_datetime('2015-05-28') - pd.offsets.DateOffset(months=3)
Timestamp('2015-02-28 00:00:00')
I don't know if this matches SQL's behaviour, but in any case, you'll want to test this and decide if this makes sense in your case.
You could use a lambda to achieve it:
table1['sum_var1'] = table1.apply(lambda row: findSum(row), axis=1)
together with an equivalent of SQL's months_between. The complete example is:
from datetime import datetime
import datetime as dt
import pandas as pd
def months_between(date1, date2):
    if date1.day == date2.day:
        return (date1.year - date2.year) * 12 + date1.month - date2.month
    # if both are the last day of their month
    if date1.month != (date1 + dt.timedelta(days=1)).month:
        if date2.month != (date2 + dt.timedelta(days=1)).month:
            return date1.month - date2.month
    return (date1 - date2).days / 31

def findSum(cRow):
    table1['month_diff'] = table1['Date'].apply(months_between, date2=cRow['Date'])
    filtered_table = table1[(table1["month_diff"] < 0) & (table1["month_diff"] > -3) & (table1['ID'] == cRow['ID'])]
    if filtered_table.empty:
        return 0
    return filtered_table['Var1'].sum()
table1 = pd.DataFrame(columns = ['ID', 'Date', 'Var1'])
table1.loc[len(table1)] = [1, datetime.strptime('2015-01-01','%Y-%m-%d'), 0]
table1.loc[len(table1)] = [1, datetime.strptime('2015-02-01','%Y-%m-%d'), 1]
table1.loc[len(table1)] = [1, datetime.strptime('2015-03-01','%Y-%m-%d'), 2]
table1.loc[len(table1)] = [1, datetime.strptime('2015-04-01','%Y-%m-%d'), 3]
table1.loc[len(table1)] = [1, datetime.strptime('2015-05-01','%Y-%m-%d'), 4]
table1.loc[len(table1)] = [2, datetime.strptime('2015-01-01','%Y-%m-%d'), 5]
table1.loc[len(table1)] = [2, datetime.strptime('2015-02-01','%Y-%m-%d'), 6]
table1.loc[len(table1)] = [2, datetime.strptime('2015-03-01','%Y-%m-%d'), 7]
table1.loc[len(table1)] = [2, datetime.strptime('2015-04-01','%Y-%m-%d'), 8]
table1.loc[len(table1)] = [2, datetime.strptime('2015-05-01','%Y-%m-%d'), 9]
table1['sum_var1'] = table1.apply(lambda row: findSum(row), axis=1)
table1.drop(columns=['month_diff'], inplace=True)
print(table1)

Pandas elements per week between start date and end date

I'm starting from a dataframe that has a start date and an end date, for instance:
ID START END A
0 2014-04-09 2014-04-15 5
1 2018-06-05 2018-07-01 8
2 2018-06-05 2018-07-01 7
And I'm trying to find, for each week, how many elements were started but not ended at that point.
For instance, in the DF above:
Week-Monday N
2014-04-07 1
2014-04-14 1
2014-04-21 0
...
2018-06-04 2
...
Something like the below doesn't quite work, since it only resamples on end date:
df = df.resample("W-Mon", on="END").sum()
I don't know how to integrate both conditions: that the occurrences be after the start date, yet before the end date.
You can start from here:
import pandas as pd
df = pd.DataFrame({'ID': [0, 1, 2],
                   'START': ['2014-04-09', '2018-06-05', '2018-06-05'],
                   'END': ['2014-04-15', '2018-07-01', '2018-07-01'],
                   'A': [5, 8, 7]})
1- Find the week number for each START and each END, and find the Week-Monday.
import datetime, time
from datetime import timedelta
df.loc[:,'startWeek'] = df.START.apply(lambda x: datetime.datetime.strptime(x,'%Y-%m-%d').isocalendar()[1])
df.loc[:,'endWeek'] = df.END.apply(lambda x: datetime.datetime.strptime(x,'%Y-%m-%d').isocalendar()[1])
df.loc[:, 'Week-Monday'] = df.START.apply(lambda x: datetime.datetime.strptime(x,'%Y-%m-%d')- timedelta(days=datetime.datetime.strptime(x,'%Y-%m-%d').weekday()))
2- Check whether they are the same; if they are, the case ended during the same week it started.
def endedNotSameWeek(row):
    if row['startWeek'] != row['endWeek']:
        return 1
    return 0
df.loc[:,'NotSameWeek'] = df.apply(endedNotSameWeek, axis=1)
print(df)
Output:
ID START END A startWeek endWeek Week-Monday NotSameWeek
0 0 2014-04-09 2014-04-15 5 15 16 2014-04-07 1
1 1 2018-06-05 2018-07-01 8 23 26 2018-06-04 1
2 2 2018-06-05 2018-07-01 7 23 26 2018-06-04 1
3- Group by Week-Monday to get the number of cases that did not end during the same week.
df.groupby('Week-Monday')['NotSameWeek'].sum().reset_index(name='N')
Week-Monday N
0 2014-04-07 1
1 2018-06-04 2
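Note that this only counts a case in the week it started. If you literally need, for every Monday-anchored week, how many rows were still open at some point during that week (as in the expected output), a rough sketch of one way to read it (not tested against edge cases):
df['START'] = pd.to_datetime(df['START'])
df['END'] = pd.to_datetime(df['END'])
# Mondays from the week containing the earliest START up to the latest END
first_monday = df['START'].min().normalize() - pd.Timedelta(days=df['START'].min().weekday())
weeks = pd.date_range(first_monday, df['END'].max(), freq='W-MON')
# a row counts for a week if it starts before the week ends
# and does not end before the week begins
n_open = pd.Series([((df['START'] < w + pd.Timedelta(days=7)) & (df['END'] >= w)).sum()
                    for w in weeks], index=weeks, name='N')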

Dividing a series containing datetime by a series containing an integer in Pandas

I have a series s1 which is of datetime type and represents the length of a range between a start time and an end time - typical values are 7 days, 4 hours 5 mins, etc. I have a series s2 which contains integers for the number of events that happened in that time range.
I want to calculate the event frequency by:
event_freq = s1 / s2
I get the error:
cannot operate on a series with out a rhs of a series/ndarray of type datetime64[ns] or a timedelta
What's the best way to fix this?
Thanks in advance!
EXAMPLE of s1 is:
some_id
1 2012-09-02 09:18:40
3 2012-04-02 09:36:39
4 2012-02-02 09:58:02
5 2013-02-09 14:31:52
6 2012-01-09 12:59:20
EXAMPLE of s2 is:
some_id
1 3
3 1
4 1
5 2
6 1
8 1
10 3
12 2
This might possibly be a bug but what works is to operate on the underlying numpy array like so:
import pandas as pd
from pandas import Series
startdate = Series(pd.date_range('2013-01-01', '2013-01-03'))
enddate = Series(pd.date_range('2013-03-01', '2013-03-03'))
s1 = enddate - startdate
s2 = Series([2, 3, 4])
event_freq = Series(s1.values / s2)
Here are the Series:
>>> s1
0 59 days, 00:00:00
1 59 days, 00:00:00
2 59 days, 00:00:00
dtype: timedelta64[ns]
>>> s2
0 2
1 3
2 4
dtype: int64
>>> event_freq
0 29 days, 12:00:00
1 19 days, 16:00:00
2 14 days, 18:00:00
dtype: timedelta64[ns]
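For what it's worth, on recent pandas versions the detour through .values should no longer be necessary; dividing a timedelta64 Series by an integer Series works elementwise. A quick sketch (not tested on the 2013-era pandas this answer targets):
import pandas as pd
s1 = pd.Series(pd.to_timedelta(['59 days', '59 days', '59 days']))
s2 = pd.Series([2, 3, 4])
event_freq = s1 / s2   # timedelta64 result: 29 days 12:00, 19 days 16:00, 14 days 18:00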
