I have the following dataset:
date event next_event duration_Minutes
2021-09-09 22:30:00 1 2021-09-09 23:00:00 30
2021-09-09 23:00:00 2 2021-09-09 23:10:00 10
2021-09-09 23:10:00 1 2021-09-09 23:50:00 40
2021-09-09 23:50:00 4 2021-09-10 00:50:00 60
2021-09-10 00:50:00 4 2021-09-12 00:50:00 2880
The main problem is that I would like to split the multi-day events into separate events in the following way: I would like to have the event duration from 2021-09-09 23:50:00 until 2021-09-10 00:00:00, then the duration from 2021-09-10 00:00:00 to 2021-09-10 00:50:00, and so on. This would be useful because afterwards I need to group the events by day and calculate the duration of each event per day, so I would like to fix these situations in which the day changes between events.
I would like to obtain something like this:
date event next_event duration_Minutes
2021-09-09 22:30:00 1 2021-09-09 23:00:00 30
2021-09-09 23:00:00 2 2021-09-09 23:10:00 10
2021-09-09 23:10:00 1 2021-09-09 23:50:00 40
2021-09-09 23:50:00 4 2021-09-10 00:00:00 10
2021-09-10 00:00:00 4 2021-09-10 00:50:00 50
2021-09-10 00:50:00 4 2021-09-11 00:00:00 1390
2021-09-11 00:00:00 4 2021-09-12 00:00:00 1440
2021-09-12 00:00:00 4 2021-09-12 00:50:00 50
It should be able to handle situations in which we don't have an event for an entire day or more like in the example.
My current solution for now is:
import numpy as np
import pandas as pd

first_record_hour_ts = df.index.floor('H')[0]
last_record_hour_ts = df.index.floor('H')[-1]
# Create a series from the first to the last date containing NaN
df_to_join = pd.Series(np.nan, index=pd.date_range(first_record_hour_ts, last_record_hour_ts, freq='H'))
df_to_join = pd.DataFrame(df_to_join)
# Concatenate with the current status dataframe
df = pd.concat([df, df_to_join[~df_to_join.index.isin(df.index)]]).sort_index()
# Forward fill the NaNs
df.fillna(method='ffill', inplace=True)
# Positional shift of the index to get each row's next timestamp
df['next_event'] = df.index.to_series().shift(-1)
# Calculate the delta between the two statuses
df['duration'] = df['next_event'] - df.index
# Convert into minutes
df['duration_Minutes'] = df['duration'].apply(lambda x: x.total_seconds() // 60)
This doesn't solve exactly the problem, but I think it may serve my goal, which is being able to group by event and by day at the end.
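For reference, the aggregation I am ultimately after (just a sketch, assuming the split rows are in place and the index holds the start timestamps) would be something like:

# Total minutes per calendar day and per event, once no row crosses midnight.
daily_per_event = df.groupby([df.index.floor('D'), 'event'])['duration_Minutes'].sum()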
Ok, the code below looks a bit long -- and there's certainly a better/more efficient/shorter way of doing this. But I think it's reasonably simple to follow along.
split_datetime_span_by_day below takes two dates: start_date and end_date. In your case, it would be date and next_event from your source data.
The function then checks whether that time period (start -> end) spans over midnight. If it doesn't, it returns the start date, the end date, and the time period in seconds. If it does span over midnight, it creates a new segment (start -> midnight), and then calls itself again (i.e. recurses), and the process continues until the time period does not span over midnight.
Just a note: the returned segment list is made up of tuples of (start, end, nmb_seconds). I'm returning the number of seconds, not the number of minutes as in your question, because I didn't know how you wanted to round the seconds (up, down, etc.). That's left as an exercise for the reader :-)
from datetime import datetime, timedelta
def split_datetime_span_by_day(start_date, end_date, split_segments=None):
    assert start_date < end_date  # sanity check
    # when is the next midnight after start_date?
    # adapted from https://ispycode.com/Blog/python/2016-07/Get-Midnight-Today
    start_next_midnight = datetime.combine(start_date, datetime.min.time()) + timedelta(days=1)
    if split_segments is None:
        split_segments = []
    if end_date < start_next_midnight:
        # end date is before next midnight, no split necessary
        return split_segments + [(
            start_date,
            end_date,
            (end_date - start_date).total_seconds()
        )]
    # otherwise, split at next midnight...
    split_segments += [(
        start_date,
        start_next_midnight,
        (start_next_midnight - start_date).total_seconds()
    )]
    if (end_date - start_next_midnight).total_seconds() > 0:
        # ...and recurse to get next segment
        return split_datetime_span_by_day(
            start_date=start_next_midnight,
            end_date=end_date,
            split_segments=split_segments
        )
    else:
        # case where start_next_midnight == end_date, i.e. end_date is midnight
        # don't split & create a 0-second segment
        return split_segments
# test case:
start_date = datetime.strptime('2021-09-12 00:00:00', '%Y-%m-%d %H:%M:%S')
end_date = datetime.strptime('2021-09-14 01:00:00', '%Y-%m-%d %H:%M:%S')
print(split_datetime_span_by_day(start_date=start_date, end_date=end_date))
# returned values:
# [
# (datetime.datetime(2021, 9, 12, 0, 0), datetime.datetime(2021, 9, 13, 0, 0), 86400.0),
# (datetime.datetime(2021, 9, 13, 0, 0), datetime.datetime(2021, 9, 14, 0, 0), 86400.0),
# (datetime.datetime(2021, 9, 14, 0, 0), datetime.datetime(2021, 9, 14, 1, 0), 3600.0)
# ]
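To connect this back to your DataFrame, one possible way (just a sketch, assuming date and next_event are parsed datetime columns and event is a column, as in your sample) is to explode each row into its per-day segments:

import pandas as pd

rows = []
for _, r in df.iterrows():
    for seg_start, seg_end, seg_seconds in split_datetime_span_by_day(r['date'], r['next_event']):
        rows.append({
            'date': seg_start,
            'event': r['event'],
            'next_event': seg_end,
            'duration_Minutes': seg_seconds / 60,  # pick your own rounding here
        })
df_split = pd.DataFrame(rows)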
TLDR: When downsampling a Series with a DatetimeIndex, e.g. from hourly to daily values, how can I ensure the result only contains time periods that are fully present in the original?
Example
I'll explain with a simplified example.
Starting point: daily values
import pandas as pd
# Source data: 2 full days, AND SOME ADDITIONAL HOURS.
i = pd.date_range('2022-03-04 22:00', '2022-03-07 09:00', freq='H')
hourly = pd.Series(range(len(i)), i)
I want to resample to days, but keep only those days that are completely present in the source series.
What is working: calendar days
If a day is defined as a normal calendar day, i.e., midnight to midnight, we can do this in 2 steps:
# 1) Resample.
grouper = pd.Grouper(freq='D')
daily = hourly.groupby(grouper).sum() # or .resample('D').sum()
# 2022-03-04 1
# 2022-03-05 324
# 2022-03-06 900
# 2022-03-07 545
# Freq: D, dtype: int64
# 2) Discard incomplete days.
# (reject the days that start before the start of the first hour)
incomplete_left = daily.index < hourly.index[0]
# (reject the days that end after the end of the last hour)
incomplete_right = daily.index + pd.offsets.Day(1) > hourly.index[-1] + pd.offsets.Hour(1)
# Trim.
daily_trimmed = daily[~incomplete_left & ~incomplete_right] # Keeps 2022-03-05 and -06. Good.
# 2022-03-05 324
# 2022-03-06 900
# Freq: D, dtype: int64
So far, so good.
What is not working: custom starting point
But what if a day is defined as starting at 06:00 and ending at 06:00 the next calendar day? I can do the resampling, but I don't know how to check which timestamps to reject.
# 1) Resampling is doable:
import datetime

def gasday(ts: pd.Timestamp) -> pd.Timestamp:
    day = ts.floor("D")
    if ts.time() < datetime.time(hour=6):
        day = day - pd.DateOffset(days=1)  # get previous day
    return day
daily2 = hourly.groupby(gasday).sum()
# 2022-03-04 28
# 2022-03-05 468
# 2022-03-06 1044
# 2022-03-07 230
# dtype: int64
# 2) ... but how to find the days that must be rejected??
Remarks
I'm using DatetimeIndex instead of PeriodIndex, which is why we have the somewhat complicated formula for incomplete_right. The reason for using DatetimeIndex is that I'm generally dealing with timezones (not shown in this example). The timestamps in the DatetimeIndex are left-bound.
In my use-case, I'm given the grouper function (gasday in this case) without knowing what the cutoff time is (06:00 in this case).
If your data is guaranteed to be hourly, then you can just count the records:
daily = hourly.groupby(pd.Grouper(freq='D', offset='6H')).agg(['size','sum'])
Output:
size sum
2022-03-04 06:00:00 8 28
2022-03-05 06:00:00 24 468
2022-03-06 06:00:00 24 1044
2022-03-07 06:00:00 4 230
Looking at the data, it's fairly easy to see which ones should be dropped:
complete_daily = daily.query('size==24')
Output:
size sum
2022-03-05 06:00:00 24 468
2022-03-06 06:00:00 24 1044
Update: You can also try:
daily = (hourly.reset_index()
               .groupby(pd.Grouper(key='index', freq='D', offset='6H'))
               .agg(start=('index', 'min'), end=('index', 'max'), total=(0, 'sum')))
Output:
start end total
index
2022-03-04 06:00:00 2022-03-04 22:00:00 2022-03-05 05:00:00 28
2022-03-05 06:00:00 2022-03-05 06:00:00 2022-03-06 05:00:00 468
2022-03-06 06:00:00 2022-03-06 06:00:00 2022-03-07 05:00:00 1044
2022-03-07 06:00:00 2022-03-07 06:00:00 2022-03-07 09:00:00 230
You can then query for complete days, e.g.
daily['start'].dt.hour.eq(6) & daily['end'].dt.hour.eq(5)
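For example, putting both conditions together (based on the output above, this keeps the 2022-03-05 and 2022-03-06 rows):

mask = daily['start'].dt.hour.eq(6) & daily['end'].dt.hour.eq(5)
complete_daily = daily[mask]
#                                   start                 end  total
# 2022-03-05 06:00:00 2022-03-05 06:00:00 2022-03-06 05:00:00    468
# 2022-03-06 06:00:00 2022-03-06 06:00:00 2022-03-07 05:00:00   1044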
I've decided to use the following logic:
If the first timestamp in hourly.index belongs to the same group (= same day) as the timestamp immediately before it, then the group is not fully present in the hourly series and the first datapoint in daily must be removed. If it does not belong to the same group, it is really the start of a new group, the group is fully present in hourly, and no change is needed to daily.
Likewise, if the end of the last timestamp in hourly.index belongs to the same day as the timestamp immediately before* it, then the group is also not fully present in the hourly series, and the final datapoint in daily must be removed.
eps = pd.Timedelta(seconds=1)

start = hourly.index[0]
if gasday(start) == gasday(start - eps):
    daily2 = daily2.iloc[1:]

end = hourly.index[-1] + pd.offsets.Hour(1)
if gasday(end - eps) == gasday(end):
    daily2 = daily2.iloc[:-1]
This works, and keeps the two days (2022-03-05 and -06) as wanted. It includes (-04) if we start i at 2022-03-04 06:00 or earlier. Likewise, it keeps (-06) only if we end i at 2022-03-07 05:00 or later.
*) Why before? Well, the 05:00 timestamp denotes the left-closed interval [05:00, 06:00). 06:00 is actually the start of the next hour. Therefore, if this 06:00 timestamp belongs to the same day as the moment immediately before it (05:59:59), then we do not have the complete day.
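A quick illustration of that boundary behaviour, using the gasday function from above:

eps = pd.Timedelta(seconds=1)
ts = pd.Timestamp('2022-03-07 06:00')
gasday(ts)        # Timestamp('2022-03-07') -- 06:00 starts the new gas day
gasday(ts - eps)  # Timestamp('2022-03-06') -- 05:59:59 still belongs to the old one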
Now the only issues I have left is the following: I'd like to abstract this all away, like so:
def resample_and_trim(source, grouper):
    agg = source.groupby(grouper).sum()
    eps = pd.Timedelta(seconds=1)

    start = source.index[0]
    if grouper(start) == grouper(start - eps):
        agg = agg.iloc[1:]

    end = source.index[-1] + pd.offsets.Hour(1)
    if grouper(end - eps) == grouper(end):
        agg = agg.iloc[:-1]

    return agg
And then be able to call this in both cases. The latter works:
daily2 = resample_and_trim(hourly, gasday)
# 2022-03-05 468
# 2022-03-06 1044
# dtype: int64
But the former does not:
daily = resample_and_trim(hourly, pd.Grouper(freq='D'))
# Error in `grouper(start)`
# TypeError: 'TimeGrouper' object is not callable
I'll doctor around a bit more; if I find the solution, I'll edit this answer.
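One direction I may try (an untested sketch, not a confirmed fix): normalize a fixed-frequency pd.Grouper into an equivalent labelling callable before the membership checks. Anchored or irregular frequencies would need more care.

def as_label_fn(grouper):
    # Sketch only: pd.Grouper is not callable, but for plain fixed
    # frequencies ('H', 'D', ...) flooring the timestamp gives the same
    # group label, so we can wrap it in a lambda.
    if callable(grouper):
        return grouper
    return lambda ts: ts.floor(grouper.freq)

# daily = resample_and_trim(hourly, as_label_fn(pd.Grouper(freq='D')))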
I am trying to output the days on my calendar, something like: 2021-02-02 2021-02-03 2021-02-04 2021-02-05 etc.
I copied this code from https://www.tutorialbrain.com/python-calendar/ so I don't understand why I get the error.
import calendar
year = 2021
month = 2
cal_obj = calendar.Calendar(firstweekday=1)
dates = cal_obj.itermonthdays(year, month)
for i in dates:
    i = str(i)
    if i[6] == "2":
        print(i, end="")
Error:
if i[6] == "2":
IndexError: string index out of range
Process finished with exit code 1
There is a difference between your code and their code. It's very subtle, but it's there:
Yours:
dates = cal_obj.itermonthdays(year, month)
^^^^ days
Theirs:
dates = cal_obj.itermonthdates(year, month)
^^^^^ dates
itermonthdays returns the days of the month as ints, while itermonthdates returns datetime.dates.
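With that one-character fix applied, the tutorial's filtering trick works, because str() of a datetime.date looks like '2021-02-03' and index 6 is the month's last digit:

import calendar

cal_obj = calendar.Calendar(firstweekday=1)
dates = cal_obj.itermonthdates(2021, 2)  # yields datetime.date objects
for d in dates:
    d = str(d)        # e.g. '2021-02-03'
    if d[6] == "2":   # month digit: keep only February, drop padding days
        print(d, end=" ")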
If your goal is to create a list of the calendar's dates, you can use the following as well:
import pandas as pd
datelist = list(pd.date_range(start="2021/01/01", end="2021/12/31").strftime("%Y-%m-%d"))
datelist
You can choose any start date or end date (if that date exists)
Output :
['2021-01-01',
'2021-01-02',
'2021-01-03',
'2021-01-04',
'2021-01-05',
'2021-01-06',
'2021-01-07',
'2021-01-08',
'2021-01-09',
'2021-01-10',
'2021-01-11',
'2021-01-12',
...
'2021-12-28',
'2021-12-29',
'2021-12-30',
'2021-12-31']
It seems like you are new to Python. i[6] indexes the seventh character of a string (or the seventh element of a list-like object). itermonthdays yields plain ints; converting one to a string gives something like "2", which has no index 6, hence the IndexError.
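You can see this directly:

import calendar

cal_obj = calendar.Calendar(firstweekday=1)
days = list(cal_obj.itermonthdays(2021, 2))
print(days[:8])  # [0, 0, 0, 0, 0, 0, 1, 2] -- padding zeros, then day numbers
# str(0) is just "0": one character, so "0"[6] raises IndexError.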
The same stuff can be achieved by using datetime library in the following way
import datetime

start_date = datetime.date(2021, 2, 1)  # set the start date in the form (year, month, day)
no_of_days = 30  # number of days you want to print
day_jump = datetime.timedelta(days=1)  # number of days to jump with each iteration, default 1
end_date = start_date + no_of_days * day_jump  # setting up the end date

for i in range((end_date - start_date).days):
    print(start_date + i * day_jump)
OUTPUT
2021-02-01
2021-02-02
2021-02-03
2021-02-04
2021-02-05
2021-02-06
2021-02-07
2021-02-08
2021-02-09
2021-02-10
2021-02-11
2021-02-12
2021-02-13
2021-02-14
2021-02-15
2021-02-16
2021-02-17
2021-02-18
2021-02-19
2021-02-20
2021-02-21
2021-02-22
2021-02-23
2021-02-24
2021-02-25
2021-02-26
2021-02-27
2021-02-28
2021-03-01
2021-03-02
This is my dataframe.
Start_hour End_date
23:58:00 00:26:00
23:56:00 00:01:00
23:18:00 23:36:00
How can I get in a new column the difference (in minutes) between these two columns?
>>> from datetime import datetime
>>>
>>> before = datetime.now()
>>> print('wait for more than 1 minute')
wait for more than 1 minute
>>> after = datetime.now()
>>> td = after - before
>>>
>>> td
datetime.timedelta(seconds=98, microseconds=389121)
>>> td.total_seconds()
98.389121
>>> td.total_seconds() / 60
1.6398186833333335
Then you can round it or use it as-is.
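Applied to the DataFrame in the question, a vectorized variant could look like this (a sketch, assuming the columns hold 'HH:MM:SS' strings and that an End_date earlier than Start_hour means the interval crosses midnight):

import pandas as pd

start = pd.to_timedelta(df['Start_hour'])
end = pd.to_timedelta(df['End_date'])
# Taking the difference modulo one day maps negative (midnight-crossing)
# differences back into the range [0h, 24h).
df['diff_minutes'] = ((end - start) % pd.Timedelta(days=1)).dt.total_seconds() / 60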
You can do something like this:
import pandas as pd

df = pd.DataFrame({
    'Start_hour': ['23:58:00', '23:56:00', '23:18:00'],
    'End_date': ['00:26:00', '00:01:00', '23:36:00']}
)
df['Start_hour'] = pd.to_datetime(df['Start_hour'])
df['End_date'] = pd.to_datetime(df['End_date'])

df['diff'] = df.apply(
    lambda row: (row['End_date'] - row['Start_hour']).seconds / 60,
    axis=1
)
print(df)
Start_hour End_date diff
0 2021-03-29 23:58:00 2021-03-29 00:26:00 28.0
1 2021-03-29 23:56:00 2021-03-29 00:01:00 5.0
2 2021-03-29 23:18:00 2021-03-29 23:36:00 18.0
You can also rearrange your dates as string again if you like:
df['Start_hour'] = df['Start_hour'].apply(lambda x: x.strftime('%H:%M:%S'))
df['End_date'] = df['End_date'].apply(lambda x: x.strftime('%H:%M:%S'))
print(df)
Output:
Start_hour End_date diff
0 23:58:00 00:26:00 28.0
1 23:56:00 00:01:00 5.0
2 23:18:00 23:36:00 18.0
Short answer:
from datetime import timedelta

df['interval'] = df['End_date'] - df['Start_hour']
df.loc[df['End_date'] < df['Start_hour'], 'interval'] += timedelta(hours=24)
Why so:
You are probably trying to solve the problem that your Start_hour and End_date values sometimes belong to different days, and that's why you can't just subtract one from the other.
If your time window never exceeds a 24-hour interval, you can use some modular arithmetic to deal with the 23:59:59 -> 00:00:00 border:
if End_date < Start_hour, this always means End_date belongs to the next day
this implies that if End_date - Start_hour < 0, we should add 24 hours to the difference to find the actual duration
The final formula is:
if rec['Start_hour'] < rec['End_date']:
    offset = timedelta(0)
else:
    offset = timedelta(hours=24)

rec['delta'] = offset + rec['End_date'] - rec['Start_hour']
To do the same with a pandas.DataFrame, we need to change the code accordingly. And that's how we get the snippet from the beginning of the answer.
from datetime import datetime, timedelta

import pandas as pd

df = pd.DataFrame([
    {'Start_hour': datetime(1, 1, 1, 23, 58, 0), 'End_date': datetime(1, 1, 1, 0, 26, 0)},
    {'Start_hour': datetime(1, 1, 1, 23, 58, 0), 'End_date': datetime(1, 1, 1, 23, 59, 0)},
])

# ...

df['interval'] = df['End_date'] - df['Start_hour']
df.loc[df['End_date'] < df['Start_hour'], 'interval'] += timedelta(hours=24)
> df
Start_hour End_date interval
0 0001-01-01 23:58:00 0001-01-01 00:26:00 0 days 00:28:00
1 0001-01-01 23:58:00 0001-01-01 23:59:00 0 days 00:01:00
My company uses a 4-4-5 calendar for reporting purposes. Each month (aka period) is 4 weeks long, except every 3rd month, which is 5 weeks long.
Pandas seems to have good support for custom calendar periods. However, I'm having trouble figuring out the correct frequency string or custom business month offset to achieve months for a 4-4-5 calendar.
For example:
import numpy as np
import pandas as pd

df_index = pd.date_range("2020-03-29", "2021-03-27", freq="D", name="date")
df = pd.DataFrame(
    index=df_index, columns=["a"], data=np.random.randint(0, 100, size=len(df_index))
)
df.groupby(pd.Grouper(level=0, freq="4W-SUN")).mean()
Grouping by 4 weeks starting on Sunday results in the following. The first three month start dates are correct, but I need every third month to be 5 weeks long. The 4th month start date should be 2020-06-28.
a
date
2020-03-29 16.000000
2020-04-26 50.250000
2020-05-24 39.071429
2020-06-21 52.464286
2020-07-19 41.535714
2020-08-16 46.178571
2020-09-13 51.857143
2020-10-11 44.250000
2020-11-08 47.714286
2020-12-06 56.892857
2021-01-03 55.821429
2021-01-31 53.464286
2021-02-28 53.607143
2021-03-28 45.037037
Essentially what I'd like to achieve is something like this:
a
date
2020-03-29 20.000000
2020-04-26 50.750000
2020-05-24 49.750000
2020-06-28 49.964286
2020-07-26 52.214286
2020-08-23 47.714286
2020-09-27 46.250000
2020-10-25 53.357143
2020-11-22 52.035714
2020-12-27 39.750000
2021-01-24 43.428571
2021-02-21 49.392857
Pandas currently supports only yearly and quarterly 52-53-week calendars (aka 4-4-5 calendars).
See pandas.tseries.offsets.FY5253 and pandas.tseries.offsets.FY5253Quarter.
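For instance, to see where those quarterly anchors fall (weekday=6 means weeks ending on Sunday; the result date itself depends on the offset's defaults):

import pandas as pd

q = pd.tseries.offsets.FY5253Quarter(weekday=6)
# Rolls a date forward to the next 52-53-week quarter boundary.
pd.Timestamp("2020-03-29") + q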
df_index = pd.date_range("2020-03-29", "2021-03-27", freq="D", name="date")
df = pd.DataFrame(index=df_index)
df['a'] = np.random.randint(0, 100, df.shape[0])
So indeed you need some more work to get to week level and maintain a 4-4-5 calendar. You could align to quarters using the native pandas offset and fill in the 4-4-5 week pattern manually.
def date_range(start, end, offset_array, name=None):
    start = pd.to_datetime(start)
    end = pd.to_datetime(end)
    index = []
    start -= offset_array[0]
    while start < end:
        for x in offset_array:
            start += x
            if start > end:
                break
            index.append(start)
    return pd.Series(index, name=name)
This function takes a list of offsets rather than a regular frequency period, so it allows moving from date to date following the offsets in the given array:
offset_445 = [
    pd.tseries.offsets.FY5253Quarter(weekday=6),
    4 * pd.tseries.offsets.Week(weekday=6),
    4 * pd.tseries.offsets.Week(weekday=6),
]
df_index_445 = date_range("2020-03-29", "2021-03-27", offset_445, name='date')
Out:
0 2020-05-03
1 2020-05-31
2 2020-06-28
3 2020-08-02
4 2020-08-30
5 2020-09-27
6 2020-11-01
7 2020-11-29
8 2020-12-27
9 2021-01-31
10 2021-02-28
Name: date, dtype: datetime64[ns]
Once the index is created, then it's back to aggregations logic to get the data in the right row buckets. Assuming that you want the mean for the start of each 4 or 5 week period, according to the df_index_445 you have generated, it could look like this:
# calculate the mean on reindex groups
reindex = df_index_445.searchsorted(df.index, side='right') - 1
res = df.groupby(reindex).mean()
# filter valid output
res = res[res.index>=0]
res.index = df_index_445
Out:
a
2020-05-03 47.857143
2020-05-31 53.071429
2020-06-28 49.257143
2020-08-02 40.142857
2020-08-30 47.250000
2020-09-27 52.485714
2020-11-01 48.285714
2020-11-29 56.178571
2020-12-27 51.428571
2021-01-31 50.464286
2021-02-28 53.642857
Note that since the frequency is not regular, pandas will set the datetime index frequency to None.
I have days like this:
eventday_idxs
2005-01-07 00:00:00
2005-01-31 00:00:00
2005-02-15 00:00:00
2005-04-18 00:00:00
2005-05-11 00:00:00
2005-08-12 00:00:00
2005-08-15 00:00:00
2005-09-06 00:00:00
2005-09-19 00:00:00
2005-10-12 00:00:00
2005-10-13 00:00:00
2005-10-20 00:00:00
2006-01-10 00:00:00
2006-01-30 00:00:00
2006-02-10 00:00:00
2006-03-29 00:00:00
I want to do calculations on From:To ranges of it, like this, on an AAPL stock dataset.
As I am a beginner in Pandas, I use a loop and do it like this:
import datetime
from datetime import timedelta

import pandas as pd

aap1_10_years = pd.io.data.get_data_yahoo('AAPL',
                                          start=datetime.datetime(2004, 12, 10),
                                          end=datetime.datetime(2014, 12, 10))

one_day = timedelta(days=1)
for i, ind in enumerate(eventday_idxs):
    try:
        do_calculations(aap1_10_years[ind: eventday_idxs[i + 1] - one_day]['High'])
    except IndexError:
        do_calculations(aap1_10_years[ind:]['High'])
How can I apply do_calculations without loops like this? Loops like this are discouraged in pandas because they are slow, right?
The time spans between the events are not regular:
In [141]: eventday_idxs.diff().head()
Out[141]:
0 NaT
1 24 days
2 15 days
3 62 days
4 23 days
Name: 0, dtype: timedelta64[ns]
so we can't express the calculation using rolling_apply. However, if we could
assign an "event number" to each of the rows in aap1_10_years, then we could
groupby these event numbers and apply do_calculations to each group.
If we define:
# mark each event day with a 1
aap1_10_years.loc[eventday_idxs, 'event'] = 1
# use cumsum to assign an event number to each event range
aap1_10_years['event'] = aap1_10_years['event'].fillna(0).cumsum()
then aap1_10_years['event'] equals 1 for these rows:
In [144]: aap1_10_years.loc[aap1_10_years['event'] == 1, ['Close', 'event']]
Out[144]:
Close event
Date
2005-01-07 69.25 1
2005-01-10 68.96 1
2005-01-11 64.56 1
2005-01-12 65.46 1
2005-01-13 69.80 1
2005-01-14 70.20 1
2005-01-18 70.65 1
2005-01-19 69.88 1
2005-01-20 70.46 1
2005-01-21 70.49 1
2005-01-24 70.76 1
2005-01-25 72.05 1
2005-01-26 72.25 1
2005-01-27 72.64 1
2005-01-28 73.98 1
Thus event number 1 has been assigned to all the rows with dates between
2005-01-07 and 2005-01-28. And similarly, each of the other event ranges have been assigned a unique event number.
import datetime as DT
import pandas as pd
import pandas.io.data as pdata
eventday_idxs = pd.to_datetime(pd.read_table('data', header=None)[0])
aap1_10_years = pdata.get_data_yahoo(
    'AAPL',
    start=DT.datetime(2004, 12, 10),
    end=DT.datetime(2014, 12, 10))
# mark each event day with a 1
aap1_10_years.loc[eventday_idxs, 'event'] = 1
# use cumsum to assign an event number to each event range
aap1_10_years['event'] = aap1_10_years['event'].fillna(0).cumsum()
mask = aap1_10_years['event'] > 0
aap1_10_years.loc[mask].groupby(['event'])['High'].apply(do_calculations)
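Here do_calculations is whatever your per-range function is; for illustration only, a hypothetical stub could be:

def do_calculations(highs):
    # Hypothetical stand-in: summarize each event range.
    return pd.Series({'n_days': len(highs), 'max_high': highs.max()})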