Pandas groupby and get yesterday min value - python

How do I modify my code so that groupby returns the previous day's min instead of the current day's min? Please see the desired output below, which shows exactly what I am trying to achieve.
Data
np.random.seed(5)
series = pd.Series(np.random.choice([1,3,5], 10), index = pd.date_range('2014-01-01', '2014-01-04', freq = '8h'))
series
2014-01-01 00:00:00 5
2014-01-01 08:00:00 3
2014-01-01 16:00:00 5
2014-01-02 00:00:00 5
2014-01-02 08:00:00 1
2014-01-02 16:00:00 3
2014-01-03 00:00:00 1
2014-01-03 08:00:00 1
2014-01-03 16:00:00 5
2014-01-04 00:00:00 1
Output after groupby
series.groupby(series.index.date).transform(min)
2014-01-01 00:00:00 3
2014-01-01 08:00:00 3
2014-01-01 16:00:00 3
2014-01-02 00:00:00 1
2014-01-02 08:00:00 1
2014-01-02 16:00:00 1
2014-01-03 00:00:00 1
2014-01-03 08:00:00 1
2014-01-03 16:00:00 1
2014-01-04 00:00:00 1
Desired output (yesterday min)
2014-01-01 00:00:00 NaN
2014-01-01 08:00:00 NaN
2014-01-01 16:00:00 NaN
2014-01-02 00:00:00 3
2014-01-02 08:00:00 3
2014-01-02 16:00:00 3
2014-01-03 00:00:00 1
2014-01-03 08:00:00 1
2014-01-03 16:00:00 1
2014-01-04 00:00:00 1

You can swap the index to just the date, calculate min per day, shift it and swap the original index back:
# Swap the index to just the date component
s = series.set_axis(series.index.date)
# Calculate the min per day, and shift it
t = s.groupby(level=0).min().shift()
# Final assembly
s[t.index] = t
s.index = series.index
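A variant of the same idea can be sketched without overwriting s in place. This is only an illustration: the names days, daily_min, prev_min and result are hypothetical, the series values are typed in by hand from the question, and the sketch assumes the index is a sorted DatetimeIndex.

```python
import pandas as pd

# Values copied by hand from the question so the sketch does not depend on the RNG.
series = pd.Series(
    [5, 3, 5, 5, 1, 3, 1, 1, 5, 1],
    index=pd.date_range('2014-01-01', '2014-01-04', freq='8h'),
)

# Midnight-normalized timestamps: one label per calendar day, still a DatetimeIndex.
days = series.index.normalize()

# One min per day, then move each value down one row so a day sees the previous day's min.
daily_min = series.groupby(days).min()
prev_min = daily_min.shift()

# Broadcast the per-day values back onto the original 8-hourly index.
result = pd.Series(prev_min.reindex(days).to_numpy(), index=series.index)
print(result)
```

Note that shift() moves values by one row, so it picks up the previous day that is present in the data. If days can be missing and a strict calendar "yesterday" is wanted, daily_min.shift(freq='D') shifts the index by one day instead; that reading of the requirement is an assumption, not something the question states.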

Alternatively, compute the min per day, shift it by one day, and reindex it onto each row's date:
series[:] = series.groupby(series.index.date).min().shift().reindex(series.index.date)
series
Out[370]:
2014-01-01 00:00:00 NaN
2014-01-01 08:00:00 NaN
2014-01-01 16:00:00 NaN
2014-01-02 00:00:00 3.0
2014-01-02 08:00:00 3.0
2014-01-02 16:00:00 3.0
2014-01-03 00:00:00 1.0
2014-01-03 08:00:00 1.0
2014-01-03 16:00:00 1.0
2014-01-04 00:00:00 1.0
Freq: 8H, dtype: float64

Related

Replace loop with groupby for rolling daily calculation

How do I modify my code so that the Pandas rolling daily window resets each day? Please see the desired output below, which shows exactly what I am trying to achieve.
I think I may need to use groupby to get the same result, but I am unsure how to proceed.
Data
np.random.seed(5)
series = pd.Series(np.random.choice([1,3,5], 10), index = pd.date_range('2014-01-01', '2014-01-04', freq = '8h'))
series
2014-01-01 00:00:00 5
2014-01-01 08:00:00 3
2014-01-01 16:00:00 5
2014-01-02 00:00:00 5
2014-01-02 08:00:00 1
2014-01-02 16:00:00 3
2014-01-03 00:00:00 1
2014-01-03 08:00:00 1
2014-01-03 16:00:00 5
2014-01-04 00:00:00 1
Output after pandas rolling
series.rolling('D', min_periods=1).min()
2014-01-01 00:00:00 5.0
2014-01-01 08:00:00 3.0
2014-01-01 16:00:00 3.0
2014-01-02 00:00:00 3.0
2014-01-02 08:00:00 1.0
2014-01-02 16:00:00 1.0
2014-01-03 00:00:00 1.0
2014-01-03 08:00:00 1.0
2014-01-03 16:00:00 1.0
2014-01-04 00:00:00 1.0
Desired output (reset each day)
I can get the desired output like this but want to avoid looping:
series_list = []
for i in set(series.index.date):
    series_list.append(series.loc[str(i)].rolling('D', min_periods=1).min())
pd.concat(series_list).sort_index()
2014-01-01 00:00:00 5.0
2014-01-01 08:00:00 3.0
2014-01-01 16:00:00 3.0
2014-01-02 00:00:00 5.0
2014-01-02 08:00:00 1.0
2014-01-02 16:00:00 1.0
2014-01-03 00:00:00 1.0
2014-01-03 08:00:00 1.0
2014-01-03 16:00:00 1.0
2014-01-04 00:00:00 1.0
series.groupby(series.index.date).cummin()
Output:
2014-01-01 00:00:00 5
2014-01-01 08:00:00 3
2014-01-01 16:00:00 3
2014-01-02 00:00:00 5
2014-01-02 08:00:00 1
2014-01-02 16:00:00 1
2014-01-03 00:00:00 1
2014-01-03 08:00:00 1
2014-01-03 16:00:00 1
2014-01-04 00:00:00 1
Freq: 8H, dtype: int64
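To check that cummin really matches the loop from the question, the two can be compared directly. This is a small sketch, with the series values typed in by hand and the names pieces, looped and grouped chosen only for illustration:

```python
import pandas as pd

series = pd.Series(
    [5, 3, 5, 5, 1, 3, 1, 1, 5, 1],
    index=pd.date_range('2014-01-01', '2014-01-04', freq='8h'),
)

# The loop from the question: a rolling daily min computed separately for each day.
pieces = []
for day in sorted(set(series.index.date)):
    pieces.append(series.loc[str(day)].rolling('D', min_periods=1).min())
looped = pd.concat(pieces).sort_index()

# The groupby version: a running min that restarts with every new date.
grouped = series.groupby(series.index.date).cummin()

# rolling returns floats and cummin keeps ints, so compare the values only.
print((looped.to_numpy() == grouped.to_numpy()).all())
```

The two agree because, once the data is split by date, a one-day window never reaches past the rows of a single day, so the rolling min and the cumulative min coincide. For statistics that have no cum* method, groupby(...).expanding() may be a workable substitute.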

Pandas - Conditional resampling on MultiIndex based DataFrame based on a boolean column

I have a df with a MultiIndex [(latitude, longitude, time)] and 148 x 244 x 90 x 24 rows. For each latitude and longitude, the time is hourly from 2014-01-01 00:00:00 to 2014-03-31 23:00:00.
FFDI isInRange
latitude longitude time
-39.20000 140.80000 2014-01-01 00:00:00 6.20000 True
2014-01-01 01:00:00 4.10000 True
2014-01-01 02:00:00 2.40000 True
2014-01-01 03:00:00 1.90000 True
2014-01-01 04:00:00 1.70000 True
2014-01-01 05:00:00 1.50000 True
2014-01-01 06:00:00 1.40000 True
2014-01-01 07:00:00 1.30000 True
2014-01-01 08:00:00 1.20000 True
2014-01-01 09:00:00 1.00000 True
2014-01-01 10:00:00 1.00000 True
2014-01-01 11:00:00 0.90000 True
2014-01-01 12:00:00 0.90000 True
... ... ... ...
2014-03-31 21:00:00 0.30000 False
2014-03-31 22:00:00 0.30000 False
2014-03-31 23:00:00 0.50000 False
140.83786 2014-01-01 00:00:00 3.20000 True
2014-01-01 01:00:00 2.90000 True
2014-01-01 02:00:00 2.10000 True
2014-01-01 03:00:00 2.90000 True
2014-01-01 04:00:00 1.20000 True
2014-01-01 05:00:00 0.90000 True
2014-01-01 06:00:00 1.10000 True
2014-01-01 07:00:00 1.60000 True
2014-01-01 08:00:00 1.40000 True
2014-01-01 09:00:00 1.50000 True
2014-01-01 10:00:00 1.20000 True
2014-01-01 11:00:00 0.80000 True
2014-01-01 12:00:00 0.40000 True
... ... ... ...
2014-03-31 21:00:00 0.30000 False
2014-03-31 22:00:00 0.30000 False
2014-03-31 23:00:00 0.50000 False
... ... ... ...
... ... ...
-33.90000 140.80000 2014-01-01 00:00:00 6.20000 True
2014-01-01 01:00:00 4.10000 True
2014-01-01 02:00:00 2.40000 True
2014-01-01 03:00:00 1.90000 True
2014-01-01 04:00:00 1.70000 True
2014-01-01 05:00:00 1.50000 True
2014-01-01 06:00:00 1.40000 True
2014-01-01 07:00:00 1.30000 True
2014-01-01 08:00:00 1.20000 True
2014-01-01 09:00:00 1.00000 True
2014-01-01 10:00:00 1.00000 True
2014-01-01 11:00:00 0.90000 True
2014-01-01 12:00:00 0.90000 True
... ... ... ...
2014-03-31 21:00:00 0.30000 False
2014-03-31 22:00:00 0.30000 False
2014-03-31 23:00:00 0.50000 False
140.83786 2014-01-01 00:00:00 3.20000 True
2014-01-01 01:00:00 2.90000 True
2014-01-01 02:00:00 2.10000 True
2014-01-01 03:00:00 2.90000 True
2014-01-01 04:00:00 1.20000 True
2014-01-01 05:00:00 0.90000 True
2014-01-01 06:00:00 1.10000 True
2014-01-01 07:00:00 1.60000 True
2014-01-01 08:00:00 1.40000 True
2014-01-01 09:00:00 1.50000 True
2014-01-01 10:00:00 1.20000 True
2014-01-01 11:00:00 0.80000 True
2014-01-01 12:00:00 0.40000 True
... ... ... ...
2014-03-31 21:00:00 0.30000 False
2014-03-31 22:00:00 0.30000 False
2014-03-31 23:00:00 0.50000 False
78001920 rows × 1 columns
What I want to achieve is to calculate a daily maximum FFDI value for every 24 hours for each latitude and longitude on the condition of:
If isInRange = True for all 24 hours/rows in the group - use FFDI from 13:00:00 of previous day to 12:00:00 of next day
If isInRange = False for all 24 hours/rows in the group - use FFDI from 14:00:00 of previous day to 13:00:00 of next day
Then my code is:
df_daily_max = df.groupby(['latitude', 'longitude', pd.Grouper(freq='24H',base=13,loffset='11H',label='right',level='time')])['FFDI'].max().reset_index(name='Max FFDI') if df['isInRange'] else isInRange.groupby(['latitude', 'longitude', pd.Grouper(freq='24H',base=14,loffset='10H',label='right',level='time')])['FFDI'].max().reset_index(name='Max FFDI')
However this line raised an error:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
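The error occurs because df['isInRange'] is a whole Series, which cannot be used as a single condition in an if expression. Instead, filter the True rows and the False rows separately, aggregate max on each, then join the results with concat, sort the MultiIndex, and convert to a DataFrame with Series.reset_index: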
s1 = df[df['isInRange']].groupby(['latitude', 'longitude', pd.Grouper(freq='24H',base=13,loffset='11H',label='right',level='time')])['FFDI'].max()
s2 = df[~df['isInRange']].groupby(['latitude', 'longitude', pd.Grouper(freq='24H',base=14,loffset='10H',label='right',level='time')])['FFDI'].max()
df_daily_max = pd.concat([s1, s2]).sort_index().reset_index(name='Max FFDI')
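Note that base and loffset were deprecated in pandas 1.1 and removed in pandas 2.0, so the lines above only run on older versions. A rough equivalent for newer versions, assuming that offset plus a manual label shift is an acceptable replacement, might look like the sketch below. The helper daily_max, the tiny two-location frame, and its FFDI values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the question's frame: 1 latitude, 2 longitudes, 72 hourly rows each.
times = pd.date_range('2014-01-01', periods=72, freq='h')
index = pd.MultiIndex.from_product(
    [[-39.2], [140.8, 140.83786], times],
    names=['latitude', 'longitude', 'time'],
)
df = pd.DataFrame({'FFDI': np.tile(np.arange(72.0), 2)}, index=index)
# Made-up flag: True for the first longitude, False for the second.
df['isInRange'] = df.index.get_level_values('longitude') == 140.8


def daily_max(frame, start_hour):
    """Max FFDI per location in 24-hour bins that start at start_hour."""
    grouper = pd.Grouper(freq='24h', offset=f'{start_hour}h', label='right', level='time')
    out = frame.groupby(['latitude', 'longitude', grouper])['FFDI'].max()
    out = out.reset_index(name='Max FFDI')
    # Stand-in for loffset: push the right-edge label forward to the next midnight.
    out['time'] = out['time'] + pd.Timedelta(hours=24 - start_hour)
    return out


part_true = daily_max(df[df['isInRange']], start_hour=13)
part_false = daily_max(df[~df['isInRange']], start_hour=14)
df_daily_max = (
    pd.concat([part_true, part_false])
    .sort_values(['latitude', 'longitude', 'time'])
    .reset_index(drop=True)
)
print(df_daily_max)
```

Shifting the label by 24 - start_hour hours reproduces the 11H and 10H values that the original code passes to loffset.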

pandas calculate duration between datetime but not considering specific time range

For clarity, here is an MRE:
df = pd.DataFrame(
{"id":[1,2,3,4],
"start_time":["2020-06-01 01:00:00", "2020-06-01 01:00:00", "2020-06-01 19:00:00", "2020-06-02 04:00:00"],
"end_time":["2020-06-01 14:00:00", "2020-06-01 18:00:00", "2020-06-02 10:00:00", "2020-06-02 16:00:00"]
})
df["start_time"] = pd.to_datetime(df["start_time"])
df["end_time"] = pd.to_datetime(df["end_time"])
df["sub_time"] = df["end_time"] - df["start_time"]
this outputs:
id start_time end_time sub_time
0 1 2020-06-01 01:00:00 2020-06-01 14:00:00 13:00:00
1 2 2020-06-01 01:00:00 2020-06-01 18:00:00 17:00:00
2 3 2020-06-01 19:00:00 2020-06-02 10:00:00 15:00:00
3 4 2020-06-02 04:00:00 2020-06-02 16:00:00 12:00:00
But when the start_time ~ end_time interval overlaps the time range 00:00:00 ~ 03:59:59, I want to ignore that part (it should not be counted in sub_time).
So instead of the output above I would get:
id start_time end_time sub_time
0 1 2020-06-01 01:00:00 2020-06-01 14:00:00 10:00:00
1 2 2020-06-01 01:00:00 2020-06-01 18:00:00 14:00:00
2 3 2020-06-01 19:00:00 2020-06-02 10:00:00 11:00:00
3 4 2020-06-02 04:00:00 2020-06-02 16:00:00 12:00:00
Row 0: starting at 01:00:00, nothing is counted until 04:00:00; 04:00:00 ~ 14:00:00 is then a 10-hour period.
Row 2: count 19:00:00 ~ 24:00:00 and 04:00:00 ~ 10:00:00, which gives 11:00:00 in the sub_time column.
Any suggestions?
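One possible approach, sketched here rather than taken from the thread, is to measure how much of each interval falls inside the daily 00:00 to 04:00 window and subtract that from the raw duration. The helper excluded_time and its window_hours parameter are hypothetical names, and the sketch assumes end_time is never earlier than start_time.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "start_time": pd.to_datetime([
        "2020-06-01 01:00:00", "2020-06-01 01:00:00",
        "2020-06-01 19:00:00", "2020-06-02 04:00:00",
    ]),
    "end_time": pd.to_datetime([
        "2020-06-01 14:00:00", "2020-06-01 18:00:00",
        "2020-06-02 10:00:00", "2020-06-02 16:00:00",
    ]),
})


def excluded_time(start, end, window_hours=4):
    """Total overlap of [start, end) with the 00:00-to-window_hours window of each day."""
    total = pd.Timedelta(0)
    day = start.normalize()
    while day < end:
        # Clip the ignored window of this day to the interval itself.
        lo = max(start, day)
        hi = min(end, day + pd.Timedelta(hours=window_hours))
        if hi > lo:
            total += hi - lo
        day += pd.Timedelta(days=1)
    return total


excluded = pd.Series(
    [excluded_time(s, e) for s, e in zip(df["start_time"], df["end_time"])],
    index=df.index,
)
df["sub_time"] = df["end_time"] - df["start_time"] - excluded
print(df)
```

The loop runs once per calendar day touched by the interval, so it is simple but not vectorized; for very long intervals or very large frames a vectorized formulation would be worth considering.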

pandas dataframe new column which checks previous day

I have a DataFrame which has a DatetimeIndex and a column named "Holiday", which is a flag with value 1 or 0.
So if the timestamp in the index falls on a holiday, the Holiday column is 1, and otherwise 0.
I need a new column that says whether a given timestamp falls on the first day after a holiday. The new column should check whether the previous day has the "Holiday" flag set to 1 and, if so, set its own flag to 1, otherwise 0.
EDIT
Doing:
df['DayAfter'] = df.Holiday.shift(1).fillna(0)
Has the Output:
Holiday DayAfter AnyNumber
Datum
...
2014-01-01 20:00:00 1 1.0 9
2014-01-01 20:30:00 1 1.0 2
2014-01-01 21:00:00 1 1.0 3
2014-01-01 21:30:00 1 1.0 3
2014-01-01 22:00:00 1 1.0 6
2014-01-01 22:30:00 1 1.0 1
2014-01-01 23:00:00 1 1.0 1
2014-01-01 23:30:00 1 1.0 1
2014-01-02 00:00:00 0 1.0 1
2014-01-02 00:30:00 0 0.0 2
2014-01-02 01:00:00 0 0.0 1
2014-01-02 01:30:00 0 0.0 1
...
If you check the first timestamp for 2014-01-02, the DayAfter flag is set correctly. But the other flags for that day are 0, which is wrong.
Create an array of unique days that are holidays and offset them by one day
days = pd.Series(df[df.Holiday == 1].index).add(pd.DateOffset(1)).dt.date.unique()
Create a new column with the one day holiday offsets (days)
df['DayAfter'] = np.where(pd.Series(df.index).dt.date.isin(days),1,0)
Holiday AnyNumber DayAfter
Datum
2014-01-01 20:00:00 1 9 0
2014-01-01 20:30:00 1 2 0
2014-01-01 21:00:00 1 3 0
2014-01-01 21:30:00 1 3 0
2014-01-01 22:00:00 1 6 0
2014-01-01 22:30:00 1 1 0
2014-01-01 23:00:00 1 1 0
2014-01-01 23:30:00 1 1 0
2014-01-02 00:00:00 0 1 1
2014-01-02 00:30:00 0 2 1
2014-01-02 01:00:00 0 1 1
2014-01-02 01:30:00 0 1 1
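The same steps can be tried on a small self-contained frame. The sketch below is a variant that stays on the DatetimeIndex via normalize() instead of going through .dt.date; the frame, the names is_holiday and days_after, and the choice of 2014-01-01 as the only holiday are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical half-hourly frame: every row on 2014-01-01 is flagged as a holiday.
idx = pd.date_range('2014-01-01 20:00', '2014-01-02 01:30', freq='30min')
df = pd.DataFrame(
    {'Holiday': (idx.normalize() == pd.Timestamp('2014-01-01')).astype(int)},
    index=idx,
)

# Unique holiday dates (as midnight timestamps), moved forward by one day.
is_holiday = (df['Holiday'] == 1).to_numpy()
days_after = df.index[is_holiday].normalize().unique() + pd.Timedelta(days=1)

# Flag every row whose own date is one of those shifted dates.
df['DayAfter'] = df.index.normalize().isin(days_after).astype(int)
print(df)
```

As in the answer above, the second day of a two-day holiday would also be flagged, because it is the day after a holiday; whether that is wanted depends on the use case.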

Create pandas column from a different time series

I have an indicator stored in a dataframe with a 1-hour time series, and I'd like to bring it into a dataframe with a 1-minute time series. I would like the indicator to have a value at every index of the one-minute dataframe, instead of only once per hour.
I tried the following code:
import pandas as pd
import datetime
# Raw strings keep the backslashes in the Windows paths from being read as escape sequences
DAX_H1 = pd.read_csv(r"D:\Finance python\Data\DAX\GER30_H1_fixed.csv",index_col='date',error_bad_lines=False)
DAX_M1 = pd.read_csv(r"D:\Finance python\Data\DAX\GER30_m1_fixed.csv",index_col='date',error_bad_lines=False)
DAX_M1=DAX_M1.iloc[0:6000]
DAX_H1=DAX_H1.iloc[0:110]
DAX_H1["EWMA_8"]=DAX_H1["Open"].ewm(min_periods=8,com=3.5).mean()
DAX_H1.index =pd.to_datetime(DAX_H1.index)
DAX_M1.index =pd.to_datetime(DAX_M1.index)
DAX_M1["H1_EWMA_8"]=""
for i in DAX_M1.index:
    DAX_M1["H1_EWMA_8"][i] = DAX_H1["EWMA_8"][pd.Timestamp(datetime.datetime(i.year,i.month,i.day,i.hour))]
However, it doesn't seem to work, and even if it worked, I'd assume it would be very slow.
If I simply do the following:
DAX_M1["H1_EWMA_8"]=DAX_H1["EWMA_8"]
I don't have a value for "H1_EWMA_8" for each index:
2014-01-02 16:53:00 NaN
2014-01-02 16:54:00 NaN
2014-01-02 16:55:00 NaN
2014-01-02 16:56:00 NaN
2014-01-02 16:57:00 NaN
2014-01-02 16:58:00 NaN
2014-01-02 16:59:00 NaN
2014-01-02 17:00:00 9449.979026
2014-01-02 17:01:00 NaN
2014-01-02 17:02:00 NaN
Name: H1_EWMA_8, dtype: float64
Is there a simple way to replace the NaN value with the last available value of "H1_EWMA_8"?
Some part of DAX_M1 and DAX_H1 for illustration purposes:
DAX_M1:
Open High Low Close Total Ticks
date
2014-01-01 22:00:00 9597 9597 9597 9597 1
2014-01-02 07:05:00 9597 9619 9597 9618 18
2014-01-02 07:06:00 9618 9621 9617 9621 5
2014-01-02 07:07:00 9621 9623 9620 9623 6
2014-01-02 07:08:00 9623 9625 9623 9625 9
2014-01-02 07:09:00 9625 9625 9622 9623 6
2014-01-02 07:10:00 9623 9625 9622 9624 13
2014-01-02 07:11:00 9624 9626 9624 9626 8
2014-01-02 07:12:00 9626 9626 9623 9623 9
2014-01-02 07:13:00 9623 9625 9623 9625 5
DAX_H1:
Open High Low Close Total Ticks EWMA_8
date
2014-01-01 22:00:00 9597 9597 9597 9597 1 NaN
2014-01-02 07:00:00 9597 9626 9597 9607 322 NaN
2014-01-02 08:00:00 9607 9617 9510 9535 1730 NaN
2014-01-02 09:00:00 9535 9537 9465 9488 1428 NaN
2014-01-02 10:00:00 9488 9505 9478 9490 637 NaN
2014-01-02 11:00:00 9490 9512 9473 9496 817 NaN
2014-01-02 12:00:00 9496 9510 9495 9504 450 NaN
2014-01-02 13:00:00 9504 9514 9484 9484 547 9518.123073
2014-01-02 14:00:00 9484 9493 9424 9436 1497 9509.658500
Any help would be gladly appreciated!
Edit: This solution worked:
DAX_M1 = DAX_M1.fillna(method='ffill')
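On recent pandas versions, fillna(method='ffill') is deprecated (since pandas 2.1) in favor of ffill(). A minimal sketch of the same fix, applied to the indicator column only so that other columns are left untouched, could look like this. The frames h1 and m1 and their values are made up for illustration and stand in for DAX_H1 and DAX_M1.

```python
import pandas as pd

# Made-up hourly indicator values, only to illustrate the alignment.
h1 = pd.Series(
    [9518.1, 9509.7],
    index=pd.to_datetime(['2014-01-02 13:00', '2014-01-02 14:00']),
    name='EWMA_8',
)
# Hypothetical one-minute frame covering 13:00 through 14:29.
m1 = pd.DataFrame(index=pd.date_range('2014-01-02 13:00', periods=90, freq='min'))

# Plain assignment aligns on the index and leaves NaN between the hourly stamps...
m1['H1_EWMA_8'] = h1
# ...which ffill() then replaces with the last available hourly value.
m1['H1_EWMA_8'] = m1['H1_EWMA_8'].ffill()
print(m1.iloc[[0, 59, 60, 89]])
```

Minutes that come before the first hourly stamp would stay NaN, since there is no earlier value to carry forward.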
