Replace loop with groupby for rolling daily calculation - python

How do I modify my code so that the pandas daily rolling window resets at the start of each day? Please see the desired output below, as it shows exactly what I am trying to achieve.
I think I may need to use groupby to get the same result, but I am unsure how to proceed.
Data
np.random.seed(5)
series = pd.Series(np.random.choice([1,3,5], 10), index = pd.date_range('2014-01-01', '2014-01-04', freq = '8h'))
series
2014-01-01 00:00:00 5
2014-01-01 08:00:00 3
2014-01-01 16:00:00 5
2014-01-02 00:00:00 5
2014-01-02 08:00:00 1
2014-01-02 16:00:00 3
2014-01-03 00:00:00 1
2014-01-03 08:00:00 1
2014-01-03 16:00:00 5
2014-01-04 00:00:00 1
Output after pandas rolling
series.rolling('D', min_periods=1).min()
2014-01-01 00:00:00 5.0
2014-01-01 08:00:00 3.0
2014-01-01 16:00:00 3.0
2014-01-02 00:00:00 3.0
2014-01-02 08:00:00 1.0
2014-01-02 16:00:00 1.0
2014-01-03 00:00:00 1.0
2014-01-03 08:00:00 1.0
2014-01-03 16:00:00 1.0
2014-01-04 00:00:00 1.0
Desired output (reset each day)
I can get the desired output like this but want to avoid looping:
series_list = []
for i in set(series.index.date):
    series_list.append(series.loc[str(i)].rolling('D', min_periods=1).min())
pd.concat(series_list).sort_index()
2014-01-01 00:00:00 5.0
2014-01-01 08:00:00 3.0
2014-01-01 16:00:00 3.0
2014-01-02 00:00:00 5.0
2014-01-02 08:00:00 1.0
2014-01-02 16:00:00 1.0
2014-01-03 00:00:00 1.0
2014-01-03 08:00:00 1.0
2014-01-03 16:00:00 1.0
2014-01-04 00:00:00 1.0

series.groupby(series.index.date).cummin()
Output:
2014-01-01 00:00:00 5
2014-01-01 08:00:00 3
2014-01-01 16:00:00 3
2014-01-02 00:00:00 5
2014-01-02 08:00:00 1
2014-01-02 16:00:00 1
2014-01-03 00:00:00 1
2014-01-03 08:00:00 1
2014-01-03 16:00:00 1
2014-01-04 00:00:00 1
Freq: 8H, dtype: int64
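Why this works: within a single calendar day, a 1-day rolling minimum with min_periods=1 never looks back past that day's first row, so it reduces to a running (cumulative) minimum. The following is a minimal sketch that checks this against the loop from the question. It hard-codes the ten values shown above instead of relying on the random seed, and the names looped and grouped are only illustrative.

```python
import pandas as pd

# Same index as in the question; values copied from the printed series
idx = pd.date_range("2014-01-01", "2014-01-04", freq="8h")
series = pd.Series([5, 3, 5, 5, 1, 3, 1, 1, 5, 1], index=idx)

# Loop-style version: a separate 1-day rolling min for each calendar day
looped = pd.concat(
    [grp.rolling("D", min_periods=1).min()
     for _, grp in series.groupby(series.index.date)]
).sort_index()

# Loop-free version: cumulative min that restarts with each calendar day
grouped = series.groupby(series.index.date).cummin()

# The two agree value for value (looped is float64, grouped stays int64)
print(grouped.eq(looped).all())
```

One difference to be aware of: cummin keeps the original integer dtype, while rolling always returns floats.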

Related

Pandas groupby and get yesterday min value

How do I modify my code so that groupby returns the previous day's min instead of the current day's min? Please see the desired output below, as it shows exactly what I am trying to achieve.
Data
np.random.seed(5)
series = pd.Series(np.random.choice([1,3,5], 10), index = pd.date_range('2014-01-01', '2014-01-04', freq = '8h'))
series
2014-01-01 00:00:00 5
2014-01-01 08:00:00 3
2014-01-01 16:00:00 5
2014-01-02 00:00:00 5
2014-01-02 08:00:00 1
2014-01-02 16:00:00 3
2014-01-03 00:00:00 1
2014-01-03 08:00:00 1
2014-01-03 16:00:00 5
2014-01-04 00:00:00 1
Output after groupby
series.groupby(series.index.date).transform(min)
2014-01-01 00:00:00 3
2014-01-01 08:00:00 3
2014-01-01 16:00:00 3
2014-01-02 00:00:00 1
2014-01-02 08:00:00 1
2014-01-02 16:00:00 1
2014-01-03 00:00:00 1
2014-01-03 08:00:00 1
2014-01-03 16:00:00 1
2014-01-04 00:00:00 1
Desired output (yesterday min)
2014-01-01 00:00:00 NaN
2014-01-01 08:00:00 NaN
2014-01-01 16:00:00 NaN
2014-01-02 00:00:00 3
2014-01-02 08:00:00 3
2014-01-02 16:00:00 3
2014-01-03 00:00:00 1
2014-01-03 08:00:00 1
2014-01-03 16:00:00 1
2014-01-04 00:00:00 1
You can swap the index to just the date, calculate min per day, shift it and swap the original index back:
# Swap the index to just the date component
s = series.set_axis(series.index.date)
# Calculate the min per day, and shift it
t = s.groupby(level=0).min().shift()
# Final assembly
s[t.index] = t
s.index = series.index
Alternatively, compute the daily min, shift it by one day, and reindex it back onto the original dates:
series[:] = series.groupby(series.index.date).min().shift().reindex(series.index.date)
series
Out[370]:
2014-01-01 00:00:00 NaN
2014-01-01 08:00:00 NaN
2014-01-01 16:00:00 NaN
2014-01-02 00:00:00 3.0
2014-01-02 08:00:00 3.0
2014-01-02 16:00:00 3.0
2014-01-03 00:00:00 1.0
2014-01-03 08:00:00 1.0
2014-01-03 16:00:00 1.0
2014-01-04 00:00:00 1.0
Freq: 8H, dtype: float64
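If you would rather not overwrite the original series, here is a rough sketch of the same idea that builds a new one. It hard-codes the values printed in the question, the names days, daily_min, prev_min and result are made up for illustration, and the positional shift assumes every calendar day is present in the data (otherwise "previous row" is not necessarily "yesterday").

```python
import pandas as pd

# Values copied from the question's printed series
idx = pd.date_range("2014-01-01", "2014-01-04", freq="8h")
series = pd.Series([5, 3, 5, 5, 1, 3, 1, 1, 5, 1], index=idx)

# Midnight timestamp of each row, used as the grouping key
days = series.index.normalize()

# One min per day, then move each value down to the following day
daily_min = series.groupby(days).min()
prev_min = daily_min.shift()  # assumption: no missing days

# Broadcast the per-day values back to the original 8-hourly index
result = pd.Series(prev_min.reindex(days).to_numpy(), index=series.index)
print(result)
```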

pandas calculate duration between datetime but not considering specific time range

For clarity here is MRE:
df = pd.DataFrame(
{"id":[1,2,3,4],
"start_time":["2020-06-01 01:00:00", "2020-06-01 01:00:00", "2020-06-01 19:00:00", "2020-06-02 04:00:00"],
"end_time":["2020-06-01 14:00:00", "2020-06-01 18:00:00", "2020-06-02 10:00:00", "2020-06-02 16:00:00"]
})
df["start_time"] = pd.to_datetime(df["start_time"])
df["end_time"] = pd.to_datetime(df["end_time"])
df["sub_time"] = df["end_time"] - df["start_time"]
this outputs:
id start_time end_time sub_time
0 1 2020-06-01 01:00:00 2020-06-01 14:00:00 13:00:00
1 2 2020-06-01 01:00:00 2020-06-01 18:00:00 17:00:00
2 3 2020-06-01 19:00:00 2020-06-02 10:00:00 15:00:00
3 4 2020-06-02 04:00:00 2020-06-02 16:00:00 12:00:00
But when the interval from start_time to end_time overlaps the time range 00:00:00 to 03:59:59, I want to ignore that part (it should not be counted in sub_time).
So instead of output above I would get:
id start_time end_time sub_time
0 1 2020-06-01 01:00:00 2020-06-01 14:00:00 10:00:00
1 2 2020-06-01 01:00:00 2020-06-01 18:00:00 14:00:00
2 3 2020-06-01 19:00:00 2020-06-02 10:00:00 11:00:00
3 4 2020-06-02 04:00:00 2020-06-02 16:00:00 12:00:00
row 0: the interval starts at 01:00:00, so nothing is counted until 04:00:00; then 04:00:00 ~ 14:00:00 is a 10-hour period.
row 2: count the duration from 19:00:00 ~ 24:00:00 and from 04:00:00 ~ 10:00:00, which gives 11:00:00 in the sub_time column.
Any suggestions?
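One possible approach, sketched under the assumption that the ignored window is always 00:00 to 04:00 on every day: for any timestamp, the part of the window already elapsed that day is min(time since midnight, 4 hours). The ignored time inside an interval is then 4 hours per midnight crossed, plus the elapsed part at the end, minus the elapsed part at the start. WINDOW, ignored_so_far, day_gap and ignored are illustrative names, not anything from pandas.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "start_time": pd.to_datetime(["2020-06-01 01:00:00", "2020-06-01 01:00:00",
                                  "2020-06-01 19:00:00", "2020-06-02 04:00:00"]),
    "end_time": pd.to_datetime(["2020-06-01 14:00:00", "2020-06-01 18:00:00",
                                "2020-06-02 10:00:00", "2020-06-02 16:00:00"]),
})

# Assumption: the ignored window runs from midnight to 04:00 every day
WINDOW = pd.Timedelta(hours=4)

def ignored_so_far(ts):
    """Portion of the current day's ignored window that has elapsed at ts."""
    since_midnight = ts - ts.dt.normalize()
    return since_midnight.where(since_midnight < WINDOW, WINDOW)

# Number of midnights crossed between start and end
day_gap = (df["end_time"].dt.normalize() - df["start_time"].dt.normalize()).dt.days

# Total ignored time falling inside each interval
ignored = (day_gap * WINDOW
           + ignored_so_far(df["end_time"])
           - ignored_so_far(df["start_time"]))

df["sub_time"] = (df["end_time"] - df["start_time"]) - ignored
print(df)
```

This stays vectorized, so it should scale to long frames without a Python-level loop over rows.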

How do I display a subset of a pandas dataframe?

I have a dataframe df that contains a datetime for every hour of every day from 2003-02-12 to 2017-06-30, and I want to delete all datetimes between 24th Dec and 1st Jan of EVERY year.
An extract of my data frame is:
...
7505,2003-12-23 17:00:00
7506,2003-12-23 18:00:00
7507,2003-12-23 19:00:00
7508,2003-12-23 20:00:00
7509,2003-12-23 21:00:00
7510,2003-12-23 22:00:00
7511,2003-12-23 23:00:00
7512,2003-12-24 00:00:00
7513,2003-12-24 01:00:00
7514,2003-12-24 02:00:00
7515,2003-12-24 03:00:00
7516,2003-12-24 04:00:00
7517,2003-12-24 05:00:00
7518,2003-12-24 06:00:00
...
7723,2004-01-01 19:00:00
7724,2004-01-01 20:00:00
7725,2004-01-01 21:00:00
7726,2004-01-01 22:00:00
7727,2004-01-01 23:00:00
7728,2004-01-02 00:00:00
7729,2004-01-02 01:00:00
7730,2004-01-02 02:00:00
7731,2004-01-02 03:00:00
7732,2004-01-02 04:00:00
7733,2004-01-02 05:00:00
7734,2004-01-02 06:00:00
7735,2004-01-02 07:00:00
...
and my expected output is:
...
7505,2003-12-23 17:00:00
7506,2003-12-23 18:00:00
7507,2003-12-23 19:00:00
7508,2003-12-23 20:00:00
7509,2003-12-23 21:00:00
7510,2003-12-23 22:00:00
7511,2003-12-23 23:00:00
...
7728,2004-01-02 00:00:00
7729,2004-01-02 01:00:00
7730,2004-01-02 02:00:00
7731,2004-01-02 03:00:00
7732,2004-01-02 04:00:00
7733,2004-01-02 05:00:00
7734,2004-01-02 06:00:00
7735,2004-01-02 07:00:00
...
Sample dataframe:
dates
0 2003-12-23 23:00:00
1 2003-12-24 05:00:00
2 2004-12-27 05:00:00
3 2003-12-13 23:00:00
4 2002-12-23 23:00:00
5 2004-01-01 05:00:00
6 2014-12-24 05:00:00
Solution:
If you want to exclude these dates for every year, extract the month and the day first:
df['month'] = df['dates'].dt.month
df['day'] = df['dates'].dt.day
Then apply the condition:
dec_days = [24, 25, 26, 27, 28, 29, 30, 31]
## if the month is dec, then check for these dates
## if the month is jan, then just check for the day to be 1 like below
df = df[~(((df.month == 12) & (df.day.isin(dec_days))) | ((df.month == 1) & (df.day == 1)))]
Sample output:
dates month day
0 2003-12-23 23:00:00 12 23
3 2003-12-13 23:00:00 12 13
4 2002-12-23 23:00:00 12 23
This takes advantage of the fact that datetime-strings in the form mm-dd are sortable. Read everything in from the CSV file then filter for the dates you want:
df = pd.read_csv('...', parse_dates=['DateTime'])
s = df['DateTime'].dt.strftime('%m-%d')
excluded = (s == '01-01') | (s >= '12-24') # Jan 1 or >= Dec 24
df[~excluded]
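As a quick check, here is the same filter applied to the sample dataframe from the earlier answer instead of a CSV file. This is only a sketch: the column name dates comes from that sample, and month_day and kept are names chosen here for illustration.

```python
import pandas as pd

# Sample dates taken from the "Sample dataframe" shown above
df = pd.DataFrame({"dates": pd.to_datetime([
    "2003-12-23 23:00:00", "2003-12-24 05:00:00", "2004-12-27 05:00:00",
    "2003-12-13 23:00:00", "2002-12-23 23:00:00", "2004-01-01 05:00:00",
    "2014-12-24 05:00:00",
])})

# Zero-padded "mm-dd" strings compare in calendar order
month_day = df["dates"].dt.strftime("%m-%d")

# Jan 1, or Dec 24 and later, in any year
excluded = (month_day == "01-01") | (month_day >= "12-24")

kept = df[~excluded]
print(kept)
```

Only rows 0, 3 and 4 survive, which matches the sample output of the month/day answer.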
You can try dropping rows based on a condition, for example by pattern-matching the date string or by parsing the date as a number (like in Java) and removing the rows that match.
datesIdontLike = df[df['colname'] == <stringPattern>].index
# drop returns a new dataframe; with inplace=True it would return None
newDF = df.drop(datesIdontLike)
Check this out: https://thispointer.com/python-pandas-how-to-drop-rows-in-dataframe-by-conditions-on-column-values/
(If you have issues, let me know.)
You can use pandas and boolean filtering with strftime
# version 0.23.4
import pandas as pd
# make df
df = pd.DataFrame(pd.date_range('20181223', '20190103', freq='H'), columns=['date'])
# string format the date to only include the month and day
# then set it strictly less than '12-24' AND greater than or equal to `01-02`
df = df.loc[
(df.date.dt.strftime('%m-%d') < '12-24') &
(df.date.dt.strftime('%m-%d') >= '01-02')
].copy()
print(df)
date
0 2018-12-23 00:00:00
1 2018-12-23 01:00:00
2 2018-12-23 02:00:00
3 2018-12-23 03:00:00
4 2018-12-23 04:00:00
5 2018-12-23 05:00:00
6 2018-12-23 06:00:00
7 2018-12-23 07:00:00
8 2018-12-23 08:00:00
9 2018-12-23 09:00:00
10 2018-12-23 10:00:00
11 2018-12-23 11:00:00
12 2018-12-23 12:00:00
13 2018-12-23 13:00:00
14 2018-12-23 14:00:00
15 2018-12-23 15:00:00
16 2018-12-23 16:00:00
17 2018-12-23 17:00:00
18 2018-12-23 18:00:00
19 2018-12-23 19:00:00
20 2018-12-23 20:00:00
21 2018-12-23 21:00:00
22 2018-12-23 22:00:00
23 2018-12-23 23:00:00
240 2019-01-02 00:00:00
241 2019-01-02 01:00:00
242 2019-01-02 02:00:00
243 2019-01-02 03:00:00
244 2019-01-02 04:00:00
245 2019-01-02 05:00:00
246 2019-01-02 06:00:00
247 2019-01-02 07:00:00
248 2019-01-02 08:00:00
249 2019-01-02 09:00:00
250 2019-01-02 10:00:00
251 2019-01-02 11:00:00
252 2019-01-02 12:00:00
253 2019-01-02 13:00:00
254 2019-01-02 14:00:00
255 2019-01-02 15:00:00
256 2019-01-02 16:00:00
257 2019-01-02 17:00:00
258 2019-01-02 18:00:00
259 2019-01-02 19:00:00
260 2019-01-02 20:00:00
261 2019-01-02 21:00:00
262 2019-01-02 22:00:00
263 2019-01-02 23:00:00
264 2019-01-03 00:00:00
This will work with multiple years because we are only filtering on the month and day.
# change range to include 2017
df = pd.DataFrame(pd.date_range('20171223', '20190103', freq='H'), columns=['date'])
df = df.loc[
(df.date.dt.strftime('%m-%d') < '12-24') &
(df.date.dt.strftime('%m-%d') >= '01-02')
].copy()
print(df)
date
0 2017-12-23 00:00:00
1 2017-12-23 01:00:00
2 2017-12-23 02:00:00
3 2017-12-23 03:00:00
4 2017-12-23 04:00:00
5 2017-12-23 05:00:00
6 2017-12-23 06:00:00
7 2017-12-23 07:00:00
8 2017-12-23 08:00:00
9 2017-12-23 09:00:00
10 2017-12-23 10:00:00
11 2017-12-23 11:00:00
12 2017-12-23 12:00:00
13 2017-12-23 13:00:00
14 2017-12-23 14:00:00
15 2017-12-23 15:00:00
16 2017-12-23 16:00:00
17 2017-12-23 17:00:00
18 2017-12-23 18:00:00
19 2017-12-23 19:00:00
20 2017-12-23 20:00:00
21 2017-12-23 21:00:00
22 2017-12-23 22:00:00
23 2017-12-23 23:00:00
240 2018-01-02 00:00:00
241 2018-01-02 01:00:00
242 2018-01-02 02:00:00
243 2018-01-02 03:00:00
244 2018-01-02 04:00:00
245 2018-01-02 05:00:00
... ...
8779 2018-12-23 19:00:00
8780 2018-12-23 20:00:00
8781 2018-12-23 21:00:00
8782 2018-12-23 22:00:00
8783 2018-12-23 23:00:00
9000 2019-01-02 00:00:00
9001 2019-01-02 01:00:00
9002 2019-01-02 02:00:00
9003 2019-01-02 03:00:00
9004 2019-01-02 04:00:00
9005 2019-01-02 05:00:00
9006 2019-01-02 06:00:00
9007 2019-01-02 07:00:00
9008 2019-01-02 08:00:00
9009 2019-01-02 09:00:00
9010 2019-01-02 10:00:00
9011 2019-01-02 11:00:00
9012 2019-01-02 12:00:00
9013 2019-01-02 13:00:00
9014 2019-01-02 14:00:00
9015 2019-01-02 15:00:00
9016 2019-01-02 16:00:00
9017 2019-01-02 17:00:00
9018 2019-01-02 18:00:00
9019 2019-01-02 19:00:00
9020 2019-01-02 20:00:00
9021 2019-01-02 21:00:00
9022 2019-01-02 22:00:00
9023 2019-01-02 23:00:00
9024 2019-01-03 00:00:00
Since you want this to happen for every year, we can first define a series in which the year is replaced by a static value (2000, for example). Let date be the column that stores the date; we can generate such a series as:
dt = pd.to_datetime({'year': 2000, 'month': df['date'].dt.month, 'day': df['date'].dt.day})
For the given sample data, we get:
>>> dt
0 2000-12-23
1 2000-12-23
2 2000-12-23
3 2000-12-23
4 2000-12-23
5 2000-12-23
6 2000-12-23
7 2000-12-24
8 2000-12-24
9 2000-12-24
10 2000-12-24
11 2000-12-24
12 2000-12-24
13 2000-12-24
14 2000-01-01
15 2000-01-01
16 2000-01-01
17 2000-01-01
18 2000-01-01
19 2000-01-02
20 2000-01-02
21 2000-01-02
22 2000-01-02
23 2000-01-02
24 2000-01-02
25 2000-01-02
26 2000-01-02
dtype: datetime64[ns]
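Next we can filter the rows. The bounds are written as pd.Timestamp values here, because recent pandas versions may reject comparisons between datetime64 values and datetime.date objects:
df[(dt >= pd.Timestamp(2000, 1, 2)) & (dt < pd.Timestamp(2000, 12, 24))]
This gives us the following data for your sample data:
>>> df[(dt >= pd.Timestamp(2000, 1, 2)) & (dt < pd.Timestamp(2000, 12, 24))]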
id dt
0 7505 2003-12-23 17:00:00
1 7506 2003-12-23 18:00:00
2 7507 2003-12-23 19:00:00
3 7508 2003-12-23 20:00:00
4 7509 2003-12-23 21:00:00
5 7510 2003-12-23 22:00:00
6 7511 2003-12-23 23:00:00
19 7728 2004-01-02 00:00:00
20 7729 2004-01-02 01:00:00
21 7730 2004-01-02 02:00:00
22 7731 2004-01-02 03:00:00
23 7732 2004-01-02 04:00:00
24 7733 2004-01-02 05:00:00
25 7734 2004-01-02 06:00:00
26 7735 2004-01-02 07:00:00
So regardless of what the year is, we will only consider dates between the 2nd of January and the 23rd of December (both inclusive).

pandas dataframe new column which checks previous day

I have a DataFrame with a DatetimeIndex and a column named "Holiday", which is a flag with the value 1 or 0.
If the date of an index entry is a holiday, the Holiday column contains 1, otherwise 0.
I need a new column that says whether a given index entry falls on the first day after a holiday. The new column should just check whether the previous day has the "Holiday" flag set to 1 and, if so, set its own flag to 1, otherwise 0.
EDIT
Doing:
df['DayAfter'] = df.Holiday.shift(1).fillna(0)
Has the Output:
Holiday DayAfter AnyNumber
Datum
...
2014-01-01 20:00:00 1 1.0 9
2014-01-01 20:30:00 1 1.0 2
2014-01-01 21:00:00 1 1.0 3
2014-01-01 21:30:00 1 1.0 3
2014-01-01 22:00:00 1 1.0 6
2014-01-01 22:30:00 1 1.0 1
2014-01-01 23:00:00 1 1.0 1
2014-01-01 23:30:00 1 1.0 1
2014-01-02 00:00:00 0 1.0 1
2014-01-02 00:30:00 0 0.0 2
2014-01-02 01:00:00 0 0.0 1
2014-01-02 01:30:00 0 0.0 1
...
If you check the first timestamp for 2014-01-02, the DayAfter flag is set correctly. But the other flags for that day are 0, which is wrong.
Create an array of unique days that are holidays and offset them by one day
days = pd.Series(df[df.Holiday == 1].index).add(pd.DateOffset(1)).dt.date.unique()
Create a new column with the one day holiday offsets (days)
df['DayAfter'] = np.where(pd.Series(df.index).dt.date.isin(days),1,0)
Holiday AnyNumber DayAfter
Datum
2014-01-01 20:00:00 1 9 0
2014-01-01 20:30:00 1 2 0
2014-01-01 21:00:00 1 3 0
2014-01-01 21:30:00 1 3 0
2014-01-01 22:00:00 1 6 0
2014-01-01 22:30:00 1 1 0
2014-01-01 23:00:00 1 1 0
2014-01-01 23:30:00 1 1 0
2014-01-02 00:00:00 0 1 1
2014-01-02 00:30:00 0 2 1
2014-01-02 01:00:00 0 1 1
2014-01-02 01:30:00 0 1 1
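The same idea can also be written directly against the index, without wrapping it in pd.Series. This is a small self-contained sketch on made-up half-hourly data around the 2014-01-01 holiday; the names holiday_days and day_after are illustrative, and the column names follow the question.

```python
import pandas as pd

# Toy half-hourly frame: four rows on the holiday, four on the next day
idx = pd.date_range("2014-01-01 22:00", "2014-01-02 01:30", freq="30min")
is_holiday = (idx.normalize() == pd.Timestamp("2014-01-01")).astype(int)
df = pd.DataFrame({"Holiday": is_holiday}, index=idx)

# Unique calendar days (as midnight timestamps) flagged as holidays
holiday_days = df.index[(df["Holiday"] == 1).to_numpy()].normalize().unique()

# The days that directly follow a holiday
day_after = holiday_days + pd.Timedelta(days=1)

# Flag every row whose calendar day is one of those days
df["DayAfter"] = df.index.normalize().isin(day_after).astype(int)
print(df)
```

Because the comparison is done on whole days, every timestamp on 2014-01-02 gets the flag, not just the first one.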

Create pandas column from different time serie

I have an indicator stored in a dataframe with a 1-hour time series, and I'd like to copy it into a dataframe with a 1-minute time series. I would like this indicator to have a value for every index of the one-minute dataframe, instead of only one value per hour.
I tried the following code:
import pandas as pd
import datetime
DAX_H1 = pd.read_csv("D:\Finance python\Data\DAX\GER30_H1_fixed.csv",index_col='date',error_bad_lines=False)
DAX_M1 = pd.read_csv("D:\Finance python\Data\DAX\GER30_m1_fixed.csv",index_col='date',error_bad_lines=False)
DAX_M1=DAX_M1.iloc[0:6000]
DAX_H1=DAX_H1.iloc[0:110]
DAX_H1["EWMA_8"]=DAX_H1["Open"].ewm(min_periods=8,com=3.5).mean()
DAX_H1.index =pd.to_datetime(DAX_H1.index)
DAX_M1.index =pd.to_datetime(DAX_M1.index)
DAX_M1["H1_EWMA_8"]=""
for i in DAX_M1.index:
    DAX_M1["H1_EWMA_8"][i] = DAX_H1["EWMA_8"][pd.Timestamp(datetime.datetime(i.year,i.month,i.day,i.hour))]
However, it doesn't seem to work, and even if it worked, I'd assume it would be very slow.
If I simply do the following:
DAX_M1["H1_EWMA_8"]=DAX_H1["EWMA_8"]
I don't have a value for "H1_EWMA_8" for each index:
2014-01-02 16:53:00 NaN
2014-01-02 16:54:00 NaN
2014-01-02 16:55:00 NaN
2014-01-02 16:56:00 NaN
2014-01-02 16:57:00 NaN
2014-01-02 16:58:00 NaN
2014-01-02 16:59:00 NaN
2014-01-02 17:00:00 9449.979026
2014-01-02 17:01:00 NaN
2014-01-02 17:02:00 NaN
Name: H1_EWMA_8, dtype: float64
Is there a simple way to replace the NaN value with the last available value of "H1_EWMA_8"?
Some part of DAX_M1 and DAX_H1 for illustration purposes:
DAX_M1:
Open High Low Close Total Ticks
date
2014-01-01 22:00:00 9597 9597 9597 9597 1
2014-01-02 07:05:00 9597 9619 9597 9618 18
2014-01-02 07:06:00 9618 9621 9617 9621 5
2014-01-02 07:07:00 9621 9623 9620 9623 6
2014-01-02 07:08:00 9623 9625 9623 9625 9
2014-01-02 07:09:00 9625 9625 9622 9623 6
2014-01-02 07:10:00 9623 9625 9622 9624 13
2014-01-02 07:11:00 9624 9626 9624 9626 8
2014-01-02 07:12:00 9626 9626 9623 9623 9
2014-01-02 07:13:00 9623 9625 9623 9625 5
DAX_H1:
Open High Low Close Total Ticks EWMA_8
date
2014-01-01 22:00:00 9597 9597 9597 9597 1 NaN
2014-01-02 07:00:00 9597 9626 9597 9607 322 NaN
2014-01-02 08:00:00 9607 9617 9510 9535 1730 NaN
2014-01-02 09:00:00 9535 9537 9465 9488 1428 NaN
2014-01-02 10:00:00 9488 9505 9478 9490 637 NaN
2014-01-02 11:00:00 9490 9512 9473 9496 817 NaN
2014-01-02 12:00:00 9496 9510 9495 9504 450 NaN
2014-01-02 13:00:00 9504 9514 9484 9484 547 9518.123073
2014-01-02 14:00:00 9484 9493 9424 9436 1497 9509.658500
Any help would be gladly appreciated!
Edit: This solution worked:
DAX_M1 = DAX_M1.fillna(method='ffill')
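Forward-filling the whole frame also fills any gaps in the other columns. If only the indicator should be carried forward, one option is to reindex the hourly series onto the minute index with method="ffill". Below is a minimal sketch on toy data: h1 and m1 are stand-ins for DAX_H1 and DAX_M1, and the two EWMA values are the ones shown in the DAX_H1 extract above.

```python
import pandas as pd

# Hourly frame with the indicator (two rows from the DAX_H1 extract)
h1 = pd.DataFrame(
    {"EWMA_8": [9518.123073, 9509.658500]},
    index=pd.to_datetime(["2014-01-02 13:00:00", "2014-01-02 14:00:00"]),
)

# Minute frame covering the same two hours (toy stand-in for DAX_M1)
m1 = pd.DataFrame(
    index=pd.date_range("2014-01-02 13:00", "2014-01-02 14:59", freq="min")
)

# For each minute, take the most recent hourly value at or before it;
# this requires the hourly index to be sorted
m1["H1_EWMA_8"] = h1["EWMA_8"].reindex(m1.index, method="ffill")

print(m1.loc["2014-01-02 13:58":"2014-01-02 14:01"])
```

Minutes that come before the first hourly row would stay NaN, which is the same behaviour as a plain forward fill.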
