I have a pandas Series, with dates as the index and one value per date, and lots of rows like:
r =
1-10-2010 3.4
1-11-2010 4.5
1-12-2010 3.7
... ...
What I'd like to do is remove the days of the week that are not in a custom week. So, to remove Fridays and Saturdays, I'd do something like this:
r = amazingfunction(r, ('Sun', 'Mon', 'Tue', 'Wed', 'Thu'))
r =
1-10-2010 3.4
1-11-2010 4.5
1-12-2010 3.7
1-13-2010 3.4
1-14-2010 4.1
1-17-2010 4.5
1-18-2010 3.7
... ...
How can I go about this?
You can use dt.dayofweek and isin to filter the df. Here Friday and Saturday are 4 and 5 respectively, and we negate the boolean mask using ~:
In [12]:
df = pd.DataFrame({'dates':pd.date_range(dt.datetime(2015,1,1), dt.datetime(2015,2,1))})
df['dayofweek'] = df['dates'].dt.dayofweek
df
Out[12]:
dates dayofweek
0 2015-01-01 3
1 2015-01-02 4
2 2015-01-03 5
3 2015-01-04 6
4 2015-01-05 0
5 2015-01-06 1
6 2015-01-07 2
7 2015-01-08 3
8 2015-01-09 4
9 2015-01-10 5
10 2015-01-11 6
11 2015-01-12 0
12 2015-01-13 1
13 2015-01-14 2
14 2015-01-15 3
15 2015-01-16 4
16 2015-01-17 5
17 2015-01-18 6
18 2015-01-19 0
19 2015-01-20 1
20 2015-01-21 2
21 2015-01-22 3
22 2015-01-23 4
23 2015-01-24 5
24 2015-01-25 6
25 2015-01-26 0
26 2015-01-27 1
27 2015-01-28 2
28 2015-01-29 3
29 2015-01-30 4
30 2015-01-31 5
31 2015-02-01 6
In [13]:
df[~df['dates'].dt.dayofweek.isin([4,5])]
Out[13]:
dates dayofweek
0 2015-01-01 3
3 2015-01-04 6
4 2015-01-05 0
5 2015-01-06 1
6 2015-01-07 2
7 2015-01-08 3
10 2015-01-11 6
11 2015-01-12 0
12 2015-01-13 1
13 2015-01-14 2
14 2015-01-15 3
17 2015-01-18 6
18 2015-01-19 0
19 2015-01-20 1
20 2015-01-21 2
21 2015-01-22 3
24 2015-01-25 6
25 2015-01-26 0
26 2015-01-27 1
27 2015-01-28 2
28 2015-01-29 3
31 2015-02-01 6
EDIT
As your data is a Series, your dates are the index, so the following should work:
r[~r.index.dayofweek.isin([4,5])]
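If you want the day-name interface from the original question, a minimal sketch could look like the following (keep_weekdays is a hypothetical name, and it assumes r has a DatetimeIndex):
def keep_weekdays(s, days=('Sun', 'Mon', 'Tue', 'Wed', 'Thu')):
    # pandas numbers weekdays Monday=0 .. Sunday=6
    day_nums = {'Mon': 0, 'Tue': 1, 'Wed': 2, 'Thu': 3, 'Fri': 4, 'Sat': 5, 'Sun': 6}
    return s[s.index.dayofweek.isin([day_nums[d] for d in days])]

r = keep_weekdays(r, ('Sun', 'Mon', 'Tue', 'Wed', 'Thu'))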
I have a question about comparing datetime64[ns] data with a date like '2017-01-01'.
Here is the code:
df.loc[(df['Date'] >= datetime.date(2017.1.1), 'TimeRange'] = '2017.1'
but an error was raised saying: descriptor 'date' requires a 'datetime.datetime' object but received a 'int'.
How can I compare a datetime64 with a date (2017-01-01 or 2017-6-1 and the like)?
Thanks
Demo:
Source DF:
In [83]: df = pd.DataFrame({'tm':pd.date_range('2000-01-01', freq='9999T', periods=20)})
In [84]: df
Out[84]:
tm
0 2000-01-01 00:00:00
1 2000-01-07 22:39:00
2 2000-01-14 21:18:00
3 2000-01-21 19:57:00
4 2000-01-28 18:36:00
5 2000-02-04 17:15:00
6 2000-02-11 15:54:00
7 2000-02-18 14:33:00
8 2000-02-25 13:12:00
9 2000-03-03 11:51:00
10 2000-03-10 10:30:00
11 2000-03-17 09:09:00
12 2000-03-24 07:48:00
13 2000-03-31 06:27:00
14 2000-04-07 05:06:00
15 2000-04-14 03:45:00
16 2000-04-21 02:24:00
17 2000-04-28 01:03:00
18 2000-05-04 23:42:00
19 2000-05-11 22:21:00
Filtering:
In [85]: df.loc[df.tm > '2000-03-01']
Out[85]:
tm
9 2000-03-03 11:51:00
10 2000-03-10 10:30:00
11 2000-03-17 09:09:00
12 2000-03-24 07:48:00
13 2000-03-31 06:27:00
14 2000-04-07 05:06:00
15 2000-04-14 03:45:00
16 2000-04-21 02:24:00
17 2000-04-28 01:03:00
18 2000-05-04 23:42:00
19 2000-05-11 22:21:00
In [86]: df.loc[df.tm > '2000-3-1']
Out[86]:
tm
9 2000-03-03 11:51:00
10 2000-03-10 10:30:00
11 2000-03-17 09:09:00
12 2000-03-24 07:48:00
13 2000-03-31 06:27:00
14 2000-04-07 05:06:00
15 2000-04-14 03:45:00
16 2000-04-21 02:24:00
17 2000-04-28 01:03:00
18 2000-05-04 23:42:00
19 2000-05-11 22:21:00
A non-standard date format:
In [87]: df.loc[df.tm > pd.to_datetime('03/01/2000')]
Out[87]:
tm
9 2000-03-03 11:51:00
10 2000-03-10 10:30:00
11 2000-03-17 09:09:00
12 2000-03-24 07:48:00
13 2000-03-31 06:27:00
14 2000-04-07 05:06:00
15 2000-04-14 03:45:00
16 2000-04-21 02:24:00
17 2000-04-28 01:03:00
18 2000-05-04 23:42:00
19 2000-05-11 22:21:00
You need to ensure that the data you're comparing it with is also in the same format. Assuming that you have two datetime objects, you can do it like this:
import datetime
print(df.loc[df['Date'] >= datetime.date(2017, 1, 1), 'TimeRange'])
This will build a datetime.date object for the comparison and print the filtered results. You can also assign the filtered rows an updated value, as you mentioned above.
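Alternatively, comparing against a pd.Timestamp avoids mixing datetime.date objects in at all; a minimal sketch (the sample frame is made up for illustration):
import pandas as pd

df = pd.DataFrame({'Date': pd.to_datetime(['2016-12-31', '2017-01-15', '2017-06-01'])})
# datetime64 columns compare cleanly against a Timestamp
df.loc[df['Date'] >= pd.Timestamp('2017-01-01'), 'TimeRange'] = '2017.1'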
I have this dataframe df:
U,Datetime
01,2015-01-01 20:00:00
01,2015-02-01 20:05:00
01,2015-04-01 21:00:00
01,2015-05-01 22:00:00
01,2015-07-01 22:05:00
02,2015-08-01 20:00:00
02,2015-09-01 21:00:00
02,2014-01-01 23:00:00
02,2014-02-01 22:05:00
02,2015-01-01 20:00:00
02,2014-03-01 21:00:00
03,2015-10-01 20:00:00
03,2015-11-01 21:00:00
03,2015-12-01 23:00:00
03,2015-01-01 22:05:00
03,2015-02-01 20:00:00
03,2015-05-01 21:00:00
03,2014-01-01 20:00:00
03,2014-02-01 21:00:00
made up of U and a Datetime object. What I would like to do is filter the U values having at least three consecutive occurrences in months/year. So far I have grouped by U, year and month as:
m = df.groupby(['U',df.index.year,df.index.month]).size()
obtaining:
U
1 2015 1 1
2 1
4 1
5 1
7 1
2 2014 1 1
2 1
3 1
2015 1 1
8 1
9 1
3 2014 1 1
2 1
2015 1 1
2 1
5 1
10 1
11 1
12 1
The third column is related to the occurrences in different months/years. In this case, only the U values 02 and 03 contain at least three consecutive values in months/year. Now I can't figure out how to select those users and get them out in a list, for instance, or just keep them in the original dataframe df and discard the others. I also tried:
g = m.groupby(level=[0,1]).diff()
But I can't get any useful information.
I finally came up with the solution. :)
To give you an idea of how the custom function works: it simply subtracts each month value from the preceding one, and the result should of course be 1, and this should happen twice. For example, for the list [5, 6, 7]: 7 - 6 = 1 and 6 - 5 = 1, so 1 appears twice and the condition is fulfilled.
In [80]:
df.reset_index(inplace=True)
In [281]:
df['month'] = df.Datetime.dt.month
df['year'] = df.Datetime.dt.year
df
Out[281]:
Datetime U month year
0 2015-01-01 20:00:00 1 1 2015
1 2015-02-01 20:05:00 1 2 2015
2 2015-04-01 21:00:00 1 4 2015
3 2015-05-01 22:00:00 1 5 2015
4 2015-07-01 22:05:00 1 7 2015
5 2015-08-01 20:00:00 2 8 2015
6 2015-09-01 21:00:00 2 9 2015
7 2014-01-01 23:00:00 2 1 2014
8 2014-02-01 22:05:00 2 2 2014
9 2015-01-01 20:00:00 2 1 2015
10 2014-03-01 21:00:00 2 3 2014
11 2015-10-01 20:00:00 3 10 2015
12 2015-11-01 21:00:00 3 11 2015
13 2015-12-01 23:00:00 3 12 2015
14 2015-01-01 22:05:00 3 1 2015
15 2015-02-01 20:00:00 3 2 2015
16 2015-05-01 21:00:00 3 5 2015
17 2014-01-01 20:00:00 3 1 2014
18 2014-02-01 21:00:00 3 2 2014
In [284]:
g = df.groupby([df['U'] , df.year])
In [86]:
res = g.filter(lambda x : is_at_least_three_consec(x['month'].diff().values.tolist()))
res
Out[86]:
Datetime U month year
7 2014-01-01 23:00:00 2 1 2014
8 2014-02-01 22:05:00 2 2 2014
10 2014-03-01 21:00:00 2 3 2014
11 2015-10-01 20:00:00 3 10 2015
12 2015-11-01 21:00:00 3 11 2015
13 2015-12-01 23:00:00 3 12 2015
14 2015-01-01 22:05:00 3 1 2015
15 2015-02-01 20:00:00 3 2 2015
16 2015-05-01 21:00:00 3 5 2015
If you want to see the result of the custom function:
In [84]:
res = g['month'].agg(lambda x : is_at_least_three_consec(x.diff().values.tolist()))
res
Out[84]:
U year
1 2015 False
2 2014 True
2015 False
3 2014 False
2015 True
Name: month, dtype: bool
This is how the custom function is implemented:
In [53]:
def is_at_least_three_consec(month_diff):
    consec_count = 0
    # print(month_diff)
    for index, val in enumerate(month_diff):
        if index != 0 and val == 1:
            consec_count += 1
            if consec_count == 2:
                return True
        else:
            consec_count = 0
    return False
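For example, calling it on diff lists by hand (the first element is NaN, just as .diff() produces):
# months [5, 6, 7] -> diffs [NaN, 1, 1]: two consecutive 1s -> True
print(is_at_least_three_consec([float('nan'), 1.0, 1.0]))   # True
# months [5, 6, 8] -> diffs [NaN, 1, 2]: the run is broken -> False
print(is_at_least_three_consec([float('nan'), 1.0, 2.0]))   # False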
I have a dataframe that can be simplified as:
date id
0 02/04/2015 02:34 1
1 06/04/2015 12:34 2
2 09/04/2015 23:03 3
3 12/04/2015 01:00 4
4 15/04/2015 07:12 5
5 21/04/2015 12:59 6
6 29/04/2015 17:33 7
7 04/05/2015 10:44 8
8 06/05/2015 11:12 9
9 10/05/2015 08:52 10
10 12/05/2015 14:19 11
11 19/05/2015 19:22 12
12 27/05/2015 22:31 13
13 01/06/2015 11:09 14
14 04/06/2015 12:57 15
15 10/06/2015 04:00 16
16 15/06/2015 03:23 17
17 19/06/2015 05:37 18
18 23/06/2015 13:41 19
19 27/06/2015 15:43 20
It can be created using:
tempDF = pd.DataFrame({ 'id': [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20],
'date': ["02/04/2015 02:34","06/04/2015 12:34","09/04/2015 23:03","12/04/2015 01:00","15/04/2015 07:12","21/04/2015 12:59","29/04/2015 17:33","04/05/2015 10:44","06/05/2015 11:12","10/05/2015 08:52","12/05/2015 14:19","19/05/2015 19:22","27/05/2015 22:31","01/06/2015 11:09","04/06/2015 12:57","10/06/2015 04:00","15/06/2015 03:23","19/06/2015 05:37","23/06/2015 13:41","27/06/2015 15:43"]})
The data has the following types:
tempDF.dtypes
date object
id int64
dtype: object
I have set the 'date' variable to Pandas datetime64 format (if that's the right way to describe it) using:
import numpy as np
import pandas as pd
tempDF['date'] = pd.to_datetime(tempDF['date'])
So now, the dtypes look like:
tempDF.dtypes
date datetime64[ns]
id int64
dtype: object
I want to change the hours of the original date data. I can use .normalize() to convert to midnight via the .dt accessor:
tempDF['date'] = tempDF['date'].dt.normalize()
And, I can get access to individual datetime components (e.g. year) using:
tempDF['date'].dt.year
This produces:
0 2015
1 2015
2 2015
3 2015
4 2015
5 2015
6 2015
7 2015
8 2015
9 2015
10 2015
11 2015
12 2015
13 2015
14 2015
15 2015
16 2015
17 2015
18 2015
19 2015
Name: date, dtype: int64
The question is: how can I change specific date and time components? For example, how could I change the time to midday (12:00) for all the dates? I've found that datetime.datetime has a .replace() method. However, having converted the dates to Pandas format, it would make sense to stay in that format. Is there a way to do that without changing the format again?
EDIT:
A vectorized way to do this would be to normalize the series and then add 12 hours to it using a timedelta. Example -
tempDF['date'].dt.normalize() + datetime.timedelta(hours=12)
Demo -
In [59]: tempDF
Out[59]:
date id
0 2015-02-04 12:00:00 1
1 2015-06-04 12:00:00 2
2 2015-09-04 12:00:00 3
3 2015-12-04 12:00:00 4
4 2015-04-15 12:00:00 5
5 2015-04-21 12:00:00 6
6 2015-04-29 12:00:00 7
7 2015-04-05 12:00:00 8
8 2015-06-05 12:00:00 9
9 2015-10-05 12:00:00 10
10 2015-12-05 12:00:00 11
11 2015-05-19 12:00:00 12
12 2015-05-27 12:00:00 13
13 2015-01-06 12:00:00 14
14 2015-04-06 12:00:00 15
15 2015-10-06 12:00:00 16
16 2015-06-15 12:00:00 17
17 2015-06-19 12:00:00 18
18 2015-06-23 12:00:00 19
19 2015-06-27 12:00:00 20
In [60]: tempDF['date'].dt.normalize() + datetime.timedelta(hours=12)
Out[60]:
0 2015-02-04 12:00:00
1 2015-06-04 12:00:00
2 2015-09-04 12:00:00
3 2015-12-04 12:00:00
4 2015-04-15 12:00:00
5 2015-04-21 12:00:00
6 2015-04-29 12:00:00
7 2015-04-05 12:00:00
8 2015-06-05 12:00:00
9 2015-10-05 12:00:00
10 2015-12-05 12:00:00
11 2015-05-19 12:00:00
12 2015-05-27 12:00:00
13 2015-01-06 12:00:00
14 2015-04-06 12:00:00
15 2015-10-06 12:00:00
16 2015-06-15 12:00:00
17 2015-06-19 12:00:00
18 2015-06-23 12:00:00
19 2015-06-27 12:00:00
dtype: datetime64[ns]
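pandas' own pd.Timedelta works the same way here, so an equivalent that avoids the datetime import would be:
tempDF['date'].dt.normalize() + pd.Timedelta(hours=12)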
Timing information for both methods at bottom
One method would be to use Series.apply along with the .replace() method the OP mentions in the post. Example -
tempDF['date'] = tempDF['date'].apply(lambda x:x.replace(hour=12,minute=0))
Demo -
In [12]: tempDF
Out[12]:
date id
0 2015-02-04 02:34:00 1
1 2015-06-04 12:34:00 2
2 2015-09-04 23:03:00 3
3 2015-12-04 01:00:00 4
4 2015-04-15 07:12:00 5
5 2015-04-21 12:59:00 6
6 2015-04-29 17:33:00 7
7 2015-04-05 10:44:00 8
8 2015-06-05 11:12:00 9
9 2015-10-05 08:52:00 10
10 2015-12-05 14:19:00 11
11 2015-05-19 19:22:00 12
12 2015-05-27 22:31:00 13
13 2015-01-06 11:09:00 14
14 2015-04-06 12:57:00 15
15 2015-10-06 04:00:00 16
16 2015-06-15 03:23:00 17
17 2015-06-19 05:37:00 18
18 2015-06-23 13:41:00 19
19 2015-06-27 15:43:00 20
In [13]: tempDF['date'] = tempDF['date'].apply(lambda x:x.replace(hour=12,minute=0))
In [14]: tempDF
Out[14]:
date id
0 2015-02-04 12:00:00 1
1 2015-06-04 12:00:00 2
2 2015-09-04 12:00:00 3
3 2015-12-04 12:00:00 4
4 2015-04-15 12:00:00 5
5 2015-04-21 12:00:00 6
6 2015-04-29 12:00:00 7
7 2015-04-05 12:00:00 8
8 2015-06-05 12:00:00 9
9 2015-10-05 12:00:00 10
10 2015-12-05 12:00:00 11
11 2015-05-19 12:00:00 12
12 2015-05-27 12:00:00 13
13 2015-01-06 12:00:00 14
14 2015-04-06 12:00:00 15
15 2015-10-06 12:00:00 16
16 2015-06-15 12:00:00 17
17 2015-06-19 12:00:00 18
18 2015-06-23 12:00:00 19
19 2015-06-27 12:00:00 20
Timing information
In [52]: df = pd.DataFrame([[datetime.datetime.now()] for _ in range(100000)],columns=['date'])
In [54]: %%timeit
....: df['date'].dt.normalize() + datetime.timedelta(hours=12)
....:
The slowest run took 12.53 times longer than the fastest. This could mean that an intermediate result is being cached
1 loops, best of 3: 32.3 ms per loop
In [57]: %%timeit
....: df['date'].apply(lambda x:x.replace(hour=12,minute=0))
....:
1 loops, best of 3: 1.09 s per loop
Here's the solution I used to replace the time component of the datetime values in a Pandas DataFrame. Not sure how efficient this solution is, but it fit my needs.
import pandas as pd
# Create a list of EOCY dates for a specified period
sDate = pd.Timestamp('2022-01-31 23:59:00')
eDate = pd.Timestamp('2060-01-31 23:59:00')
dtList = pd.date_range(sDate, eDate, freq='Y').to_pydatetime()
# Create a DataFrame with a single column called 'Date' and fill the rows with the list of EOCY dates.
df = pd.DataFrame({'Date': dtList})
# Loop through the DataFrame rows using the replace function to replace the hours and minutes of each date value.
for i in range(df.shape[0]):
    df.iloc[i, 0] = df.iloc[i, 0].replace(hour=0, minute=0)
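A vectorized alternative to the loop, assuming (as in this example) that the seconds are already zero so only the hours and minutes need zeroing, is the .dt.normalize() approach shown earlier:
df['Date'] = df['Date'].dt.normalize()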
I have a dataframe that contains a column with a date (StartTime) in the following format: 28-7-2015 0:09:00. The same dataframe also contains a column with a number of seconds (SetupDuration1).
I would like to create a new column that subtracts the number of seconds from the date field:
dftask['Start'] = dftask['StartTime'] - dftask['SetupDuration1']
The SetupDuration1 column is a numeric column and must stay numeric, because I do different operations on it (take absolute values, etc.).
So how should I subtract the number of seconds in the correct way?
Apply a lambda to convert to a timedelta and then subtract:
In [88]:
df = pd.DataFrame({'StartTime':pd.date_range(start=dt.datetime(2015,1,1), end = dt.datetime(2015,2,1)), 'SetupDuration1':np.random.randint(0, 59, size=32)})
df
Out[88]:
SetupDuration1 StartTime
0 14 2015-01-01
1 55 2015-01-02
2 21 2015-01-03
3 50 2015-01-04
4 21 2015-01-05
5 6 2015-01-06
6 6 2015-01-07
7 2 2015-01-08
8 10 2015-01-09
9 3 2015-01-10
10 11 2015-01-11
11 32 2015-01-12
12 53 2015-01-13
13 45 2015-01-14
14 48 2015-01-15
15 23 2015-01-16
16 7 2015-01-17
17 5 2015-01-18
18 18 2015-01-19
19 26 2015-01-20
20 48 2015-01-21
21 8 2015-01-22
22 58 2015-01-23
23 24 2015-01-24
24 47 2015-01-25
25 10 2015-01-26
26 32 2015-01-27
27 26 2015-01-28
28 36 2015-01-29
29 36 2015-01-30
30 40 2015-01-31
31 18 2015-02-01
In [94]:
df['Start'] = df['StartTime'] - df['SetupDuration1'].apply(lambda x: pd.Timedelta(x, 's'))
df
Out[94]:
SetupDuration1 StartTime Start
0 14 2015-01-01 2014-12-31 23:59:46
1 55 2015-01-02 2015-01-01 23:59:05
2 21 2015-01-03 2015-01-02 23:59:39
3 50 2015-01-04 2015-01-03 23:59:10
4 21 2015-01-05 2015-01-04 23:59:39
5 6 2015-01-06 2015-01-05 23:59:54
6 6 2015-01-07 2015-01-06 23:59:54
7 2 2015-01-08 2015-01-07 23:59:58
8 10 2015-01-09 2015-01-08 23:59:50
9 3 2015-01-10 2015-01-09 23:59:57
10 11 2015-01-11 2015-01-10 23:59:49
11 32 2015-01-12 2015-01-11 23:59:28
12 53 2015-01-13 2015-01-12 23:59:07
13 45 2015-01-14 2015-01-13 23:59:15
14 48 2015-01-15 2015-01-14 23:59:12
15 23 2015-01-16 2015-01-15 23:59:37
16 7 2015-01-17 2015-01-16 23:59:53
17 5 2015-01-18 2015-01-17 23:59:55
18 18 2015-01-19 2015-01-18 23:59:42
19 26 2015-01-20 2015-01-19 23:59:34
20 48 2015-01-21 2015-01-20 23:59:12
21 8 2015-01-22 2015-01-21 23:59:52
22 58 2015-01-23 2015-01-22 23:59:02
23 24 2015-01-24 2015-01-23 23:59:36
24 47 2015-01-25 2015-01-24 23:59:13
25 10 2015-01-26 2015-01-25 23:59:50
26 32 2015-01-27 2015-01-26 23:59:28
27 26 2015-01-28 2015-01-27 23:59:34
28 36 2015-01-29 2015-01-28 23:59:24
29 36 2015-01-30 2015-01-29 23:59:24
30 40 2015-01-31 2015-01-30 23:59:20
31 18 2015-02-01 2015-01-31 23:59:42
Timings
Actually, it looks quicker to just construct a TimedeltaIndex in place:
In [99]:
%timeit df['Start'] = df['StartTime'] - pd.TimedeltaIndex(df['SetupDuration1'], unit='s')
1000 loops, best of 3: 837 µs per loop
In [100]:
%timeit df['Start'] = df['StartTime'] - df['SetupDuration1'].apply(lambda x: pd.Timedelta(x, 's'))
100 loops, best of 3: 1.97 ms per loop
So I'd just do:
In [101]:
df['Start'] = df['StartTime'] - pd.TimedeltaIndex(df['SetupDuration1'], unit='s')
df
Out[101]:
SetupDuration1 StartTime Start
0 14 2015-01-01 2014-12-31 23:59:46
1 55 2015-01-02 2015-01-01 23:59:05
2 21 2015-01-03 2015-01-02 23:59:39
3 50 2015-01-04 2015-01-03 23:59:10
4 21 2015-01-05 2015-01-04 23:59:39
5 6 2015-01-06 2015-01-05 23:59:54
6 6 2015-01-07 2015-01-06 23:59:54
7 2 2015-01-08 2015-01-07 23:59:58
8 10 2015-01-09 2015-01-08 23:59:50
9 3 2015-01-10 2015-01-09 23:59:57
10 11 2015-01-11 2015-01-10 23:59:49
11 32 2015-01-12 2015-01-11 23:59:28
12 53 2015-01-13 2015-01-12 23:59:07
13 45 2015-01-14 2015-01-13 23:59:15
14 48 2015-01-15 2015-01-14 23:59:12
15 23 2015-01-16 2015-01-15 23:59:37
16 7 2015-01-17 2015-01-16 23:59:53
17 5 2015-01-18 2015-01-17 23:59:55
18 18 2015-01-19 2015-01-18 23:59:42
19 26 2015-01-20 2015-01-19 23:59:34
20 48 2015-01-21 2015-01-20 23:59:12
21 8 2015-01-22 2015-01-21 23:59:52
22 58 2015-01-23 2015-01-22 23:59:02
23 24 2015-01-24 2015-01-23 23:59:36
24 47 2015-01-25 2015-01-24 23:59:13
25 10 2015-01-26 2015-01-25 23:59:50
26 32 2015-01-27 2015-01-26 23:59:28
27 26 2015-01-28 2015-01-27 23:59:34
28 36 2015-01-29 2015-01-28 23:59:24
29 36 2015-01-30 2015-01-29 23:59:24
30 40 2015-01-31 2015-01-30 23:59:20
31 18 2015-02-01 2015-01-31 23:59:42
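On current pandas you'd more likely reach for pd.to_timedelta, which does the same vectorized conversion:
df['Start'] = df['StartTime'] - pd.to_timedelta(df['SetupDuration1'], unit='s')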
I have the following dataframe:
date value
2014-01-20 10
2014-01-21 12
2014-01-22 13
2014-01-23 9
2014-01-24 7
2014-01-25 12
2014-01-26 11
I need to be able to keep track of when the latest maximum and minimum value occurred within a specific rolling window. For example if I were to use a rolling window period of 5, then I would need an output like the following:
date value rolling_max_date rolling_min_date
2014-01-20 10 2014-01-20 2014-01-20
2014-01-21 12 2014-01-21 2014-01-20
2014-01-22 13 2014-01-22 2014-01-20
2014-01-23 9 2014-01-22 2014-01-23
2014-01-24 7 2014-01-22 2014-01-24
2014-01-25 12 2014-01-22 2014-01-24
2014-01-26 11 2014-01-25 2014-01-24
All this shows is the date of the latest maximum and minimum value within the rolling window. I know pandas has rolling_min and rolling_max, but I'm not sure how to keep track of the index/date of when the most recent max/min occurred within the window.
There is a more general rolling_apply where you can provide your own function. However, the custom function receives the windows as arrays, not dataframes, so the index information is not available (and you cannot use idxmin/idxmax).
But let's try to achieve this in two steps:
In [41]: df = df.set_index('date')
In [42]: pd.rolling_apply(df, window=5, func=lambda x: x.argmin(), min_periods=1)
Out[42]:
value
date
2014-01-20 0
2014-01-21 0
2014-01-22 0
2014-01-23 3
2014-01-24 4
2014-01-25 3
2014-01-26 2
This gives you the index in the window where the minimum is found. But this index is relative to that particular window, not to the entire dataframe. So let's add the start of the window, and then use this integer location to retrieve the correct index labels:
In [45]: ilocs_window = pd.rolling_apply(df, window=5, func=lambda x: x.argmin(), min_periods=1)
In [46]: ilocs = ilocs_window['value'] + ([0, 0, 0, 0] + list(range(len(ilocs_window) - 4)))
In [47]: ilocs
Out[47]:
date
2014-01-20 0
2014-01-21 0
2014-01-22 0
2014-01-23 3
2014-01-24 4
2014-01-25 4
2014-01-26 4
Name: value, dtype: float64
In [48]: df.index.take(ilocs)
Out[48]:
Index([u'2014-01-20', u'2014-01-20', u'2014-01-20', u'2014-01-23',
u'2014-01-24', u'2014-01-24', u'2014-01-24'],
dtype='object', name=u'date')
In [49]: df['rolling_min_date'] = df.index.take(ilocs)
In [50]: df
Out[50]:
value rolling_min_date
date
2014-01-20 10 2014-01-20
2014-01-21 12 2014-01-20
2014-01-22 13 2014-01-20
2014-01-23 9 2014-01-23
2014-01-24 7 2014-01-24
2014-01-25 12 2014-01-24
2014-01-26 11 2014-01-24
The same can be done for the maximum:
ilocs_window = pd.rolling_apply(df, window=5, func=lambda x: x.argmax(), min_periods=1)
ilocs = ilocs_window['value'] + ([0, 0, 0, 0] + list(range(len(ilocs_window) - 4)))
df['rolling_max_date'] = df.index.take(ilocs)
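Note that pd.rolling_apply has since been removed from pandas; a sketch of the same idea with the newer Rolling.apply API (assuming df is indexed by date as above) would be:
import numpy as np

# position of the min inside each window (raw=True passes numpy arrays)
ilocs_window = df['value'].rolling(window=5, min_periods=1).apply(np.argmin, raw=True)
# add each window's starting offset to get positions in the full frame
offsets = np.maximum(np.arange(len(df)) - 4, 0)
df['rolling_min_date'] = df.index.take((ilocs_window + offsets).astype(int))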
Here is a workaround.
import pandas as pd
import numpy as np
# sample data
# ===============================================
np.random.seed(0)
df = pd.DataFrame(np.random.randint(1,30,20), index=pd.date_range('2015-01-01', periods=20, freq='D'), columns=['value'])
df
value
2015-01-01 13
2015-01-02 16
2015-01-03 22
2015-01-04 1
2015-01-05 4
2015-01-06 28
2015-01-07 4
2015-01-08 8
2015-01-09 10
2015-01-10 20
2015-01-11 22
2015-01-12 19
2015-01-13 5
2015-01-14 24
2015-01-15 7
2015-01-16 25
2015-01-17 25
2015-01-18 13
2015-01-19 27
2015-01-20 2
# processing
# ==========================================
# your custom function to track the max/min value/date
def track_minmax(df):
    return pd.Series({'current_date': df.index[-1],
                      'rolling_max_val': df['value'].max(),
                      'rolling_max_date': df['value'].idxmax(),
                      'rolling_min_val': df['value'].min(),
                      'rolling_min_date': df['value'].idxmin()})
window = 5
# use list comprehension to do the for loop
pd.DataFrame([track_minmax(df.iloc[i:i+window]) for i in range(len(df)-window+1)]).set_index('current_date').reindex(df.index)
rolling_max_date rolling_max_val rolling_min_date rolling_min_val
2015-01-01 NaT NaN NaT NaN
2015-01-02 NaT NaN NaT NaN
2015-01-03 NaT NaN NaT NaN
2015-01-04 NaT NaN NaT NaN
2015-01-05 2015-01-03 22 2015-01-04 1
2015-01-06 2015-01-06 28 2015-01-04 1
2015-01-07 2015-01-06 28 2015-01-04 1
2015-01-08 2015-01-06 28 2015-01-04 1
2015-01-09 2015-01-06 28 2015-01-05 4
2015-01-10 2015-01-06 28 2015-01-07 4
2015-01-11 2015-01-11 22 2015-01-07 4
2015-01-12 2015-01-11 22 2015-01-08 8
2015-01-13 2015-01-11 22 2015-01-13 5
2015-01-14 2015-01-14 24 2015-01-13 5
2015-01-15 2015-01-14 24 2015-01-13 5
2015-01-16 2015-01-16 25 2015-01-13 5
2015-01-17 2015-01-16 25 2015-01-13 5
2015-01-18 2015-01-16 25 2015-01-15 7
2015-01-19 2015-01-19 27 2015-01-15 7
2015-01-20 2015-01-19 27 2015-01-20 2