Most recent max/min value - python

I have the following dataframe:
date value
2014-01-20 10
2014-01-21 12
2014-01-22 13
2014-01-23 9
2014-01-24 7
2014-01-25 12
2014-01-26 11
I need to be able to keep track of when the latest maximum and minimum value occurred within a specific rolling window. For example if I were to use a rolling window period of 5, then I would need an output like the following:
date value rolling_max_date rolling_min_date
2014-01-20 10 2014-01-20 2014-01-20
2014-01-21 12 2014-01-21 2014-01-20
2014-01-22 13 2014-01-22 2014-01-20
2014-01-23 9 2014-01-22 2014-01-23
2014-01-24 7 2014-01-22 2014-01-24
2014-01-25 12 2014-01-22 2014-01-24
2014-01-26 11 2014-01-25 2014-01-24
All this shows is the date of the latest maximum and minimum value within the rolling window. I know pandas has rolling_min and rolling_max, but I'm not sure how to keep track of the index/date of when the most recent max/min occurred within the window.

There is a more general rolling_apply where you can provide your own function. However, the custom function receives the windows as arrays, not dataframes, so the index information is not available (so you cannot use idxmin/idxmax).
But let's try to achieve this in two steps:
In [41]: df = df.set_index('date')
In [42]: pd.rolling_apply(df, window=5, func=lambda x: x.argmin(), min_periods=1)
Out[42]:
value
date
2014-01-20 0
2014-01-21 0
2014-01-22 0
2014-01-23 3
2014-01-24 4
2014-01-25 3
2014-01-26 2
This gives you the index in the window where the minimum is found. But this index is relative to that particular window, not to the entire dataframe. So let's add the start position of each window, and then use these integer locations to retrieve the correct index labels:
In [45]: ilocs_window = pd.rolling_apply(df, window=5, func=lambda x: x.argmin(), min_periods=1)
In [46]: ilocs = ilocs_window['value'] + ([0, 0, 0, 0] + list(range(len(ilocs_window)-4)))
In [47]: ilocs
Out[47]:
date
2014-01-20 0
2014-01-21 0
2014-01-22 0
2014-01-23 3
2014-01-24 4
2014-01-25 4
2014-01-26 4
Name: value, dtype: float64
In [48]: df.index.take(ilocs)
Out[48]:
Index([u'2014-01-20', u'2014-01-20', u'2014-01-20', u'2014-01-23',
u'2014-01-24', u'2014-01-24', u'2014-01-24'],
dtype='object', name=u'date')
In [49]: df['rolling_min_date'] = df.index.take(ilocs)
In [50]: df
Out[50]:
value rolling_min_date
date
2014-01-20 10 2014-01-20
2014-01-21 12 2014-01-20
2014-01-22 13 2014-01-20
2014-01-23 9 2014-01-23
2014-01-24 7 2014-01-24
2014-01-25 12 2014-01-24
2014-01-26 11 2014-01-24
The same can be done for the maximum:
ilocs_window = pd.rolling_apply(df, window=5, func=lambda x: x.argmax(), min_periods=1)
ilocs = ilocs_window['value'] + ([0, 0, 0, 0] + list(range(len(ilocs_window)-4)))
df['rolling_max_date'] = df.index.take(ilocs)
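In current pandas, pd.rolling_apply no longer exists; below is a minimal sketch of the same two-step idea using the Rolling.apply API (assuming pandas >= 1.0), on the question's data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"value": [10, 12, 13, 9, 7, 12, 11]},
    index=pd.to_datetime(
        ["2014-01-20", "2014-01-21", "2014-01-22", "2014-01-23",
         "2014-01-24", "2014-01-25", "2014-01-26"]
    ),
)
window = 5

# Step 1: position of the window minimum, relative to the window start
rel = df["value"].rolling(window, min_periods=1).apply(np.argmin, raw=True)

# Step 2: shift by each window's starting position to get absolute integer locations
start = np.maximum(np.arange(len(df)) - window + 1, 0)
ilocs = (rel.to_numpy() + start).astype(int)

df["rolling_min_date"] = df.index.take(ilocs)
```

The same works for the maximum with np.argmax. Note that np.argmin returns the first position on ties, matching the behaviour of the answer above.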

Here is a workaround.
import pandas as pd
import numpy as np
# sample data
# ===============================================
np.random.seed(0)
df = pd.DataFrame(np.random.randint(1,30,20), index=pd.date_range('2015-01-01', periods=20, freq='D'), columns=['value'])
df
value
2015-01-01 13
2015-01-02 16
2015-01-03 22
2015-01-04 1
2015-01-05 4
2015-01-06 28
2015-01-07 4
2015-01-08 8
2015-01-09 10
2015-01-10 20
2015-01-11 22
2015-01-12 19
2015-01-13 5
2015-01-14 24
2015-01-15 7
2015-01-16 25
2015-01-17 25
2015-01-18 13
2015-01-19 27
2015-01-20 2
# processing
# ==========================================
# your custom function to track the max/min value/date
def track_minmax(df):
    return pd.Series({'current_date': df.index[-1],
                      'rolling_max_val': df['value'].max(),
                      'rolling_max_date': df['value'].idxmax(),
                      'rolling_min_val': df['value'].min(),
                      'rolling_min_date': df['value'].idxmin()})
window = 5
# use list comprehension to do the for loop
pd.DataFrame([track_minmax(df.iloc[i:i+window]) for i in range(len(df)-window+1)]).set_index('current_date').reindex(df.index)
rolling_max_date rolling_max_val rolling_min_date rolling_min_val
2015-01-01 NaT NaN NaT NaN
2015-01-02 NaT NaN NaT NaN
2015-01-03 NaT NaN NaT NaN
2015-01-04 NaT NaN NaT NaN
2015-01-05 2015-01-03 22 2015-01-04 1
2015-01-06 2015-01-06 28 2015-01-04 1
2015-01-07 2015-01-06 28 2015-01-04 1
2015-01-08 2015-01-06 28 2015-01-04 1
2015-01-09 2015-01-06 28 2015-01-05 4
2015-01-10 2015-01-06 28 2015-01-07 4
2015-01-11 2015-01-11 22 2015-01-07 4
2015-01-12 2015-01-11 22 2015-01-08 8
2015-01-13 2015-01-11 22 2015-01-13 5
2015-01-14 2015-01-14 24 2015-01-13 5
2015-01-15 2015-01-14 24 2015-01-13 5
2015-01-16 2015-01-16 25 2015-01-13 5
2015-01-17 2015-01-16 25 2015-01-13 5
2015-01-18 2015-01-16 25 2015-01-15 7
2015-01-19 2015-01-19 27 2015-01-15 7
2015-01-20 2015-01-19 27 2015-01-20 2

Related

python: compare data of different date types

I have a question about comparing datetime64[ns] data with a date like '2017-01-01'.
Here is the code:
df.loc[(df['Date'] >= datetime.date(2017.1.1), 'TimeRange'] = '2017.1'
but an error was shown saying descriptor 'date' requires a 'datetime.datetime' object but received a 'int'.
How can I compare a datetime64 to a date (2017-01-01, 2017-6-1, and the like)?
Thanks
Demo:
Source DF:
In [83]: df = pd.DataFrame({'tm':pd.date_range('2000-01-01', freq='9999T', periods=20)})
In [84]: df
Out[84]:
tm
0 2000-01-01 00:00:00
1 2000-01-07 22:39:00
2 2000-01-14 21:18:00
3 2000-01-21 19:57:00
4 2000-01-28 18:36:00
5 2000-02-04 17:15:00
6 2000-02-11 15:54:00
7 2000-02-18 14:33:00
8 2000-02-25 13:12:00
9 2000-03-03 11:51:00
10 2000-03-10 10:30:00
11 2000-03-17 09:09:00
12 2000-03-24 07:48:00
13 2000-03-31 06:27:00
14 2000-04-07 05:06:00
15 2000-04-14 03:45:00
16 2000-04-21 02:24:00
17 2000-04-28 01:03:00
18 2000-05-04 23:42:00
19 2000-05-11 22:21:00
Filtering:
In [85]: df.loc[df.tm > '2000-03-01']
Out[85]:
tm
9 2000-03-03 11:51:00
10 2000-03-10 10:30:00
11 2000-03-17 09:09:00
12 2000-03-24 07:48:00
13 2000-03-31 06:27:00
14 2000-04-07 05:06:00
15 2000-04-14 03:45:00
16 2000-04-21 02:24:00
17 2000-04-28 01:03:00
18 2000-05-04 23:42:00
19 2000-05-11 22:21:00
In [86]: df.loc[df.tm > '2000-3-1']
Out[86]:
tm
9 2000-03-03 11:51:00
10 2000-03-10 10:30:00
11 2000-03-17 09:09:00
12 2000-03-24 07:48:00
13 2000-03-31 06:27:00
14 2000-04-07 05:06:00
15 2000-04-14 03:45:00
16 2000-04-21 02:24:00
17 2000-04-28 01:03:00
18 2000-05-04 23:42:00
19 2000-05-11 22:21:00
not standard date format:
In [87]: df.loc[df.tm > pd.to_datetime('03/01/2000')]
Out[87]:
tm
9 2000-03-03 11:51:00
10 2000-03-10 10:30:00
11 2000-03-17 09:09:00
12 2000-03-24 07:48:00
13 2000-03-31 06:27:00
14 2000-04-07 05:06:00
15 2000-04-14 03:45:00
16 2000-04-21 02:24:00
17 2000-04-28 01:03:00
18 2000-05-04 23:42:00
19 2000-05-11 22:21:00
You need to ensure that the data you're comparing it with is in the same format. Assuming you have datetime objects, you can do it like this:
import datetime
print(df.loc[df['Date'] >= datetime.date(2017, 1, 1), 'TimeRange'])
Note the commas: datetime.date(2017, 1, 1), not datetime.date(2017.1.1), which passes a float and causes the error you saw. This creates a proper date object and lists the filtered results. You can also assign the filtered rows an updated value as you mentioned above.
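A self-contained sketch of the comparison (with hypothetical column data); for a datetime64 column, an ISO date string or a pd.Timestamp compares identically, so you don't need to construct a datetime.date at all:

```python
import pandas as pd

df = pd.DataFrame({"Date": pd.to_datetime(["2016-12-30", "2017-01-02", "2017-06-15"])})

# Both comparisons are equivalent for a datetime64 column
m1 = df["Date"] >= "2017-01-01"
m2 = df["Date"] >= pd.Timestamp(2017, 1, 1)

# Label the matching rows, as in the question
df.loc[m1, "TimeRange"] = "2017.1"
```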

Pandas: Subtracting two date columns and the result being an integer

I have two columns in a Pandas data frame that are dates.
I am looking to subtract one column from another and the result being the difference in numbers of days as an integer.
A peek at the data:
df_test.head(10)
Out[20]:
First_Date Second Date
0 2016-02-09 2015-11-19
1 2016-01-06 2015-11-30
2 NaT 2015-12-04
3 2016-01-06 2015-12-08
4 NaT 2015-12-09
5 2016-01-07 2015-12-11
6 NaT 2015-12-12
7 NaT 2015-12-14
8 2016-01-06 2015-12-14
9 NaT 2015-12-15
I have created a new column successfully with the difference:
df_test['Difference'] = df_test['First_Date'].sub(df_test['Second Date'], axis=0)
df_test.head()
Out[22]:
First_Date Second Date Difference
0 2016-02-09 2015-11-19 82 days
1 2016-01-06 2015-11-30 37 days
2 NaT 2015-12-04 NaT
3 2016-01-06 2015-12-08 29 days
4 NaT 2015-12-09 NaT
However I am unable to get a numeric version of the result:
df_test['Difference'] = df_test[['Difference']].apply(pd.to_numeric)
df_test.head()
Out[25]:
First_Date Second Date Difference
0 2016-02-09 2015-11-19 7.084800e+15
1 2016-01-06 2015-11-30 3.196800e+15
2 NaT 2015-12-04 NaN
3 2016-01-06 2015-12-08 2.505600e+15
4 NaT 2015-12-09 NaN
How about:
df_test['Difference'] = (df_test['First_Date'] - df_test['Second Date']).dt.days
This will return the difference as int if there are no missing values (NaT) and as float if there are.
Pandas has rich documentation on Time series / date functionality and Time deltas.
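If you want to keep an integer dtype even with NaT present, one option (pandas >= 0.24) is the nullable Int64 dtype; a sketch using the question's first rows:

```python
import pandas as pd

df = pd.DataFrame({
    "First_Date": pd.to_datetime(["2016-02-09", "2016-01-06", pd.NaT]),
    "Second Date": pd.to_datetime(["2015-11-19", "2015-11-30", "2015-12-04"]),
})

# .dt.days gives float64 when NaT is present; Int64 keeps integers with <NA>
df["Difference"] = (df["First_Date"] - df["Second Date"]).dt.days.astype("Int64")
```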
You can divide a column of dtype timedelta by np.timedelta64(1, 'D'), but the output is not int but float, because of NaN values:
df_test['Difference'] = df_test['Difference'] / np.timedelta64(1, 'D')
print (df_test)
First_Date Second Date Difference
0 2016-02-09 2015-11-19 82.0
1 2016-01-06 2015-11-30 37.0
2 NaT 2015-12-04 NaN
3 2016-01-06 2015-12-08 29.0
4 NaT 2015-12-09 NaN
5 2016-01-07 2015-12-11 27.0
6 NaT 2015-12-12 NaN
7 NaT 2015-12-14 NaN
8 2016-01-06 2015-12-14 23.0
9 NaT 2015-12-15 NaN
You can use the datetime module to help here. Also, as a side note, a simple date subtraction should work as shown below:
import datetime as dt
import numpy as np
import pandas as pd
#Assume we have df_test:
In [222]: df_test
Out[222]:
first_date second_date
0 2016-01-31 2015-11-19
1 2016-02-29 2015-11-20
2 2016-03-31 2015-11-21
3 2016-04-30 2015-11-22
4 2016-05-31 2015-11-23
5 2016-06-30 2015-11-24
6 NaT 2015-11-25
7 NaT 2015-11-26
8 2016-01-31 2015-11-27
9 NaT 2015-11-28
10 NaT 2015-11-29
11 NaT 2015-11-30
12 2016-04-30 2015-12-01
13 NaT 2015-12-02
14 NaT 2015-12-03
15 2016-04-30 2015-12-04
16 NaT 2015-12-05
17 NaT 2015-12-06
In [223]: df_test['Difference'] = df_test['first_date'] - df_test['second_date']
In [224]: df_test
Out[224]:
first_date second_date Difference
0 2016-01-31 2015-11-19 73 days
1 2016-02-29 2015-11-20 101 days
2 2016-03-31 2015-11-21 131 days
3 2016-04-30 2015-11-22 160 days
4 2016-05-31 2015-11-23 190 days
5 2016-06-30 2015-11-24 219 days
6 NaT 2015-11-25 NaT
7 NaT 2015-11-26 NaT
8 2016-01-31 2015-11-27 65 days
9 NaT 2015-11-28 NaT
10 NaT 2015-11-29 NaT
11 NaT 2015-11-30 NaT
12 2016-04-30 2015-12-01 151 days
13 NaT 2015-12-02 NaT
14 NaT 2015-12-03 NaT
15 2016-04-30 2015-12-04 148 days
16 NaT 2015-12-05 NaT
17 NaT 2015-12-06 NaT
Now, change the type to datetime.timedelta, and then use the .days attribute on valid timedelta objects.
In [226]: df_test['Diffference'] = df_test['Difference'].astype(dt.timedelta).map(lambda x: np.nan if pd.isnull(x) else x.days)
In [227]: df_test
Out[227]:
first_date second_date Difference Diffference
0 2016-01-31 2015-11-19 73 days 73
1 2016-02-29 2015-11-20 101 days 101
2 2016-03-31 2015-11-21 131 days 131
3 2016-04-30 2015-11-22 160 days 160
4 2016-05-31 2015-11-23 190 days 190
5 2016-06-30 2015-11-24 219 days 219
6 NaT 2015-11-25 NaT NaN
7 NaT 2015-11-26 NaT NaN
8 2016-01-31 2015-11-27 65 days 65
9 NaT 2015-11-28 NaT NaN
10 NaT 2015-11-29 NaT NaN
11 NaT 2015-11-30 NaT NaN
12 2016-04-30 2015-12-01 151 days 151
13 NaT 2015-12-02 NaT NaN
14 NaT 2015-12-03 NaT NaN
15 2016-04-30 2015-12-04 148 days 148
16 NaT 2015-12-05 NaT NaN
17 NaT 2015-12-06 NaT NaN
Hope that helps.
I feel that the overall answer does not handle dates that 'wrap' around a year boundary. This is useful when proximity to a date should be accurate by day of year. To do these row operations, I did the following. (I used this in a business setting for renewing customer subscriptions.)
def get_date_difference(row, x, y):
    try:
        # Calculating the smallest date difference between the start and the close date
        # There's some tricky logic in here to determine the date difference
        # the other way around (Dec -> Jan is 1 month rather than 11)
        sub_start_date = int(row[x].strftime('%j'))  # day of year (1-366)
        close_date = int(row[y].strftime('%j'))  # day of year (1-366)
        later_date_of_year = max(sub_start_date, close_date)
        earlier_date_of_year = min(sub_start_date, close_date)
        days_diff = later_date_of_year - earlier_date_of_year
        # Calculates the difference going across the year boundary (December -> Jan)
        days_diff_reversed = (365 - later_date_of_year) + earlier_date_of_year
        return min(days_diff, days_diff_reversed)
    except ValueError:
        return None
Then the function can be applied row-wise:
dfAC_Renew['date_difference'] = dfAC_Renew.apply(get_date_difference, x='customer_since_date', y='renewal_date', axis=1)
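A minimal vectorized sketch of the same day-of-year wrap logic (keeping the answer's 365-day approximation, and using hypothetical column names matching the snippet above):

```python
import pandas as pd

df = pd.DataFrame({
    "customer_since_date": pd.to_datetime(["2019-12-28"]),
    "renewal_date": pd.to_datetime(["2020-01-03"]),
})

# Direct day-of-year gap vs. the gap going across the year boundary
doy_a = df["customer_since_date"].dt.dayofyear
doy_b = df["renewal_date"].dt.dayofyear
direct = (doy_a - doy_b).abs()
wrapped = 365 - direct
df["date_difference"] = pd.concat([direct, wrapped], axis=1).min(axis=1)
# 28 Dec -> 3 Jan comes out as 6 days, not 359
```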
Create a vectorized method
import numpy as np
from pandas.tseries.frequencies import to_offset

def calc_xb_minus_xa(df):
    time_dict = {
        '<Minute>': 'm',
        '<Hour>': 'h',
        '<Day>': 'D',
        '<Week>': 'W',
        '<Month>': 'M',
        '<Year>': 'Y'
    }
    # Infer the time unit from the first row's interval
    time_delta = df.at[df.index[0], 'end_time'] - df.at[df.index[0], 'open_time']
    offset_base_name = str(to_offset(time_delta).base)
    time_term = time_dict.get(offset_base_name)
    result = (df.end_time - df.open_time) / np.timedelta64(1, time_term)
    return result
Then in your df do:
df['x'] = calc_xb_minus_xa(df)
This will work for minutes, hours, days, weeks, months and years.
open_time and end_time need to change according to your df.

How do I delete holidays from a Pandas series?

I have a pandas series with two columns and lots of rows like:
r =
1-10-2010 3.4
1-11-2010 4.5
1-12-2010 3.7
... ...
What I'd like to do is to remove days of the week not in a custom week. So to remove Fridays and Saturdays, do something like this:
r = amazingfunction(r, ('Sun', 'Mon', 'Tue', 'Wed', 'Thu'))
r =
1-10-2010 3.4
1-11-2010 4.5
1-12-2010 3.7
1-13-2010 3.4
1-14-2010 4.1
1-17-2010 4.5
1-18-2010 3.7
... ...
How can I go about this?
You can use dt.dayofweek and isin to filter the df; here Friday and Saturday are 4 and 5 respectively, and we negate the boolean mask using ~:
In [12]:
df = pd.DataFrame({'dates':pd.date_range(dt.datetime(2015,1,1), dt.datetime(2015,2,1))})
df['dayofweek'] = df['dates'].dt.dayofweek
df
Out[12]:
dates dayofweek
0 2015-01-01 3
1 2015-01-02 4
2 2015-01-03 5
3 2015-01-04 6
4 2015-01-05 0
5 2015-01-06 1
6 2015-01-07 2
7 2015-01-08 3
8 2015-01-09 4
9 2015-01-10 5
10 2015-01-11 6
11 2015-01-12 0
12 2015-01-13 1
13 2015-01-14 2
14 2015-01-15 3
15 2015-01-16 4
16 2015-01-17 5
17 2015-01-18 6
18 2015-01-19 0
19 2015-01-20 1
20 2015-01-21 2
21 2015-01-22 3
22 2015-01-23 4
23 2015-01-24 5
24 2015-01-25 6
25 2015-01-26 0
26 2015-01-27 1
27 2015-01-28 2
28 2015-01-29 3
29 2015-01-30 4
30 2015-01-31 5
31 2015-02-01 6
In [13]:
df[~df['dates'].dt.dayofweek.isin([4,5])]
Out[13]:
dates dayofweek
0 2015-01-01 3
3 2015-01-04 6
4 2015-01-05 0
5 2015-01-06 1
6 2015-01-07 2
7 2015-01-08 3
10 2015-01-11 6
11 2015-01-12 0
12 2015-01-13 1
13 2015-01-14 2
14 2015-01-15 3
17 2015-01-18 6
18 2015-01-19 0
19 2015-01-20 1
20 2015-01-21 2
21 2015-01-22 3
24 2015-01-25 6
25 2015-01-26 0
26 2015-01-27 1
27 2015-01-28 2
28 2015-01-29 3
31 2015-02-01 6
EDIT
As your data is a Series, your dates are in the index, so the following should work:
r[~r.index.dayofweek.isin([4,5])]
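Applied to a small series like the one in the question (assuming the dates are the DatetimeIndex):

```python
import pandas as pd

r = pd.Series(
    [3.4, 4.5, 3.7, 3.4, 4.1, 2.9, 3.3],
    index=pd.date_range("2010-01-10", periods=7, freq="D"),
)

# Friday=4, Saturday=5; keep the custom Sun-Thu week by negating the mask
kept = r[~r.index.dayofweek.isin([4, 5])]
```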

Changing time components of pandas datetime64 column

I have a dataframe that can be simplified as:
date id
0 02/04/2015 02:34 1
1 06/04/2015 12:34 2
2 09/04/2015 23:03 3
3 12/04/2015 01:00 4
4 15/04/2015 07:12 5
5 21/04/2015 12:59 6
6 29/04/2015 17:33 7
7 04/05/2015 10:44 8
8 06/05/2015 11:12 9
9 10/05/2015 08:52 10
10 12/05/2015 14:19 11
11 19/05/2015 19:22 12
12 27/05/2015 22:31 13
13 01/06/2015 11:09 14
14 04/06/2015 12:57 15
15 10/06/2015 04:00 16
16 15/06/2015 03:23 17
17 19/06/2015 05:37 18
18 23/06/2015 13:41 19
19 27/06/2015 15:43 20
It can be created using:
tempDF = pd.DataFrame({ 'id': [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20],
'date': ["02/04/2015 02:34","06/04/2015 12:34","09/04/2015 23:03","12/04/2015 01:00","15/04/2015 07:12","21/04/2015 12:59","29/04/2015 17:33","04/05/2015 10:44","06/05/2015 11:12","10/05/2015 08:52","12/05/2015 14:19","19/05/2015 19:22","27/05/2015 22:31","01/06/2015 11:09","04/06/2015 12:57","10/06/2015 04:00","15/06/2015 03:23","19/06/2015 05:37","23/06/2015 13:41","27/06/2015 15:43"]})
The data has the following types:
tempDF.dtypes
date object
id int64
dtype: object
I have set the 'date' variable to Pandas datetime64 format (if that's the right way to describe it) using:
import numpy as np
import pandas as pd
tempDF['date'] = pd.to_datetime(tempDF['date'])
So now, the dtypes look like:
tempDF.dtypes
date datetime64[ns]
id int64
dtype: object
I want to change the hours of the original date data. I can use .normalize() to convert to midnight via the .dt accessor:
tempDF['date'] = tempDF['date'].dt.normalize()
And, I can get access to individual datetime components (e.g. year) using:
tempDF['date'].dt.year
This produces:
0 2015
1 2015
2 2015
3 2015
4 2015
5 2015
6 2015
7 2015
8 2015
9 2015
10 2015
11 2015
12 2015
13 2015
14 2015
15 2015
16 2015
17 2015
18 2015
19 2015
Name: date, dtype: int64
The question is, how can I change specific date and time components? For example, how could I change the time to midday (12:00) for all dates? I've found that datetime.datetime has a .replace() method. However, having converted the dates to Pandas format, it would make sense to keep them in that format. Is there a way to do that without changing the format again?
EDIT :
A vectorized way to do this would be to normalize the series, and then add 12 hours to it using timedelta. Example -
tempDF['date'].dt.normalize() + datetime.timedelta(hours=12)
Demo -
In [59]: tempDF
Out[59]:
date id
0 2015-02-04 12:00:00 1
1 2015-06-04 12:00:00 2
2 2015-09-04 12:00:00 3
3 2015-12-04 12:00:00 4
4 2015-04-15 12:00:00 5
5 2015-04-21 12:00:00 6
6 2015-04-29 12:00:00 7
7 2015-04-05 12:00:00 8
8 2015-06-05 12:00:00 9
9 2015-10-05 12:00:00 10
10 2015-12-05 12:00:00 11
11 2015-05-19 12:00:00 12
12 2015-05-27 12:00:00 13
13 2015-01-06 12:00:00 14
14 2015-04-06 12:00:00 15
15 2015-10-06 12:00:00 16
16 2015-06-15 12:00:00 17
17 2015-06-19 12:00:00 18
18 2015-06-23 12:00:00 19
19 2015-06-27 12:00:00 20
In [60]: tempDF['date'].dt.normalize() + datetime.timedelta(hours=12)
Out[60]:
0 2015-02-04 12:00:00
1 2015-06-04 12:00:00
2 2015-09-04 12:00:00
3 2015-12-04 12:00:00
4 2015-04-15 12:00:00
5 2015-04-21 12:00:00
6 2015-04-29 12:00:00
7 2015-04-05 12:00:00
8 2015-06-05 12:00:00
9 2015-10-05 12:00:00
10 2015-12-05 12:00:00
11 2015-05-19 12:00:00
12 2015-05-27 12:00:00
13 2015-01-06 12:00:00
14 2015-04-06 12:00:00
15 2015-10-06 12:00:00
16 2015-06-15 12:00:00
17 2015-06-19 12:00:00
18 2015-06-23 12:00:00
19 2015-06-27 12:00:00
dtype: datetime64[ns]
Timing information for both methods at bottom
One method would be to use Series.apply along with the .replace() method OP mentions in his post. Example -
tempDF['date'] = tempDF['date'].apply(lambda x:x.replace(hour=12,minute=0))
Demo -
In [12]: tempDF
Out[12]:
date id
0 2015-02-04 02:34:00 1
1 2015-06-04 12:34:00 2
2 2015-09-04 23:03:00 3
3 2015-12-04 01:00:00 4
4 2015-04-15 07:12:00 5
5 2015-04-21 12:59:00 6
6 2015-04-29 17:33:00 7
7 2015-04-05 10:44:00 8
8 2015-06-05 11:12:00 9
9 2015-10-05 08:52:00 10
10 2015-12-05 14:19:00 11
11 2015-05-19 19:22:00 12
12 2015-05-27 22:31:00 13
13 2015-01-06 11:09:00 14
14 2015-04-06 12:57:00 15
15 2015-10-06 04:00:00 16
16 2015-06-15 03:23:00 17
17 2015-06-19 05:37:00 18
18 2015-06-23 13:41:00 19
19 2015-06-27 15:43:00 20
In [13]: tempDF['date'] = tempDF['date'].apply(lambda x:x.replace(hour=12,minute=0))
In [14]: tempDF
Out[14]:
date id
0 2015-02-04 12:00:00 1
1 2015-06-04 12:00:00 2
2 2015-09-04 12:00:00 3
3 2015-12-04 12:00:00 4
4 2015-04-15 12:00:00 5
5 2015-04-21 12:00:00 6
6 2015-04-29 12:00:00 7
7 2015-04-05 12:00:00 8
8 2015-06-05 12:00:00 9
9 2015-10-05 12:00:00 10
10 2015-12-05 12:00:00 11
11 2015-05-19 12:00:00 12
12 2015-05-27 12:00:00 13
13 2015-01-06 12:00:00 14
14 2015-04-06 12:00:00 15
15 2015-10-06 12:00:00 16
16 2015-06-15 12:00:00 17
17 2015-06-19 12:00:00 18
18 2015-06-23 12:00:00 19
19 2015-06-27 12:00:00 20
Timing information
In [52]: df = pd.DataFrame([[datetime.datetime.now()] for _ in range(100000)],columns=['date'])
In [54]: %%timeit
....: df['date'].dt.normalize() + datetime.timedelta(hours=12)
....:
The slowest run took 12.53 times longer than the fastest. This could mean that an intermediate result is being cached
1 loops, best of 3: 32.3 ms per loop
In [57]: %%timeit
....: df['date'].apply(lambda x:x.replace(hour=12,minute=0))
....:
1 loops, best of 3: 1.09 s per loop
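For completeness, the same normalize-and-add pattern also works with pandas' own Timedelta, so no datetime import is needed; a short sketch:

```python
import pandas as pd

s = pd.to_datetime(pd.Series(["2015-04-02 02:34", "2015-04-06 12:34"]))

# Drop the time-of-day, then add a fixed 12-hour offset
noon = s.dt.normalize() + pd.Timedelta(hours=12)
```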
Here's the solution I used to replace the time component of the datetime values in a Pandas DataFrame. Not sure how efficient this solution is, but it fit my needs.
import pandas as pd
# Create a list of EOCY dates for a specified period
sDate = pd.Timestamp('2022-01-31 23:59:00')
eDate = pd.Timestamp('2060-01-31 23:59:00')
dtList = pd.date_range(sDate, eDate, freq='Y').to_pydatetime()
# Create a DataFrame with a single column called 'Date' and fill the rows with the list of EOCY dates.
df = pd.DataFrame({'Date': dtList})
# Loop through the DataFrame rows, using replace to zero out the hours and minutes of each date value.
for i in range(df.shape[0]):
    df.iloc[i, 0] = df.iloc[i, 0].replace(hour=0, minute=0)
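A vectorized alternative to the row loop (equivalent here, since the seconds of the EOCY timestamps are already zero) is dt.normalize, which drops the whole time-of-day in one call:

```python
import pandas as pd

df = pd.DataFrame({"Date": pd.to_datetime(["2022-12-31 23:59:00", "2023-12-31 23:59:00"])})

# One vectorized call instead of replace(hour=0, minute=0) per row
df["Date"] = df["Date"].dt.normalize()
```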

Remove seconds from date Pandas

I have a dataframe that contains a column with a date (StartTime) in the following format: 28-7-2015 0:09:00. The same dataframe also contains a column with a number of seconds (SetupDuration1).
I would like to create a new column that subtracts the number of seconds from the date field:
dftask['Start'] = dftask['StartTime'] - dftask['SetupDuration1']
The SetupDuration1 column is a numeric column and must stay numeric because I do different operations on it (take the absolute value, etc.).
So how do I subtract the number of seconds in the correct way?
Apply a lambda to convert to timedelta and then subtract:
In [88]:
df = pd.DataFrame({'StartTime':pd.date_range(start=dt.datetime(2015,1,1), end = dt.datetime(2015,2,1)), 'SetupDuration1':np.random.randint(0, 59, size=32)})
df
Out[88]:
SetupDuration1 StartTime
0 14 2015-01-01
1 55 2015-01-02
2 21 2015-01-03
3 50 2015-01-04
4 21 2015-01-05
5 6 2015-01-06
6 6 2015-01-07
7 2 2015-01-08
8 10 2015-01-09
9 3 2015-01-10
10 11 2015-01-11
11 32 2015-01-12
12 53 2015-01-13
13 45 2015-01-14
14 48 2015-01-15
15 23 2015-01-16
16 7 2015-01-17
17 5 2015-01-18
18 18 2015-01-19
19 26 2015-01-20
20 48 2015-01-21
21 8 2015-01-22
22 58 2015-01-23
23 24 2015-01-24
24 47 2015-01-25
25 10 2015-01-26
26 32 2015-01-27
27 26 2015-01-28
28 36 2015-01-29
29 36 2015-01-30
30 40 2015-01-31
31 18 2015-02-01
In [94]:
df['Start'] = df['StartTime'] - df['SetupDuration1'].apply(lambda x: pd.Timedelta(x, 's'))
df
Out[94]:
SetupDuration1 StartTime Start
0 14 2015-01-01 2014-12-31 23:59:46
1 55 2015-01-02 2015-01-01 23:59:05
2 21 2015-01-03 2015-01-02 23:59:39
3 50 2015-01-04 2015-01-03 23:59:10
4 21 2015-01-05 2015-01-04 23:59:39
5 6 2015-01-06 2015-01-05 23:59:54
6 6 2015-01-07 2015-01-06 23:59:54
7 2 2015-01-08 2015-01-07 23:59:58
8 10 2015-01-09 2015-01-08 23:59:50
9 3 2015-01-10 2015-01-09 23:59:57
10 11 2015-01-11 2015-01-10 23:59:49
11 32 2015-01-12 2015-01-11 23:59:28
12 53 2015-01-13 2015-01-12 23:59:07
13 45 2015-01-14 2015-01-13 23:59:15
14 48 2015-01-15 2015-01-14 23:59:12
15 23 2015-01-16 2015-01-15 23:59:37
16 7 2015-01-17 2015-01-16 23:59:53
17 5 2015-01-18 2015-01-17 23:59:55
18 18 2015-01-19 2015-01-18 23:59:42
19 26 2015-01-20 2015-01-19 23:59:34
20 48 2015-01-21 2015-01-20 23:59:12
21 8 2015-01-22 2015-01-21 23:59:52
22 58 2015-01-23 2015-01-22 23:59:02
23 24 2015-01-24 2015-01-23 23:59:36
24 47 2015-01-25 2015-01-24 23:59:13
25 10 2015-01-26 2015-01-25 23:59:50
26 32 2015-01-27 2015-01-26 23:59:28
27 26 2015-01-28 2015-01-27 23:59:34
28 36 2015-01-29 2015-01-28 23:59:24
29 36 2015-01-30 2015-01-29 23:59:24
30 40 2015-01-31 2015-01-30 23:59:20
31 18 2015-02-01 2015-01-31 23:59:42
Timings
Actually it looks quicker to just construct a TimedeltaIndex in place:
In [99]:
%timeit df['Start'] = df['StartTime'] - pd.TimedeltaIndex(df['SetupDuration1'], unit='s')
1000 loops, best of 3: 837 µs per loop
In [100]:
%timeit df['Start'] = df['StartTime'] - df['SetupDuration1'].apply(lambda x: pd.Timedelta(x, 's'))
100 loops, best of 3: 1.97 ms per loop
So I'd just do:
In [101]:
df['Start'] = df['StartTime'] - pd.TimedeltaIndex(df['SetupDuration1'], unit='s')
df
Out[101]:
SetupDuration1 StartTime Start
0 14 2015-01-01 2014-12-31 23:59:46
1 55 2015-01-02 2015-01-01 23:59:05
2 21 2015-01-03 2015-01-02 23:59:39
3 50 2015-01-04 2015-01-03 23:59:10
4 21 2015-01-05 2015-01-04 23:59:39
5 6 2015-01-06 2015-01-05 23:59:54
6 6 2015-01-07 2015-01-06 23:59:54
7 2 2015-01-08 2015-01-07 23:59:58
8 10 2015-01-09 2015-01-08 23:59:50
9 3 2015-01-10 2015-01-09 23:59:57
10 11 2015-01-11 2015-01-10 23:59:49
11 32 2015-01-12 2015-01-11 23:59:28
12 53 2015-01-13 2015-01-12 23:59:07
13 45 2015-01-14 2015-01-13 23:59:15
14 48 2015-01-15 2015-01-14 23:59:12
15 23 2015-01-16 2015-01-15 23:59:37
16 7 2015-01-17 2015-01-16 23:59:53
17 5 2015-01-18 2015-01-17 23:59:55
18 18 2015-01-19 2015-01-18 23:59:42
19 26 2015-01-20 2015-01-19 23:59:34
20 48 2015-01-21 2015-01-20 23:59:12
21 8 2015-01-22 2015-01-21 23:59:52
22 58 2015-01-23 2015-01-22 23:59:02
23 24 2015-01-24 2015-01-23 23:59:36
24 47 2015-01-25 2015-01-24 23:59:13
25 10 2015-01-26 2015-01-25 23:59:50
26 32 2015-01-27 2015-01-26 23:59:28
27 26 2015-01-28 2015-01-27 23:59:34
28 36 2015-01-29 2015-01-28 23:59:24
29 36 2015-01-30 2015-01-29 23:59:24
30 40 2015-01-31 2015-01-30 23:59:20
31 18 2015-02-01 2015-01-31 23:59:42
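A third option in the same spirit: pd.to_timedelta converts the numeric column without looping and, like the answers above, leaves SetupDuration1 numeric; a sketch with a couple of hypothetical rows:

```python
import pandas as pd

df = pd.DataFrame({
    "StartTime": pd.to_datetime(["2015-07-28 00:09:00", "2015-07-28 10:00:00"]),
    "SetupDuration1": [540, 30],
})

# Convert seconds to a timedelta series, then subtract elementwise
df["Start"] = df["StartTime"] - pd.to_timedelta(df["SetupDuration1"], unit="s")
```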
