Vectorized Operations on a datetime column in pandas - python

I want to take a column of datetime objects and return a column of integers that are "days from that datetime until today". I can do it in an ugly way, looking for a prettier (and faster) way.
So suppose I have a dataframe with a datetime column like so:
11 2014-03-04 17:16:26+00:00
12 2014-03-10 01:35:56+00:00
13 2014-03-15 02:35:51+00:00
14 2014-03-20 05:55:47+00:00
15 2014-03-26 04:56:33+00:00
Name: datetime, dtype: object
And each element looks like:
datetime.datetime(2014, 3, 4, 17, 16, 26, tzinfo=<UTC>)
Suppose I want to calculate how many days ago each observation occurred, and return that as a simple integer. I know I can just use apply twice, but is there a vectorized/cleaner way to do it?
today = datetime.datetime.today().date()
df_dates = df['datetime'].apply(lambda x: x.date())
days_ago = today - df_dates
Which gives a timedelta64[ns] Series.
11 56 days, 00:00:00
12 50 days, 00:00:00
13 45 days, 00:00:00
14 40 days, 00:00:00
15 34 days, 00:00:00
Name: datetime, dtype: timedelta64[ns]
And then finally if I want it as an integer:
days_ago_as_int = days_ago.apply(lambda x: x.item().days)
days_ago_as_int
11 56
12 50
13 45
14 40
15 34
Name: datetime, dtype: int64
Any thoughts?
Related questions that didn't quite get at what I was asking:
Pandas Python- can datetime be used with vectorized inputs
Pandas add one day to column
Trying Karl D's answer, I'm successfully able to get today's date and the date column as desired, but something goes awry in the subtraction (different datetimes than in the original example, but shouldn't matter, right?):
converted_dates = df['date'].values.astype('datetime64[D]')
today_date = np.datetime64(dt.date.today())
print converted_dates
print today_date
print today_date - converted_dates
[2014-01-16 00:00:00
2014-01-19 00:00:00
2014-01-22 00:00:00
2014-01-26 00:00:00
2014-01-29 00:00:00]
2014-04-30 00:00:00
[16189 days, 0:08:20.637994
16189 days, 0:08:20.637991
16189 days, 0:08:20.637988
16189 days, 0:08:20.637984
16189 days, 0:08:20.637981]
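Update: this looks like it may have been a quirk of the NumPy/pandas versions involved rather than the approach itself; the "Much later Edit" workaround at the end of the answer below sidesteps it by letting pandas handle the conversion.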

How about (for a column named date)?
import datetime as dt
df['foo'] = (np.datetime64(dt.date.today())
- df['date'].values.astype('datetime64[D]'))
print df
date foo
0 2014-03-04 17:16:26 56 days
1 2014-03-10 01:35:56 50 days
2 2014-03-15 02:35:51 45 days
3 2014-03-20 05:55:47 40 days
4 2014-03-26 04:56:33 34 days
Or if you wanted it as an int:
df['foo'] = (np.datetime64(dt.date.today())
- df['date'].values.astype('datetime64[D]')).astype(int)
print df
date foo
0 2014-03-04 17:16:26 56
1 2014-03-10 01:35:56 50
2 2014-03-15 02:35:51 45
3 2014-03-20 05:55:47 40
4 2014-03-26 04:56:33 34
Or if it were an index:
print np.datetime64(dt.date.today()) - df.index.values.astype('datetime64[D]')
[56 50 45 40 34]
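On recent pandas you can also skip the NumPy detour entirely with the .dt accessor; a minimal sketch, assuming df['date'] is a tz-naive datetime64[ns] column:
import pandas as pd
today = pd.Timestamp.today().normalize()  # midnight today, as a Timestamp
df['foo'] = (today - df['date'].dt.normalize()).dt.days  # whole days, as int64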
Much later Edit: How about this for a workaround?
>>> print df
date
0 2014-03-04 17:16:26
1 2014-03-10 01:35:56
2 2014-03-15 02:35:51
3 2014-03-20 05:55:47
4 2014-03-26 04:56:33
Try assigning today's date to a column so it gets converted to a datetime64 column by pandas and then do the arithmetic:
>>> df['today'] = dt.date.today()
>>> df['foo'] = (df['today'].values.astype('datetime64[D]')
- df['date'].values.astype('datetime64[D]'))
>>> print df
date today foo
0 2014-03-04 17:16:26 2014-05-14 71 days
1 2014-03-10 01:35:56 2014-05-14 65 days
2 2014-03-15 02:35:51 2014-05-14 60 days
3 2014-03-20 05:55:47 2014-05-14 55 days
4 2014-03-26 04:56:33 2014-05-14 49 days
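(The trick here is that assigning dt.date.today() to a column makes pandas broadcast it to every row and store it in a datetime64 column, so the subsequent astype and subtraction operate on data pandas has already normalized, rather than going through the raw NumPy scalar path that misbehaved above.)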

Related

I am trying to return just the number of days between the deadline and the launch date. However, it is returning the hours, minutes, and seconds as well.

Here is the code I have so far:
import pandas as pd
df = pd.read_csv('/content/drive/MyDrive/Colab Datasets/KickstarterRevised.csv')
df['deadline'] = pd.to_datetime(df['deadline'])
df['launched'] = pd.to_datetime(df['launched'])
df['difference'] = df['deadline'].sub(df['launched'], axis=0)
df['difference']
0 58 days 23:24:00
1 45 days 00:00:00
2 30 days 01:00:00
3 55 days 16:25:00
4 35 days 00:00:00
...
4994 40 days 00:00:00
4995 8 days 10:50:00
4996 38 days 18:53:00
4997 30 days 00:00:00
4998 30 days 00:00:00
Name: difference, Length: 4999, dtype: timedelta64[ns]
As you can see from your output, df['difference'] is a Series with dtype timedelta64[ns]. To get just the days, use .astype('timedelta64[D]'), as below:
df['difference'] = df['deadline'].sub(df['launched'], axis=0).astype('timedelta64[D]')
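One caveat, hedged: on pandas 2.x that astype can raise, since casting timedelta64[ns] to day resolution is no longer supported there. The .dt.days accessor is a version-stable alternative that yields plain integers:
df['difference'] = (df['deadline'] - df['launched']).dt.days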

Converting object into time and grouping/summarizing time (H/M/S) into 24 hours

I subsetted a big dataframe, slicing out a single column, Start Time, of type object:
test = taxi_2020['Start Time']
Got a column
0 00:15:00
1 00:15:00
2 00:15:00
3 00:15:00
4 00:15:00
...
4137289 00:00:00
4137290 00:00:00
4137291 00:00:00
4137292 00:00:00
4137293 00:00:00
Name: Start Time, Length: 4137294, dtype: object
Then I grouped and summarized it by count (to the best of my knowledge):
test.value_counts().sort_index().reset_index()
and got two columns
index Start Time
0 00:00:00 24005
1 00:15:00 22815
2 00:30:00 20438
3 00:45:00 19012
4 01:00:00 18082
... ... ...
91 22:45:00 32365
92 23:00:00 31815
93 23:15:00 29582
94 23:30:00 26903
95 23:45:00 24599
I'm not sure why this index column appeared, and I failed to rename or convert it.
What would I like to see? The time grouped by hour (24-hour format is fine); the data appears to be counted every 15 minutes, so basically every four consecutive rows should be combined, with 00:15:00 falling in hour 0 and 23:00:00 in hour 23.
My ideal output:
Hour Rides
0 34000
1 60000
2 30000
3 40000
Afterwards, I would like to create a simple histogram showing the occurrences by hour.
Appreciate any help!
IIUC,
import pandas as pd
import numpy as np

# Create a dummy input dataframe
test = pd.DataFrame({'time': pd.date_range('2020-06-01', '2020-06-01 23:59:00', freq='15T').strftime('%H:%M:%S'),
                     'rides': np.random.randint(15000, 28000, 96)})
Let's create a DatetimeIndex from the strings, resample, aggregate with sum, and convert the DatetimeIndex to hours:
test2 = (test.set_index(pd.to_datetime(test['time'], format='%H:%M:%S'))
.rename_axis('hour').resample('H').sum())
test2.index = test2.index.hour
test2.reset_index()
Output:
hour rides
0 0 74241
1 1 87329
2 2 76933
3 3 86208
4 4 88002
5 5 82618
6 6 82188
7 7 81203
8 8 78591
9 9 95592
10 10 99778
11 11 85294
12 12 93931
13 13 80490
14 14 84181
15 15 71786
16 16 90962
17 17 96568
18 18 85646
19 19 88324
20 20 83595
21 21 89284
22 22 72061
23 23 74057
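If you'd rather skip the resample, grouping on the parsed hour gives the same totals in one chain; a minimal sketch against the same dummy test frame:
test3 = (test.groupby(pd.to_datetime(test['time'], format='%H:%M:%S').dt.hour)['rides']
         .sum().rename_axis('hour').reset_index())
From there, test3.plot.bar(x='hour', y='rides') draws the by-hour histogram the question asks for.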
Step by step, I found the answer myself.
Using this code, I renamed the columns:
test2 = (test.value_counts().sort_index().reset_index()
         .rename(columns={'index': 'Time', 'Start Time': 'Rides'}))
The remaining question was how to summarize by the hour.
After applying
test2['hour'] = pd.to_datetime(test2['Time'], format='%H:%M:%S').dt.hour
test2
I came closer
Finally, I grouped by hour value
test3 = test2.groupby('hour', as_index=False).agg({"Rides": "sum"})
print(test3)
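For reference, the whole self-answer collapses into one chain; a sketch assuming the same pandas behavior as in the question, where reset_index on the value counts yields columns named index and Start Time:
test3 = (taxi_2020['Start Time'].value_counts().sort_index().reset_index()
         .rename(columns={'index': 'Time', 'Start Time': 'Rides'})
         .assign(hour=lambda d: pd.to_datetime(d['Time'], format='%H:%M:%S').dt.hour)
         .groupby('hour', as_index=False)['Rides'].sum())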

Correcting Pandas Cumulative Sum on a Timedelta Column

I currently have a line of code I'm using to try to create a column based on a cumulative sum of timedelta data between dates. However, it's not performing the cumulative sum correctly everywhere, and I was also given a warning that my line of Python code won't work in the future.
The original dataset is below:
ID CREATION_DATE TIMEDIFF EDITNUMB
8211 11/26/2019 13:00 1
8211 1/3/2020 9:11 37 days 20:11:09.000000000 1
8211 2/3/2020 14:52 31 days 05:40:57.000000000 1
8211 3/27/2020 15:00 53 days 00:07:49.000000000 1
8211 4/29/2020 12:07 32 days 21:07:23.000000000 1
Here is my line of python code:
df['RECUR'] = df.groupby(['ID']).TIMEDIFF.apply(lambda x: x.shift().fillna(1).cumsum())
Which produces the new column 'RECUR', but it is not cumulatively summing the 'TIMEDIFF' data correctly:
ID CREATION_DATE TIMEDIFF EDITNUMB RECUR
8211 11/26/2019 13:00 1 0 days 00:00:01.000000000
8211 1/3/2020 9:11 37 days 20:11:09.000000000 1 0 days 00:00:02.000000000
8211 2/3/2020 14:52 31 days 05:40:57.000000000 1 37 days 20:11:11.000000000
8211 3/27/2020 15:00 53 days 00:07:49.000000000 1 69 days 01:52:08.000000000
8211 4/29/2020 12:07 32 days 21:07:23.000000000 1 122 days 01:59:57.000000000
Which also produces this warning:
FutureWarning: Passing integers to fillna is deprecated, will raise a TypeError in a future version. To retain the old behavior, pass pd.Timedelta(seconds=n) instead.
Any help on this will be greatly appreciated; the running total should reach 153 days starting from 11/26/19, displayed cumulatively in the 'RECUR' column.
You can do:
# transform('first') would also work
df['RECUR'] = df['CREATION_DATE'] - df.groupby('ID').CREATION_DATE.transform('min')
Output:
ID CREATION_DATE TIMEDIFF EDITNUMB RECUR
0 8211 2019-11-26 13:00:00 NaT 1 0 days 00:00:00
1 8211 2020-01-03 09:11:00 37 days 20:11:00 1 37 days 20:11:00
2 8211 2020-02-03 14:52:00 31 days 05:41:00 1 69 days 01:52:00
3 8211 2020-03-27 15:00:00 53 days 00:08:00 1 122 days 02:00:00
4 8211 2020-04-29 12:07:00 32 days 21:07:00 1 154 days 23:07:00
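Note the design choice: rather than summing the precomputed diffs, this subtracts the group's earliest timestamp from each row's timestamp directly, so the NaT in TIMEDIFF never enters the calculation and the rounding visible in the TIMEDIFF column can't accumulate.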
You can fillna with a timedelta of 0 seconds and then do the cumsum:
df['RECUR'] = df.groupby('ID').TIMEDIFF.apply(
lambda x: x.fillna(pd.Timedelta(seconds=0)).cumsum())
df['RECUR']
# 0 0 days 00:00:00
# 1 37 days 20:11:09
# 2 69 days 01:52:06
# 3 122 days 01:59:55
# 4 154 days 23:07:18
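As for why the original line misbehaved: shift() pushed every difference down one row, and fillna(1) filled the resulting gaps with the integer 1, which pandas interpreted as one second (hence the warning suggesting pd.Timedelta(seconds=n) instead). That is where the stray 00:00:01 and 00:00:02 values came from, and why every cumulative value lagged one row behind the data.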

Pandas Dataframe: difference between all dates for each unique id

In [621]: df = pd.DataFrame({'id':[44,44,44,88,88,90,95],
'Status': ['Reject','Submit','Draft','Accept','Submit',
'Submit','Draft'],
'Datetime': ['2018-11-24 08:56:02',
'2018-10-24 18:12:02','2018-10-24 08:12:02',
'2018-10-29 13:17:02','2018-10-24 10:12:02',
'2018-12-30 08:43:12', '2019-01-24 06:12:02']
}, columns = ['id','Status', 'Datetime'])
df['Datetime'] = pd.to_datetime(df['Datetime'])
df
Out[621]:
id Status Datetime
0 44 Reject 2018-11-24 08:56:02
1 44 Submit 2018-10-24 18:12:02
2 44 Draft 2018-10-24 08:12:02
3 88 Accept 2018-10-29 13:17:02
4 88 Submit 2018-10-24 10:12:02
5 90 Submit 2018-12-30 08:43:12
6 95 Draft 2019-01-24 06:12:02
What I am trying to get is another column, e.g. df['Time in Status'], which is the time that the id spent in that status.
I've looked at df.groupby() but only found answers (such as this one) for working out the difference between two dates (e.g. first and last), regardless of how many dates are in between.
df['Datetime'] = pd.to_datetime(df['Datetime'])
g = df.groupby('id')['Datetime']
print(df.groupby('id')['Datetime'].apply(lambda g: g.iloc[-1] - g.iloc[0]))
id
44 -32 days +23:16:00
88 -6 days +20:55:00
90 0 days 00:00:00
95 0 days 00:00:00
Name: Datetime, dtype: timedelta64[ns]
The closest I've come to getting the result is DataFrameGroupBy.diff
df['Time in Status'] = df.groupby('id')['Datetime'].diff()
df
id Status Datetime Time in Status
0 44 Reject 2018-11-24 08:56:02 NaT
1 44 Submit 2018-10-24 18:12:02 -31 days +09:16:00
2 44 Draft 2018-10-24 08:12:02 -1 days +14:00:00
3 88 Accept 2018-10-29 13:17:02 NaT
4 88 Submit 2018-10-24 10:12:02 -6 days +20:55:00
5 90 Submit 2018-12-30 08:43:12 NaT
6 95 Draft 2019-01-24 06:12:02 NaT
However there are two issues with this. First, how can I do this calculation starting with the earliest date and working through until the end? E.g., in row 2, instead of -1 days +14:00:00 it would be 0 days 10:00:00? Or is this easier to solve by rearranging the order of the data beforehand?
The other issue is the NaT. If there is no date to compare with, then the current day (i.e. datetime.now) would be used. I could apply this afterwards easily enough, but I was wondering if there might be a better solution for finding and replacing all the NaT values.
You're exactly right: it's first necessary to sort with DataFrame.sort_values on both columns:
df = df.sort_values(['id', 'Datetime'])
df['Time in Status'] = df.groupby('id')['Datetime'].diff()
print (df)
id Status Datetime Time in Status
2 44 Draft 2018-10-24 08:12:02 NaT
1 44 Submit 2018-10-24 18:12:02 0 days 10:00:00
0 44 Reject 2018-11-24 08:56:02 30 days 14:44:00
4 88 Submit 2018-10-24 10:12:02 NaT
3 88 Accept 2018-10-29 13:17:02 5 days 03:05:00
5 90 Submit 2018-12-30 08:43:12 NaT
6 95 Draft 2019-01-24 06:12:02 NaT
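That settles the ordering issue. For the second issue (the remaining NaT rows), one hedged reading is that a status with no successor should be measured against the current time; a sketch under that assumption, where each row gets the time until the next status change within its id:
df = df.sort_values(['id', 'Datetime'])
nxt = df.groupby('id')['Datetime'].shift(-1)  # next status change; NaT for the last row of each id
df['Time in Status'] = nxt.fillna(pd.Timestamp.now()) - df['Datetime']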

Pandas dataframe: omit weekends and days near holidays

I have a Pandas dataframe with a DatetimeIndex and some other columns, similar to this:
import pandas as pd
import numpy as np
range = pd.date_range('2017-12-01', '2018-01-05', freq='6H')
df = pd.DataFrame(index = range)
# Average speed in miles per hour
df['value'] = np.random.randint(low=0, high=60, size=len(df.index))
df.info()
# DatetimeIndex: 141 entries, 2017-12-01 00:00:00 to 2018-01-05 00:00:00
# Freq: 6H
# Data columns (total 1 columns):
# value 141 non-null int64
# dtypes: int64(1)
# memory usage: 2.2 KB
df.head(10)
# value
# 2017-12-01 00:00:00 15
# 2017-12-01 06:00:00 54
# 2017-12-01 12:00:00 19
# 2017-12-01 18:00:00 13
# 2017-12-02 00:00:00 35
# 2017-12-02 06:00:00 31
# 2017-12-02 12:00:00 58
# 2017-12-02 18:00:00 6
# 2017-12-03 00:00:00 8
# 2017-12-03 06:00:00 30
How can I select or filter the entries that are:
Weekdays only (that is, not weekend days Saturday or Sunday)
Not within N days of the dates in a list (e.g. U.S. holidays like '12-25' or '01-01')?
I was hoping for something like:
df = exclude_Sat_and_Sun(df)
omit_days = ['12-25', '01-01']
N = 3 # days near the holidays
df = exclude_days_near_omit_days(N, omit_days)
I was thinking of creating a new column to break out the month and day and then comparing them to the criteria for 1 and 2 above. However, I was hoping for something more Pythonic using the DatetimeIndex.
Thanks for any help.
The first part can be easily accomplished using the Pandas DatetimeIndex.dayofweek property, which counts weekdays starting with Monday as 0 and ending with Sunday as 6.
df[df.index.dayofweek < 5] will give you only the weekdays.
For the second part you can use the datetime module. Below I will give an example for only one date, namely 2017-12-25. You can easily generalize it to a list of dates, for example by defining a helper function.
from datetime import datetime, timedelta
N = 3
df[abs(df.index.date - datetime.strptime("2017-12-25", '%Y-%m-%d').date()) > timedelta(N)]
This will give all dates that are more than N=3 days away from 2017-12-25. That is, it will exclude an interval of 7 days from 2017-12-22 to 2017-12-28.
Lastly, you can combine the two criteria using the & operator, as you probably know.
df[
(df.index.dayofweek < 5)
&
(abs(df.index.date - datetime.strptime("2017-12-25", '%Y-%m-%d').date()) > timedelta(N))
]
I followed the answer by @Bahman Engheta and created a function to omit dates from a dataframe.
import pandas as pd
from datetime import datetime, timedelta
def omit_dates(df, list_years, list_dates, omit_days_near=3, omit_weekends=False):
    '''
    Given a Pandas dataframe with a DatetimeIndex, remove rows that have a date
    near a given list of dates and/or a date on a weekend.

    Parameters:
    ----------
    df : Pandas dataframe
    list_years : list of str
        Contains a list of years in string form
    list_dates : list of str
        Contains a list of dates in string form encoded as MM-DD
    omit_days_near : int
        Threshold of days away from list_dates to remove. For example, if
        omit_days_near=3, then omit all days that are 3 days away from
        any date in list_dates.
    omit_weekends : bool
        If True, omit dates that are on weekends.

    Returns:
    -------
    Pandas dataframe
        New resulting dataframe with dates omitted.
    '''
    if not isinstance(df, pd.DataFrame):
        raise ValueError("df is expected to be a Pandas dataframe, not %s" % type(df).__name__)
    if not isinstance(df.index, pd.DatetimeIndex):
        raise ValueError("Dataframe is expected to have an index of DatetimeIndex, not %s" %
                         type(df.index).__name__)
    if not isinstance(list_years, list):
        list_years = [list_years]
    if not isinstance(list_dates, list):
        list_dates = [list_dates]

    result = df.copy()
    if omit_weekends:
        result = result.loc[result.index.dayofweek < 5]
    omit = ['%s-%s' % (year, date) for year in list_years for date in list_dates]
    for date in omit:
        result = result.loc[abs(result.index.date - datetime.strptime(date, '%Y-%m-%d').date())
                            > timedelta(omit_days_near)]
    return result
Here is example usage. Suppose you have a dataframe that has a DateTimeIndex and other columns, like this:
import pandas as pd
import numpy as np
range = pd.date_range('2017-12-01', '2018-01-05', freq='1D')
df = pd.DataFrame(index = range)
df['value'] = np.random.randint(low=0, high=60, size=len(df.index))
The resulting dataframe looks like this:
value
2017-12-01 42
2017-12-02 35
2017-12-03 49
2017-12-04 25
2017-12-05 19
2017-12-06 28
2017-12-07 21
2017-12-08 57
2017-12-09 3
2017-12-10 57
2017-12-11 46
2017-12-12 20
2017-12-13 7
2017-12-14 5
2017-12-15 30
2017-12-16 57
2017-12-17 4
2017-12-18 46
2017-12-19 32
2017-12-20 48
2017-12-21 55
2017-12-22 52
2017-12-23 45
2017-12-24 34
2017-12-25 42
2017-12-26 33
2017-12-27 17
2017-12-28 2
2017-12-29 2
2017-12-30 51
2017-12-31 19
2018-01-01 6
2018-01-02 43
2018-01-03 11
2018-01-04 45
2018-01-05 45
Now, let's specify dates to remove. I want to remove the dates '12-10', '12-25', '12-31', and '01-01' (following MM-DD notation) and all dates within 2 days of those dates. Further, I want to remove those dates from both the years '2016' and '2017'. I also want to remove weekend dates.
I'll call my function like this:
years = ['2016', '2017']
holiday_dates = ['12-10', '12-25', '12-31', '01-01']
omit_dates(df, years, holiday_dates, omit_days_near=2, omit_weekends=True)
The result is:
value
2017-12-01 42
2017-12-04 25
2017-12-05 19
2017-12-06 28
2017-12-07 21
2017-12-13 7
2017-12-14 5
2017-12-15 30
2017-12-18 46
2017-12-19 32
2017-12-20 48
2017-12-21 55
2017-12-22 52
2017-12-28 2
2018-01-03 11
2018-01-04 45
2018-01-05 45
Is that answer correct? Here are the calendars for December 2017 and January 2018:
December 2017
Su Mo Tu We Th Fr Sa
                1  2
 3  4  5  6  7  8  9
10 11 12 13 14 15 16
17 18 19 20 21 22 23
24 25 26 27 28 29 30
31

January 2018
Su Mo Tu We Th Fr Sa
    1  2  3  4  5  6
 7  8  9 10 11 12 13
14 15 16 17 18 19 20
21 22 23 24 25 26 27
28 29 30 31
Looks like it works.
