Datetime difference between 2 columns with datetime/str - Python

I have a dataset - below
Create Complete
0 2005-01-02 01:15:00 2005-01-05 14:05:00
1 2005-01-06 00:00:00 open
I want to get the difference in minutes between the two using the below code. However as the 'complete' column also contains a string value, how can I get pandas to ign
df['diff_mins'] = df.Create - df.Complete

You can use pd.to_datetime with errors='coerce', which turns values that cannot be parsed (such as 'open') into NaT. For example:
import pandas as pd
df = pd.DataFrame([
['2005-01-02 01:15:00', '2005-01-05 14:05:00'],
['2005-01-06 00:00:00', 'open']],
columns=('Create', 'Complete')
)
and then:
df['diff_mins'] = (
pd.to_datetime(df.Create) - pd.to_datetime(df.Complete, errors='coerce')
)
The diff_mins column above holds raw Timedelta values. To get the value in hours, apply a small lambda function, lambda x: x.total_seconds() / 60 / 60 (drop the last division to get minutes instead):
df['diff_mins_hours'] = (
pd.to_datetime(df.Create) - pd.to_datetime(df.Complete, errors='coerce')
).apply(lambda x: x.total_seconds() / 60 / 60)
This gives you:
print(df)
Create Complete diff_mins diff_mins_hours
0 2005-01-02 01:15:00 2005-01-05 14:05:00 -4 days +11:10:00 -84.833333
1 2005-01-06 00:00:00 open NaT NaN
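Since the question asks for minutes, here is a minimal self-contained sketch of the same errors='coerce' idea that returns float minutes. It assumes the wanted direction is Complete minus Create (so closed rows come out positive); swap the operands if the opposite sign is needed.

```python
import pandas as pd

df = pd.DataFrame(
    [
        ["2005-01-02 01:15:00", "2005-01-05 14:05:00"],
        ["2005-01-06 00:00:00", "open"],
    ],
    columns=["Create", "Complete"],
)

# Non-date strings such as 'open' become NaT instead of raising.
create = pd.to_datetime(df["Create"])
complete = pd.to_datetime(df["Complete"], errors="coerce")

# Timedelta -> float minutes; rows where Complete is NaT end up as NaN.
df["diff_mins"] = (complete - create).dt.total_seconds() / 60
print(df)
```

Using the .dt.total_seconds() accessor keeps the conversion vectorized, so no apply or lambda is needed.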

I tried to do it using map. It should look something like this:
import datetime
def get_diff_mins(elem_a, elem_b):
    # Treat an 'open' entry as if it were completed right now
    if elem_b == 'open':
        elem_b = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    a = elem_a.replace(' ', '-').replace(':', '-').split('-')
    b = elem_b.replace(' ', '-').replace(':', '-').split('-')
    # Roughly converts yearly time to mins
    # since month is always considered 30 days
    # (weights for year, month, day, hour, minute; seconds are ignored)
    f = [60*24*30*12, 60*24*30, 60*24, 60, 1, 0]
    mins_a = sum(int(x) * w for x, w in zip(a, f))
    mins_b = sum(int(x) * w for x, w in zip(b, f))
    return mins_a - mins_b
# map() is lazy in Python 3, so build a list before assigning the column
df['diff_mins'] = list(map(get_diff_mins, df.Create, df.Complete))
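The same "treat open as now" idea can be sketched with pandas datetimes, which avoids the 30-day-month approximation. The helper name diff_minutes and its now parameter are hypothetical, introduced only for this illustration, and the direction is assumed to be Complete minus Create.

```python
import pandas as pd

def diff_minutes(create, complete, now):
    """Minutes from create to complete; unparseable end values fall back to `now`."""
    start = pd.to_datetime(create)
    end = pd.to_datetime(complete, errors="coerce").fillna(now)
    return (end - start).dt.total_seconds() / 60

df = pd.DataFrame(
    [
        ["2005-01-02 01:15:00", "2005-01-05 14:05:00"],
        ["2005-01-06 00:00:00", "open"],
    ],
    columns=["Create", "Complete"],
)

# A fixed "now" keeps the example reproducible; pd.Timestamp.now() would be
# the usual choice in practice.
now = pd.Timestamp("2005-01-06 01:00:00")
df["diff_mins"] = diff_minutes(df["Create"], df["Complete"], now)
print(df)
```

Passing now in as an argument, rather than reading the clock inside the function, makes the result easy to test.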

Related

Convert a digit code into datetime format in a Pandas Dataframe

I have a pandas dataframe that has a column with a 5 digit code that represent a day and time, and it works like following:
1 - The first three digits represent the day;
2 - The last two digits represent the hour:minute:second.
Example1: The first row have the code 19501, so the 195 represent the 1st of January of 2009 and the 01 part represents the time from 00:00:00 to 00:29:59;
Example2: In the second row i have the code 19502 which is the 1st of January of 2009 from 00:30:00 to 00:59:59;
Example3: Another example, 19711 would be the 3rd of January of 2009 from 05:00:00 to 05:29:59;
Example4: The last row is the code 73048, which represent the 20th of June of 2010 from 23:30:00 to 23:59:59.
Any ideas in how can I convert this 5 digit code into a proper datetime format?
I'm assuming your column is numeric.
import datetime as dt
import pandas as pd
df = pd.DataFrame({'code': [19501, 19502, 19711, 73048]})
df['days'] = pd.to_timedelta(df['code']//100, 'D')
df['half-hours'] = df['code']%100
df['hours'] = pd.to_timedelta(df['half-hours']//2, 'h')
df['minutes'] = pd.to_timedelta(df['half-hours']%2*30, 'm')
base_day = dt.datetime(2009, 1, 1) - dt.timedelta(days = 195)
df['dt0'] = base_day + df.days + df.hours + df.minutes - dt.timedelta(minutes = 30)
df['dt1'] = base_day + df.days + df.hours + df.minutes - dt.timedelta(seconds = 1)
A simple solution: add the day count to 2008-06-20 (the date that day code 0 maps to), then add (time-1)*30 minutes:
df = pd.DataFrame({'code': [19501, 19502, 19711, 73048]})
d, t = df['code'].divmod(100)
df['datetime'] = (
    pd.to_timedelta(d, unit='D')
      .add(pd.Timestamp('2008-06-20'))
      .add(pd.to_timedelta((t-1)*30, unit='min'))
)
NB. This gives you the start of the period. For the end, add t*30 minutes instead and then subtract one second.
Output:
code datetime
0 19501 2009-01-01 00:00:00
1 19502 2009-01-01 00:30:00
2 19711 2009-01-03 05:00:00
3 73048 2010-06-20 23:30:00
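Both answers rest on the same arithmetic: the first three digits are a day offset from 2008-06-20 and the last two digits number the half-hour slots from 1. A minimal sketch that returns both ends of each slot follows; the names BASE_DAY and decode_codes are made up for this example, and the column is assumed to be numeric.

```python
import pandas as pd

# Day code 195 corresponds to 2009-01-01 in the question, so code 0 is 2008-06-20.
BASE_DAY = pd.Timestamp("2008-06-20")

def decode_codes(codes):
    """Return the start and end timestamps for 5-digit day/half-hour codes."""
    day, slot = codes // 100, codes % 100
    start = (
        BASE_DAY
        + pd.to_timedelta(day, unit="D")
        + pd.to_timedelta((slot - 1) * 30, unit="min")
    )
    # Each slot covers 30 minutes, ending one second before the next one starts.
    end = start + pd.Timedelta(minutes=30) - pd.Timedelta(seconds=1)
    return pd.DataFrame({"code": codes, "start": start, "end": end})

out = decode_codes(pd.Series([19501, 19502, 19711, 73048]))
print(out)
```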

Pandas dataframe timedelta is giving exceptions

I am trying to get the next month first date based on billDate in a dataframe.
I did this:
import pandas as pd
import datetime
from datetime import timedelta
dt = pd.to_datetime('15/4/2019', errors='coerce')
print(dt)
print((dt.replace(day=1) + datetime.timedelta(days=32)).replace(day=1))
It is working perfectly, and the output is :
2019-04-15 00:00:00
2019-05-01 00:00:00
Now, I am applying same logic in my dataframe in the below code
df[comNewColName] = (pd.to_datetime(df['billDate'], errors='coerce').replace(day=1) + datetime.timedelta(days=32)).replace(day=1)
But I am getting error like this:
---> 69 df[comNewColName] = (pd.to_datetime(df['billDate'], errors='coerce').replace(day=1) + datetime.timedelta(days=32)).replace(day=1)
70 '''print(df[['billDate']])'''
71 '''df = df.assign(Product=lambda x: (x['Field_1'] * x['Field_2'] * x['Field_3']))'''
TypeError: replace() got an unexpected keyword argument 'day'
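The call fails because Series.replace substitutes values and has no day keyword, unlike Timestamp.replace. You can use Series.dt.to_period for month periods, add 1 for the next month, and then convert back to datetimes with Series.dt.to_timestamp: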
print (df)
billDate
0 15/4/2019
1 30/4/2019
2 15/8/2019
df['billDate'] = (pd.to_datetime(df['billDate'], errors='coerce', dayfirst=True)
                    .dt.to_period('M')
                    .add(1)
                    .dt.to_timestamp())
print (df)
billDate
0 2019-05-01
1 2019-05-01
2 2019-09-01
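As an alternative sketch (not the approach used in the answer above), a date offset can do the same job in one step. pd.offsets.MonthBegin(1) moves each date forward to the next first-of-month, and a date that already falls on the 1st also moves to the following month. The sample values and the variable names bills and next_month_start are illustrative only.

```python
import pandas as pd

bills = pd.Series(["15/4/2019", "30/4/2019", "15/8/2019", "1/12/2019"])
dates = pd.to_datetime(bills, errors="coerce", dayfirst=True)

# Vectorized: every date rolls forward to the first day of the next month.
# Unparseable values (NaT) would simply stay NaT.
next_month_start = dates + pd.offsets.MonthBegin(1)
print(next_month_start)
```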

average time being active

I have a JSON array in which each entry has an id, a start time and an end time. I want to calculate the average time a user is active. Some entries may have only a start time and no end time.
Example data -
data = [{"id":1, "stime":"2020-09-21T06:25:36Z","etime": "2020-09-22T09:25:36Z"},{"id":2, "stime":"2020-09-22T02:24:36Z","etime": "2020-09-23T07:25:36Z"},{"id":3, "stime":"2020-09-20T06:25:36Z","etime": "2020-09-24T09:25:36Z"},{"id":4, "stime":"2020-09-23T06:25:36Z","etime": "2020-09-29T09:25:36Z"}]
My approach is to take the difference between the start time and the end time of each entry, sum all the differences, and divide by the total number of ids.
sample code:
import datetime
from datetime import timedelta
import dateutil.parser
datetimeFormat = '%Y-%m-%d %H:%M:%S.%f'
date_s_time = '2020-09-21T06:25:36Z'
date_e_time = '2020-09-22T09:25:36Z'
d1 = dateutil.parser.parse(date_s_time)
d2 = dateutil.parser.parse(date_e_time)
diff1 = datetime.datetime.strptime(d2.strftime('%Y-%m-%d %H:%M:%S.%f'), datetimeFormat)\
- datetime.datetime.strptime(d1.strftime('%Y-%m-%d %H:%M:%S.%f'), datetimeFormat)
print("Difference 1:", diff1)
date_s_time2 = '2020-09-20T06:25:36Z'
date_e_time2 = '2020-09-28T02:25:36Z'
d3 = dateutil.parser.parse(date_s_time2)
d4 = dateutil.parser.parse(date_e_time2)
diff2 = datetime.datetime.strptime(d4.strftime('%Y-%m-%d %H:%M:%S.%f'), datetimeFormat)\
- datetime.datetime.strptime(d3.strftime('%Y-%m-%d %H:%M:%S.%f'), datetimeFormat)
print("Difference 2:", diff2)
print("total", diff1+diff2)
print((diff1 + diff2) / 2)
Please suggest a better, more efficient approach.
You could use the pandas library.
import pandas as pd
data = [{"id":1, "stime":"2020-09-21T06:25:36Z","etime": "2020-09-22T09:25:36Z"},{"id":1, "stime":"2020-09-22T02:24:36Z","etime": "2020-09-23T07:25:36Z"},{"id":1, "stime":"2020-09-20T06:25:36Z","etime": "2020-09-24T09:25:36Z"},{"id":1, "stime":"2020-09-23T06:25:36Z"}]
(Let's say your last row has no end time)
Now, you can create a Pandas DataFrame using your data
df = pd.DataFrame(data)
df looks like so:
id stime etime
0 1 2020-09-21T06:25:36Z 2020-09-22T09:25:36Z
1 1 2020-09-22T02:24:36Z 2020-09-23T07:25:36Z
2 1 2020-09-20T06:25:36Z 2020-09-24T09:25:36Z
3 1 2020-09-23T06:25:36Z NaN
Now we want to convert the strings in stime and etime to datetime objects and fill the NaNs with something sensible. If no end time exists, one option is to use the current time:
from datetime import datetime, timezone
import dateutil.parser
df = df.fillna(datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ'))
df['etime'] = df['etime'].map(dateutil.parser.parse)
df['stime'] = df['stime'].map(dateutil.parser.parse)
Or, if you would rather drop the rows that don't have an etime, call this in place of the fillna line:
df = df.dropna()
Now df becomes:
id stime etime
0 1 2020-09-21 06:25:36+00:00 2020-09-22 09:25:36+00:00
1 1 2020-09-22 02:24:36+00:00 2020-09-23 07:25:36+00:00
2 1 2020-09-20 06:25:36+00:00 2020-09-24 09:25:36+00:00
3 1 2020-09-23 06:25:36+00:00 2020-09-24 20:05:42+00:00
Finally, subtract the two:
df['tdiff'] = df['etime'] - df['stime']
and we get:
id stime etime tdiff
0 1 2020-09-21 06:25:36+00:00 2020-09-22 09:25:36+00:00 1 days 03:00:00
1 1 2020-09-22 02:24:36+00:00 2020-09-23 07:25:36+00:00 1 days 05:01:00
2 1 2020-09-20 06:25:36+00:00 2020-09-24 09:25:36+00:00 4 days 03:00:00
3 1 2020-09-23 06:25:36+00:00 2020-09-24 20:05:42+00:00 1 days 13:40:06
The mean of this column is:
df['tdiff'].mean()
Output: Timedelta('2 days 00:10:16.500000')
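For comparison, here is a minimal sketch that lets pandas do the parsing with pd.to_datetime and leaves sessions without an end time out of the average. That choice is an assumption; filling the gaps with the current time, as above, is the other option. The variable name mean_closed is made up for the example.

```python
import pandas as pd

data = [
    {"id": 1, "stime": "2020-09-21T06:25:36Z", "etime": "2020-09-22T09:25:36Z"},
    {"id": 2, "stime": "2020-09-22T02:24:36Z", "etime": "2020-09-23T07:25:36Z"},
    {"id": 3, "stime": "2020-09-20T06:25:36Z", "etime": "2020-09-24T09:25:36Z"},
    {"id": 4, "stime": "2020-09-23T06:25:36Z"},  # still active, no etime
]
df = pd.DataFrame(data)

# Parse each column in one call; a missing etime becomes NaT.
df["stime"] = pd.to_datetime(df["stime"], utc=True)
df["etime"] = pd.to_datetime(df["etime"], utc=True)

# mean() skips NaT by default, so open sessions are left out of the average.
mean_closed = (df["etime"] - df["stime"]).mean()
print(mean_closed)
```

With three closed sessions the average here is taken over those three rows only, not over all four ids.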

Negative time duration in Pandas

I have a dataset with two columns: Actual Time and Promised Time (representing the actual and promised start times of some process).
For example:
import pandas as pd
example_df = pd.DataFrame(columns = ['Actual Time', 'Promised Time'],
data = [
('2016-6-10 9:00', '2016-6-10 9:00'),
('2016-6-15 8:52', '2016-6-15 9:52'),
('2016-6-19 8:54', '2016-6-19 9:02')]).applymap(pd.Timestamp)
So as we can see, sometimes Actual Time = Promised Time, but there are also cases where Actual Time < Promised Time.
I defined a column that shows the difference between these two columns (example_df['Actual Time']-example_df['Promised Time']), but the problem is that for the third row it returned -1 day +23:52:00 instead of - 00:08:00.
Sample:
print (df)
Actual Time Promised Time
0 2016-6-10 9:00 2016-6-10 9:00
1 2016-6-15 10:52 2016-6-15 9:52 <- changed datetimes
2 2016-6-19 8:54 2016-6-19 9:02
def format_timedelta(x):
    ts = x.total_seconds()
    if ts >= 0:
        hours, remainder = divmod(ts, 3600)
        minutes, seconds = divmod(remainder, 60)
        return ('{}:{:02d}:{:02d}').format(int(hours), int(minutes), int(seconds))
    else:
        hours, remainder = divmod(-ts, 3600)
        minutes, seconds = divmod(remainder, 60)
        return ('-{}:{:02d}:{:02d}').format(int(hours), int(minutes), int(seconds))
First create datetimes:
df['Actual Time'] = pd.to_datetime(df['Actual Time'])
df['Promised Time'] = pd.to_datetime(df['Promised Time'])
And then timedeltas:
df['diff'] = (df['Actual Time'] - df['Promised Time'])
Converting negative timedeltas to seconds with Series.dt.total_seconds works nicely:
df['diff1'] = df['diff'].dt.total_seconds()
If you want negative timedeltas as strings, use a custom function, because strftime is not implemented for timedeltas:
df['diff2'] = df['diff'].apply(format_timedelta)
print (df)
Actual Time Promised Time diff diff1 diff2
0 2016-06-10 09:00:00 2016-06-10 09:00:00 00:00:00 0.0 0:00:00
1 2016-06-15 10:52:00 2016-06-15 09:52:00 01:00:00 3600.0 1:00:00
2 2016-06-19 08:54:00 2016-06-19 09:02:00 -1 days +23:52:00 -480.0 -0:08:00
I assume your dataframe is already in datetime dtype. abs works just fine if you only need the size of the difference (it drops the sign):
Without abs
df['Actual Time'] - df['Promised Time']
Out[526]:
0 00:00:00
1 -1 days +23:00:00
2 -1 days +23:52:00
dtype: timedelta64[ns]
With abs
abs(df['Promised Time'] - df['Actual Time'])
Out[529]:
0 00:00:00
1 01:00:00
2 00:08:00
dtype: timedelta64[ns]
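If the sign needs to survive in a readable string, the sign and the magnitude can be handled separately without apply. This is a sketch under two assumptions: there are no NaT values and the differences are whole seconds. The column name diff_str and the helper variables are invented for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "Actual Time": pd.to_datetime(
        ["2016-06-10 09:00", "2016-06-15 08:52", "2016-06-19 08:54"]),
    "Promised Time": pd.to_datetime(
        ["2016-06-10 09:00", "2016-06-15 09:52", "2016-06-19 09:02"]),
})

diff = df["Actual Time"] - df["Promised Time"]
secs = diff.dt.total_seconds().astype(int)      # signed whole seconds
sign = secs.lt(0).map({True: "-", False: ""})   # "-" only for negative rows
mag = secs.abs()                                # magnitude, formatted below

# Build H:MM:SS from the magnitude and prepend the sign.
df["diff_str"] = (
    sign
    + (mag // 3600).astype(str)
    + ":" + ((mag % 3600) // 60).astype(str).str.zfill(2)
    + ":" + (mag % 60).astype(str).str.zfill(2)
)
print(df)
```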
The result of the subtraction has timedelta dtype, which is stored at nanosecond resolution by default.
You need to convert the result to the unit you want:
import pandas as pd
df=pd.DataFrame(data={
'Actual Time':['2016-6-10 9:00','2016-6-15 8:52','2016-6-19 8:54'],
'Promised Time':['2016-6-10 9:00','2016-6-15 9:52','2016-6-19 9:02']
},dtype='datetime64[ns]')
# older pandas allowed .astype('timedelta64[m]') here; pandas 2.x only accepts
# s, ms, us and ns, so go through total_seconds() to get signed minutes
df['diff']=(df['Actual Time']-df['Promised Time']).dt.total_seconds() / 60

Calculate datetime difference in years, months, etc. in a new pandas dataframe column

I have a pandas dataframe looking like this:
Name start end
A 2000-01-10 1970-04-29
I want to add a new column providing the difference between the start and end column in years, months, days.
So the result should look like:
Name start end diff
A 2000-01-10 1970-04-29 29y9m etc.
the diff column may also be a datetime object or a timedelta object, but the key point for me is, that I can easily get the Year and Month out of it.
What I tried until now is:
df['diff'] = df['end'] - df['start']
This results in the new column containing 10848 days. However, I do not know how to convert the days to 29y9m etc.
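You can create a new column with the difference in years. Recent pandas versions reject np.timedelta64(1, 'Y') because a year has no fixed length, so divide by an average year instead:
df['diff_year'] = df['diff'] / pd.Timedelta(days=365.25)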
Pretty much straightforward with relativedelta:
from dateutil import relativedelta
>> end start
>> 0 1970-04-29 2000-01-10
for i in df.index:
    # .ix was removed from pandas; .loc does the same label-based lookup
    df.at[i, 'diff'] = relativedelta.relativedelta(df.loc[i, 'start'], df.loc[i, 'end'])
>> end start diff
>> 0 1970-04-29 2000-01-10 relativedelta(years=+29, months=+8, days=+12)
A much simpler way is to use the date_range function and take the length of the result, which counts the month ends between the two dates:
startdt=pd.to_datetime('2017-01-01')
enddt = pd.to_datetime('2018-01-01')
len(pd.date_range(start=startdt,end=enddt,freq='M'))
With a simple function you can reach your goal.
The function approximates the year and month difference from the number of days, assuming 365-day years and 30-day months.
import pandas as pd
import datetime
def parse_date(td):
    resYear = float(td.days) / 365.0  # number of years, including the fractional part
    resMonth = int((resYear - int(resYear)) * 365 / 30)  # fractional year -> days -> 30-day months
    resYear = int(resYear)
    return str(resYear) + "Y" + str(resMonth) + "m"
df = pd.DataFrame([("2000-01-10", "1970-04-29")], columns=["start", "end"])
df["delta"] = [parse_date(datetime.datetime.strptime(start, '%Y-%m-%d') - datetime.datetime.strptime(end, '%Y-%m-%d')) for start, end in zip(df["start"], df["end"])]
print(df)
start end delta
0 2000-01-10 1970-04-29 29Y8m
I think this is the most 'pandas' way to do it, without using any for loops or defining external functions:
>>> from datetime import datetime
>>> df = pd.DataFrame({'Name': ['A'], 'start': [datetime(2000, 1, 10)], 'end': [datetime(1970, 4, 29)]})
>>> df['diff'] = list(map(lambda td: datetime(1, 1, 1) + td, df['start'] - df['end']))
>>> df['diff'] = df['diff'].apply(lambda d: '{0}y{1}m'.format(d.year - 1, d.month - 1))
>>> df
Name end start diff
0 A 1970-04-29 2000-01-10 29y8m
I had to use map instead of apply because of pandas' timedelta64, which doesn't allow a simple addition to a datetime object.
You can try the following function to calculate the difference -
def yearmonthdiff(row):
    s = row['start']
    e = row['end']
    y = s.year - e.year
    m = s.month - e.month
    d = s.day - e.day
    if m < 0:
        y = y - 1
        m = m + 12
    if m == 0:
        if d < 0:
            m = m - 1
        elif d == 0:
            # same day of month: compare the time of day in seconds
            s1 = s.hour*3600 + s.minute*60 + s.second
            s2 = e.hour*3600 + e.minute*60 + e.second
            if s1 < s2:
                m = m - 1
    return '{}y{}m'.format(y, m)
Where row is the dataframe row . I am assuming your start and end columns are datetime objects. Then you can use DataFrame.apply() function to apply it to each row.
df
Out[92]:
start end
0 2000-01-10 00:00:00.000000 1970-04-29 00:00:00.000000
1 2015-07-18 17:54:59.070381 2014-01-11 17:55:10.053381
df['diff'] = df.apply(yearmonthdiff, axis=1)
In [97]: df
Out[97]:
start end diff
0 2000-01-10 00:00:00.000000 1970-04-29 00:00:00.000000 29y9m
1 2015-07-18 17:54:59.070381 2014-01-11 17:55:10.053381 1y6m
Similar to @DeepSpace's answer, here's a SAS-like implementation:
import pandas as pd
from dateutil import relativedelta
def intck_month(start, end):
    rd = relativedelta.relativedelta(pd.to_datetime(end), pd.to_datetime(start))
    return rd.years, rd.months
Usage:
>> years, months = intck_month('1960-01-01', '1970-03-01')
>> print(years)
10
>> print(months)
2
This subtracts the two dates, converts each resulting timedelta to a string, splits it on spaces and takes the first item, which is the number of days. That value is converted to an integer and divided by 365 to give years:
ad['yrs']=(ad.last_dt-ad.dt).apply(lambda x: str(x).split(' ')[0]).apply(lambda x: int(x)/365)
You can find the total number of seconds and calculate the rest:
diff = pd.to_datetime('2023-01-01') - pd.to_datetime('2021-01-01')
diff.total_seconds() / (365 * 24 * 60 * 60) # years
# 2.0
diff.total_seconds() / (30 * 24 * 60 * 60) # months
# 24.333333333333332
diff.total_seconds() / (24 * 60 * 60) # days
# 730.0
For Pandas Series use the dt accessor: df['diff'].dt.total_seconds().
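For a calendar-aware result in the "29y8m" style the question asks for, relativedelta can be applied row by row. This is a minimal sketch that assumes both columns already hold datetimes; the function name ym_diff is hypothetical.

```python
import pandas as pd
from dateutil.relativedelta import relativedelta

df = pd.DataFrame({
    "Name": ["A"],
    "start": pd.to_datetime(["2000-01-10"]),
    "end": pd.to_datetime(["1970-04-29"]),
})

def ym_diff(row):
    """Format the calendar difference start - end as e.g. '29y8m'."""
    rd = relativedelta(row["start"], row["end"])
    return f"{rd.years}y{rd.months}m"

# axis=1 passes one row at a time to ym_diff.
df["diff"] = df.apply(ym_diff, axis=1)
print(df)
```

Unlike the fixed 365-day and 30-day divisions above, relativedelta counts real calendar months, which is why it reports 8 months and 12 days for this pair of dates.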
