Signed time deltas to signed seconds in Pandas - python

Consider the following series:
> df['time_delta']
0 -1 days +00:08:11
1 0 days 01:57:46
2 0 days 00:58:34
3 0 days 17:30:23
4 -1 days +21:44:34
5 -2 days +22:01:56
6 0 days 03:18:57
7 -1 days +21:44:48
8 -1 days +00:07:56
Name: time_delta, dtype: timedelta64[ns]
Say I want to convert this timedelta to total signed seconds. That is:
Positive deltas should convert to positive seconds
Negative deltas should convert to negative seconds
For example:
0 days 00:01:05 => 65 seconds
-1 days +23:58:30 => -90 seconds
How can I get this conversion?
Failed attempt
When I try the usual:
temp_df['seconds'] = temp_df['time_delta'].dt.seconds
I end up with:
time_delta seconds
0 -1 days +00:08:11 491
1 0 days 01:57:46 7066
2 0 days 00:58:34 3514
3 0 days 17:30:23 63023
4 -1 days +21:44:34 78274
5 -2 days +22:01:56 79316
6 0 days 03:18:57 11937
7 -1 days +21:44:48 78288
8 -1 days +00:07:56 476
which correctly handled positive deltas, but not the negative ones. To see this, note that the negative deltas seem to ignore the sign of the day offset. That is, in the example above:
-1 days +21:44:48 should convert to -8112 seconds, not 78288 seconds (wrong sign and value).

If it's a Timedelta object, just divide it by Timedelta(seconds=1):
>>> pd.Timedelta(days=-1) / pd.Timedelta(seconds=1)
-86400.0
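The same division also works element-wise on a whole timedelta64 Series. A minimal sketch, assuming a small made-up Series `s` (not the asker's data):

```python
import pandas as pd

# Hypothetical sample data: one positive and one negative delta.
s = pd.Series([pd.Timedelta(seconds=65), pd.Timedelta(seconds=-90)])

# Dividing by a one-second Timedelta yields signed float seconds.
seconds = s / pd.Timedelta(seconds=1)
print(seconds.tolist())  # [65.0, -90.0]
```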

just call abs prior to dt.total_seconds to get the absolute values:
df['seconds'] = df['time_delta'].abs().dt.total_seconds()
Example:
In [63]:
import datetime as dt
import pandas as pd
df = pd.DataFrame({'date_time':pd.date_range(dt.datetime(2015,1,1,12,10,32), dt.datetime(2015,1,3,12,12,30,2))})
df['time_delta'] = df['date_time'] - dt.datetime(2015,1,2)
df
Out[63]:
date_time time_delta
0 2015-01-01 12:10:32 -1 days +12:10:32
1 2015-01-02 12:10:32 0 days 12:10:32
2 2015-01-03 12:10:32 1 days 12:10:32
In [64]:
df['time_delta'].abs().dt.total_seconds()
Out[64]:
0 42568
1 43832
2 130232
Name: time_delta, dtype: float64
To add the signs back you can compare against pd.Timedelta(0):
In [78]:
df['seconds'] = df['time_delta'].abs().dt.total_seconds()
df.loc[df['time_delta'] < pd.Timedelta(0), 'seconds'] = -df['seconds']
df
Out[78]:
date_time time_delta seconds
0 2015-01-01 12:10:32 -1 days +12:10:32 -42568
1 2015-01-02 12:10:32 0 days 12:10:32 43832
2 2015-01-03 12:10:32 1 days 12:10:32 130232
However, I think @Ami Tamory's answer is superior
EDIT
After sleeping on this I realised that this is just dt.total_seconds:
In [137]:
df['time_delta'].dt.total_seconds()
Out[137]:
0 -42568
1 43832
2 130232
Name: time_delta, dtype: float64
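To see why .dt.seconds gave misleading numbers, it may help to look at how a negative Timedelta is represented: a negative number of days plus a non-negative remainder. A small sketch with a made-up value (the one from the question's example):

```python
import pandas as pd

s = pd.Series([pd.Timedelta(seconds=-8112)])  # displayed as -1 days +21:44:48

# .dt.days and .dt.seconds are components: days may be negative,
# while seconds always lies in the range [0, 86400).
print(s.dt.days.tolist())     # [-1]
print(s.dt.seconds.tolist())  # [78288]

# Recombining the components agrees with .dt.total_seconds().
recombined = s.dt.days * 86400 + s.dt.seconds
print(recombined.tolist())            # [-8112]
print(s.dt.total_seconds().tolist())  # [-8112.0]
```

So .dt.seconds is not wrong as such; it is just one component of the value rather than the total.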

Related

Convert multiple time format object as datetime format

I have a dataframe with a list of time value as object and needed to convert them to datetime, the issue is, they are not on the same format so when I try:
df['Total call time'] = pd.to_datetime(df['Total call time'], format='%H:%M:%S')
it gives me an error
ValueError: time data '3:22' does not match format '%H:%M:%S' (match)
or if use this code
df['Total call time'] = pd.to_datetime(df['Total call time'], format='%H:%M')
I get this error
ValueError: unconverted data remains: :58
These are the values on my data
Total call time
2:04:07
3:22:41
2:30:41
2:19:06
1:45:55
1:30:08
1:32:15
1:43:28
**45:48**
1:41:40
5:08:37
**3:22**
4:29:05
2:47:25
2:39:29
2:29:32
2:09:52
3:31:57
2:27:58
2:34:28
3:14:10
2:12:10
2:46:58
times = """\
2:04:07
3:22:41
2:30:41
2:19:06
1:45:55
1:30:08
1:32:15
1:43:28
45:48
1:41:40
5:08:37
3:22
4:29:05
2:47:25
2:39:29
2:29:32
2:09:52
3:31:57
2:27:58
2:34:28
3:14:10
2:12:10
2:46:58""".split()
import pandas as pd
df = pd.DataFrame(times, columns=['elapsed'])
def pad(s):
    # 'M:SS' -> '00:0M:SS'
    if len(s) == 4:
        return '00:0'+s
    # 'MM:SS' -> '00:MM:SS'
    elif len(s) == 5:
        return '00:'+s
    return s
print(pd.to_timedelta(df['elapsed'].apply(pad)))
Output:
0 0 days 02:04:07
1 0 days 03:22:41
2 0 days 02:30:41
3 0 days 02:19:06
4 0 days 01:45:55
5 0 days 01:30:08
6 0 days 01:32:15
7 0 days 01:43:28
8 0 days 00:45:48
9 0 days 01:41:40
10 0 days 05:08:37
11 0 days 00:03:22
12 0 days 04:29:05
13 0 days 02:47:25
14 0 days 02:39:29
15 0 days 02:29:32
16 0 days 02:09:52
17 0 days 03:31:57
18 0 days 02:27:58
19 0 days 02:34:28
20 0 days 03:14:10
21 0 days 02:12:10
22 0 days 02:46:58
Name: elapsed, dtype: timedelta64[ns]
Alternatively to grovina's answer ... instead of using apply you can directly use the dt accessor.
Here's a sample:
>>> data = [['2017-12-01'], ['2017-12-30'], ['2018-01-01']]
>>> df = pd.DataFrame(data=data, columns=['date'])
>>> df
date
0 2017-12-01
1 2017-12-30
2 2018-01-01
>>> df.date
0 2017-12-01
1 2017-12-30
2 2018-01-01
Name: date, dtype: object
Note how df.date is an object? Let's turn it into a date like you want
>>> df.date = pd.to_datetime(df.date)
>>> df.date
0 2017-12-01
1 2017-12-30
2 2018-01-01
Name: date, dtype: datetime64[ns]
The format you want is for string formatting. I don't think you'll be able to convert the actual datetime64 to look like that format. For now, let's make a newly formatted string version of your date in a separate column
>>> df['new_formatted_date'] = df.date.dt.strftime('%d/%m/%y %H:%M')
>>> df.new_formatted_date
0 01/12/17 00:00
1 30/12/17 00:00
2 01/01/18 00:00
Name: new_formatted_date, dtype: object
Finally, since the df.date column is now of dtype datetime64, you can use the dt accessor right on it. No need to use apply
>>> df['month'] = df.date.dt.month
>>> df['day'] = df.date.dt.day
>>> df['year'] = df.date.dt.year
>>> df['hour'] = df.date.dt.hour
>>> df['minute'] = df.date.dt.minute
>>> df
date new_formatted_date month day year hour minute
0 2017-12-01 01/12/17 00:00 12 1 2017 0 0
1 2017-12-30 30/12/17 00:00 12 30 2017 0 0
2 2018-01-01 01/01/18 00:00
Another idea is to test whether the value contains two colons and, if not, complete it before converting with to_timedelta. If the number before the first : is at most 23, the value is parsed as HH:MM (so :00 is appended); if it is greater, it is parsed as MM:SS (so 00: is prepended):
import numpy as np
# m1: value does not have the full H:MM:SS form
m1 = df['Total call time'].str.count(':').ne(2)
# m2: leading number is too large to be an hour
m2 = df['Total call time'].str.extract(r'^(\d+):', expand=False).astype(float).gt(23)
s = np.select([m1 & m2, m1 & ~m2],
              ['00:' + df['Total call time'], df['Total call time'] + ':00'],
              df['Total call time'])
df['Total call time'] = pd.to_timedelta(s)
print (df)
Total call time
0 0 days 02:04:07
1 0 days 03:22:41
2 0 days 02:30:41
3 0 days 02:19:06
4 0 days 01:45:55
5 0 days 01:30:08
6 0 days 01:32:15
7 0 days 01:43:28
8 0 days 00:45:48
9 0 days 01:41:40
10 0 days 05:08:37
11 0 days 03:22:00
12 0 days 04:29:05
13 0 days 02:47:25
14 0 days 02:39:29
15 0 days 02:29:32
16 0 days 02:09:52
17 0 days 03:31:57
18 0 days 02:27:58
19 0 days 02:34:28
20 0 days 03:14:10
21 0 days 02:12:10
22 0 days 02:46:58

Convert string hours to minute pd.eval

I want to convert all rows of my DataFrame that contains hours and minutes into minutes only.
I have a dataframe that looks like this:
df=
time
0 8h30
1 14h07
2 08h30
3 7h50
4 8h0
5 8h15
6 6h15
I'm using the following method to convert:
df['time'] = pd.eval(
df['time'].replace(['h'], ['*60+'], regex=True))
Output
SyntaxError: invalid syntax
I think the error comes from the format of the hour; maybe pd.eval can't accept 08h30 or 8h0. How can I solve this problem?
Pandas can already handle such strings if the units are included in the string. While 14h07 can't be parsed (why assume 07 is minutes?), 14h07m can be converted to a Timedelta:
>>> pd.to_timedelta("14h07m")
Timedelta('0 days 14:07:00')
Given this dataframe :
d1 = pd.DataFrame(['8h30m', '14h07m', '08h30m', '8h0m'],
columns=['time'])
You can convert the time series into a Timedelta series with pd.to_timedelta :
>>> d1['tm'] = pd.to_timedelta(d1['time'])
>>> d1
time tm
0 8h30m 0 days 08:30:00
1 14h07m 0 days 14:07:00
2 08h30m 0 days 08:30:00
3 8h0m 0 days 08:00:00
To handle the missing minutes unit in the original data, just append m:
d1['tm'] = pd.to_timedelta(d1['time'] + 'm')
Once you have a Timedelta you can calculate hours and minutes.
The components of the values can be retrieved with Timedelta.components
>>> d1.tm.dt.components.hours
0 8
1 14
2 8
3 8
Name: hours, dtype: int64
To get the total in minutes (or seconds, or hours), convert the column to that unit, here minutes:
>>> d1.tm.astype('timedelta64[m]')
0 510.0
1 847.0
2 510.0
3 480.0
Name: tm, dtype: float64
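Depending on the pandas version, astype('timedelta64[m]') may not be available: as far as I know, pandas 2.0 and later only accept the units s, ms, us and ns in that call and raise a TypeError otherwise. A sketch that should not depend on the version, dividing by a one-minute Timedelta instead:

```python
import pandas as pd

d1 = pd.DataFrame(['8h30m', '14h07m', '08h30m', '8h0m'], columns=['time'])
d1['tm'] = pd.to_timedelta(d1['time'])

# Division by a Timedelta gives plain floats in the chosen unit.
total_minutes = d1['tm'] / pd.Timedelta(minutes=1)
print(total_minutes.tolist())  # [510.0, 847.0, 510.0, 480.0]
```

d1['tm'].dt.total_seconds() / 60 would give the same values.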
Bringing all the operations together :
>>> d1['tm'] = pd.to_timedelta(d1['time'])
>>> d2 = (d1.assign(h=d1.tm.dt.components.hours,
... m=d1.tm.dt.components.minutes,
... total_minutes=d1.tm.astype('timedelta64[m]')))
>>>
>>> d2
time tm h m total_minutes
0 8h30m 0 days 08:30:00 8 30 510.0
1 14h07m 0 days 14:07:00 14 7 847.0
2 08h30m 0 days 08:30:00 8 30 510.0
3 8h0m 0 days 08:00:00 8 0 480.0
To avoid having to trim leading zeros, an alternative approach:
df[['h', 'm']] = df['time'].str.split('h', expand=True).astype(int)
df['total_min'] = df['h']*60 + df['m']
Result:
time h m total_min
0 8h30 8 30 510
1 14h07 14 7 847
2 08h30 8 30 510
3 7h50 7 50 470
4 8h0 8 0 480
5 8h15 8 15 495
6 6h15 6 15 375
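As for the original SyntaxError: one likely cause (my reading, not confirmed in the question) is that after the replacement '08h30' becomes '08*60+30', and Python does not allow a decimal integer literal with a leading zero, whereas int('08') is fine. A quick check using only the standard library; `parses` is a hypothetical helper:

```python
import ast

def parses(expr):
    """Return True if expr is a syntactically valid Python expression."""
    try:
        ast.parse(expr, mode="eval")
        return True
    except SyntaxError:
        return False

print(parses("8*60+30"))   # True
print(parses("08*60+30"))  # False: leading zero in an integer literal
print(int("08"))           # 8: string conversion accepts the leading zero
```

This is why splitting on 'h' and calling astype(int) handles 08h30 and 14h07 without any trimming.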
Just to give an alternative approach with kind of the same elements as above you could do:
df = pd.DataFrame(data=["8h30", "14h07", "08h30", "7h50", "8h0 ", "8h15", "6h15"],
columns=["time"])
First split your column on the "h"
hm = df["time"].str.split("h", expand=True)
Then combine the columns again, but zeropad time hours and minutes in order to make valid time strings:
df2 = hm[0].str.strip().str.zfill(2) + hm[1].str.strip().str.zfill(2)
Then convert the string column with proper values to a date time column:
df3 = pd.to_datetime(df2, format="%H%M")
Finally, calculate the number of minutes by subtracting a zero time (to get timedeltas) and dividing by a one-minute timedelta:
zerotime= pd.to_datetime("0000", format="%H%M")
df['minutes'] = (df3 - zerotime) / pd.Timedelta(minutes=1)
The results look like:
time minutes
0 8h30 510.0
1 14h07 847.0
2 08h30 510.0
3 7h50 470.0
4 8h0 480.0
5 8h15 495.0
6 6h15 375.0

Pandas read format %D:%H:%M:%S with python

Currently I am reading in a data frame with timestamps from film in the form 00(days):00(hours, rolling over to a new day at 24):00(min):00(sec).
pandas reads the time formats HH:MM:SS and YYYY:MM:DD HH:MM:SS fine.
Is there a way of having pandas read a duration such as DD:HH:MM:SS?
Alternatively, using timedelta, how would I go about folding the DD into HH in the data frame so that pandas can make it "1 day HH:MM:SS", for example?
Data sample
00:00:00:00
00:07:33:57
02:07:02:13
00:00:13:11
00:00:10:11
00:00:00:00
00:06:20:06
01:12:13:25
Expected output for last sample
36:13:25
Thanks
If you want timedelta objects, a simple way is to replace the first colon with days :
df['timedelta'] = pd.to_timedelta(df['col'].str.replace(':', 'days ', n=1))
output:
col timedelta
0 00:00:00:00 0 days 00:00:00
1 00:07:33:57 0 days 07:33:57
2 02:07:02:13 2 days 07:02:13
3 00:00:13:11 0 days 00:13:11
4 00:00:10:11 0 days 00:10:11
5 00:00:00:00 0 days 00:00:00
6 00:06:20:06 0 days 06:20:06
7 01:12:13:25 1 days 12:13:25
>>> df.dtypes
col object
timedelta timedelta64[ns]
dtype: object
From there it's also relatively easy to combine the days and hours as string:
c = df['timedelta'].dt.components
df['str_format'] = ((c['hours']+c['days']*24).astype(str)
+df['col'].str.split('(?=:)', n=2).str[-1]).str.zfill(8)
output:
col timedelta str_format
0 00:00:00:00 0 days 00:00:00 00:00:00
1 00:07:33:57 0 days 07:33:57 07:33:57
2 02:07:02:13 2 days 07:02:13 55:02:13
3 00:00:13:11 0 days 00:13:11 00:13:11
4 00:00:10:11 0 days 00:10:11 00:10:11
5 00:00:00:00 0 days 00:00:00 00:00:00
6 00:06:20:06 0 days 06:20:06 06:20:06
7 01:12:13:25 1 days 12:13:25 36:13:25
Convert the days separately, add them to the times, and finally apply a custom formatting function:
def f(x):
    # split the total seconds into hours, minutes and seconds
    ts = x.total_seconds()
    hours, remainder = divmod(ts, 3600)
    minutes, seconds = divmod(remainder, 60)
    return ('{}:{:02d}:{:02d}').format(int(hours), int(minutes), int(seconds))
d = pd.to_timedelta(df['col'].str[:2].astype(int), unit='d')
td = pd.to_timedelta(df['col'].str[3:])
df['col'] = d.add(td).apply(f)
print (df)
col
0 0:00:00
1 7:33:57
2 55:02:13
3 0:13:11
4 0:10:11
5 0:00:00
6 6:20:06
7 36:13:25
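If the fields are not guaranteed to be exactly two characters wide, a variant that splits on the colons may be more forgiving than slicing with str[:2]. A sketch, assuming every value has exactly four colon-separated integer fields; the column name `col` follows the answers above, and `to_hms` is a hypothetical helper:

```python
import pandas as pd

df = pd.DataFrame({'col': ['00:07:33:57', '02:07:02:13', '01:12:13:25']})

def to_hms(value):
    """Convert 'DD:HH:MM:SS' to 'H:MM:SS' with the days folded into hours."""
    days, hours, minutes, seconds = (int(part) for part in value.split(':'))
    return '{}:{:02d}:{:02d}'.format(days * 24 + hours, minutes, seconds)

df['hms'] = df['col'].map(to_hms)
print(df['hms'].tolist())  # ['7:33:57', '55:02:13', '36:13:25']
```

This stays in plain string and integer arithmetic, so it does not produce timedelta objects; use one of the to_timedelta approaches if those are needed.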

Pandas and DateTime TypeError: cannot compare a TimedeltaIndex with type float

I have a pandas Series of time differences that looks like:
print(delta_t)
1 0 days 00:00:59
3 0 days 00:04:22
6 0 days 00:00:56
8 0 days 00:01:21
19 0 days 00:01:09
22 0 days 00:00:36
...
(the full DataFrame had a bunch of NaNs which I dropped).
I'd like to know which delta_t's are less than 1 day, 1 hour, 1 minute,
so I tried:
delta_t_lt1day = delta_t[np.where(delta_t < 30.)]
but then got a:
TypeError: cannot compare a TimedeltaIndex with type float
Little help?!?!
Assuming your Series is in timedelta format, you can skip the np.where, and index using something like this, where you compare your actual values to other timedeltas, using the appropriate units:
delta_t_lt1day = delta_t[delta_t < pd.Timedelta(1,'D')]
delta_t_lt1hour = delta_t[delta_t < pd.Timedelta(1,'h')]
delta_t_lt1minute = delta_t[delta_t < pd.Timedelta(1,'m')]
You'll get the following series:
>>> delta_t_lt1day
0
1 00:00:59
3 00:04:22
6 00:00:56
8 00:01:21
19 00:01:09
22 00:00:36
Name: 1, dtype: timedelta64[ns]
>>> delta_t_lt1hour
0
1 00:00:59
3 00:04:22
6 00:00:56
8 00:01:21
19 00:01:09
22 00:00:36
Name: 1, dtype: timedelta64[ns]
>>> delta_t_lt1minute
0
1 00:00:59
6 00:00:56
22 00:00:36
Name: 1, dtype: timedelta64[ns]
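If a plain numeric threshold is more convenient, the Series can instead be converted to seconds first and then compared with a number. A sketch using made-up values that resemble the question's data:

```python
import pandas as pd

delta_t = pd.Series(
    pd.to_timedelta(['00:00:59', '00:04:22', '00:00:56', '00:01:21']),
    index=[1, 3, 6, 8],
)

# Timedelta vs. Timedelta comparison, as in the answer above.
lt_1min = delta_t[delta_t < pd.Timedelta(1, 'min')]

# Equivalent numeric comparison after converting to float seconds.
lt_60s = delta_t[delta_t.dt.total_seconds() < 60]

print(lt_1min.index.tolist())  # [1, 6]
print(lt_60s.index.tolist())   # [1, 6]
```

Either way, both sides of the comparison end up with the same kind of value, which is what the original float comparison was missing.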
You could use the Timedelta class:
import pandas as pd
deltas = pd.to_timedelta(['0 days 00:00:59',
'0 days 00:04:22',
'0 days 00:00:56',
'0 days 00:01:21',
'0 days 00:01:09',
'0 days 00:31:09',
'0 days 00:00:36'])
for e in deltas[deltas < pd.Timedelta(value=30, unit='m')]:
    print(e)
Output
0 days 00:00:59
0 days 00:04:22
0 days 00:00:56
0 days 00:01:21
0 days 00:01:09
0 days 00:00:36
Note that this filters out '0 days 00:31:09' as expected. The expression pd.Timedelta(value=30, unit='m') creates a time delta of 30 minutes.

How to plot timedelta data from a pandas DataFrame?

I am trying to plot a Series (a column from a dataframe, to be precise). It seems to have valid data in the format hh:mm:ss (timedelta64)
In [14]: x5.task_a.describe()
Out[14]:
count 165
mean 0 days 03:35:41.121212
std 0 days 07:07:40.950819
min 0 days 00:00:06
25% 0 days 00:37:13
50% 0 days 01:28:17
75% 0 days 03:41:32
max 2 days 12:32:26
Name: task_a, dtype: object
In [15]: x5.task_a.head()
Out[15]:
wbdqueue_id
26868 00:26:11
26869 02:08:28
26872 00:26:07
26874 00:48:22
26875 00:26:17
Name: task_a, dtype: timedelta64[ns]
But when I try to plot it, I get an error saying there is no numeric data in the Empty 'DataFrame'.
I've tried:
x5.task_a.plot.kde()
and
x5.plot()
where x5 is the DataFrame with several Series of such timedelta data.
TypeError: Empty 'DataFrame': no numeric data to plot
I see that one can generate series of random values and plot it.
What am I doing wrong?
Convert to any logical numeric values, like hours or minutes, and then use .plot.kde()
(x5.task_a / np.timedelta64(1, 'h')).plot.kde()
Details
In [149]: x5
Out[149]:
task_a
0 0 days 22:27:46.684800
1 1 days 00:20:43.036800
2 0 days 12:16:24.873600
3 1 days 11:10:14.880000
4 1 days 03:31:05.548800
5 1 days 05:20:52.944000
6 1 days 00:09:09.590400
7 0 days 13:53:50.179200
8 1 days 04:08:57.695999
9 0 days 14:14:53.088000
In [150]: x5.task_a / np.timedelta64(1, 'h') # convert to hours
Out[150]:
0 22.462968
1 24.345288
2 12.273576
3 35.170800
4 27.518208
5 29.348040
6 24.152664
7 13.897272
8 28.149360
9 14.248080
Name: task_a, dtype: float64
Or to minutes
In [151]: x5.task_a / np.timedelta64(1, 'm')
Out[151]:
0 1347.77808
1 1460.71728
2 736.41456
3 2110.24800
4 1651.09248
5 1760.88240
6 1449.15984
7 833.83632
8 1688.96160
9 854.88480
Name: task_a, dtype: float64
another way using total_seconds
In [153]: x5.task_a.dt.total_seconds() / 60
Out[153]:
0 1347.77808
1 1460.71728
2 736.41456
3 2110.24800
4 1651.09248
5 1760.88240
6 1449.15984
7 833.83632
8 1688.96160
9 854.88480
Name: task_a, dtype: float64
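Put together as a self-contained sketch, with a few made-up durations standing in for the asker's task_a column:

```python
import pandas as pd

# Hypothetical stand-in for x5.task_a.
task_a = pd.Series(pd.to_timedelta(['00:26:11', '02:08:28', '00:48:22']))

# Plotting needs numeric data, so express the durations as float hours.
hours = task_a.dt.total_seconds() / 3600
print(hours.dtype)  # float64

# With matplotlib installed, hours.plot() or hours.plot.hist() then works;
# hours.plot.kde() additionally requires SciPy.
```

Choosing the unit up front (hours here) also gives the plot axis a meaningful scale.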
You can convert the TimedeltaIndex to total_seconds
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
idx = pd.date_range('20140101', '20140201')
df = pd.DataFrame(index=idx)
df['col0'] = np.random.randn(len(idx))
# DatetimeIndex.shift moves the timestamps by the index frequency rather than
# shifting positions, so take positional differences through a Series instead
diff_idx = idx.to_series().diff().fillna(pd.Timedelta(0)).dt.total_seconds()
df['diff_dt'] = diff_idx
df['diff_dt'].plot()
