Convert string hours to minute pd.eval - python

I want to convert every row of a DataFrame column that contains hours and minutes into minutes only.
I have a dataframe that looks like this:
df=
time
0 8h30
1 14h07
2 08h30
3 7h50
4 8h0
5 8h15
6 6h15
I'm using the following method to convert:
df['time'] = pd.eval(
df['time'].replace(['h'], ['*60+'], regex=True))
Output
SyntaxError: invalid syntax
I think the error comes from the format of the hours: maybe pd.eval can't accept 08h30 or 8h0. How can I solve this problem?

Pandas can already handle such strings if the units are included in the string. While 14h07 can't be parsed (why assume 07 is minutes?), 14h07m can be converted to a Timedelta:
>>> pd.to_timedelta("14h07m")
Timedelta('0 days 14:07:00')
Given this dataframe :
d1 = pd.DataFrame(['8h30m', '14h07m', '08h30m', '8h0m'],
columns=['time'])
You can convert the time series into a Timedelta series with pd.to_timedelta :
>>> d1['tm'] = pd.to_timedelta(d1['time'])
>>> d1
time tm
0 8h30m 0 days 08:30:00
1 14h07m 0 days 14:07:00
2 08h30m 0 days 08:30:00
3 8h0m 0 days 08:00:00
To handle the missing minutes unit in the original data, just append m:
d1['tm'] = pd.to_timedelta(d1['time'] + 'm')
Once you have a Timedelta you can calculate hours and minutes.
The components of the values can be retrieved with Timedelta.components
>>> d1.tm.dt.components.hours
0 8
1 14
2 8
3 8
Name: hours, dtype: int64
To get the total as minutes, seconds or hours, cast to the matching unit, here minutes:
>>> d1.tm.astype('timedelta64[m]')
0 510.0
1 847.0
2 510.0
3 480.0
Name: tm, dtype: float64
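Newer pandas releases may reject the astype('timedelta64[m]') cast, since only a few resolutions are supported for timedelta data. A minimal sketch that avoids the cast, assuming the unit-less strings from the question (the names raw, tm and total_minutes are illustrative only), divides the Timedelta series by a one-minute Timedelta:

```python
import pandas as pd

# Original strings from the question, without a minutes unit.
raw = pd.Series(['8h30', '14h07', '08h30', '7h50', '8h0', '8h15', '6h15'])

# Append 'm' so pandas can parse the minutes part, then convert.
tm = pd.to_timedelta(raw + 'm')

# Dividing by a one-minute Timedelta yields the total minutes as floats.
total_minutes = tm / pd.Timedelta(minutes=1)
print(total_minutes.tolist())
```

The same division works with pd.Timedelta(hours=1) or pd.Timedelta(seconds=1) if another unit is wanted.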
Bringing all the operations together :
>>> d1['tm'] = pd.to_timedelta(d1['time'])
>>> d2 = (d1.assign(h=d1.tm.dt.components.hours,
... m=d1.tm.dt.components.minutes,
... total_minutes=d1.tm.astype('timedelta64[m]')))
>>>
>>> d2
time tm h m total_minutes
0 8h30m 0 days 08:30:00 8 30 510.0
1 14h07m 0 days 14:07:00 14 7 847.0
2 08h30m 0 days 08:30:00 8 30 510.0
3 8h0m 0 days 08:00:00 8 0 480.0

To avoid having to trim leading zeros, an alternative approach:
df[['h', 'm']] = df['time'].str.split('h', expand=True).astype(int)
df['total_min'] = df['h']*60 + df['m']
Result:
time h m total_min
0 8h30 8 30 510
1 14h07 14 7 847
2 08h30 8 30 510
3 7h50 7 50 470
4 8h0 8 0 480
5 8h15 8 15 495
6 6h15 6 15 375
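The leading-zero problem can be reproduced without pandas. This sketch assumes pd.eval rejects the same integer literals that Python's own parser rejects, which would explain the SyntaxError in the question. The helper name parses_as_expression is made up for illustration:

```python
import ast


def parses_as_expression(expr):
    """Return True if `expr` is a syntactically valid Python expression."""
    try:
        ast.parse(expr, mode="eval")
        return True
    except SyntaxError:
        return False


# Strings produced by replacing 'h' with '*60+' in the question's data.
for expr in ["8*60+30", "14*60+07", "08*60+30", "8*60+0"]:
    print(expr, parses_as_expression(expr))
```

Literals such as 07 and 08 are not valid integers in Python 3, so splitting the string and casting each part with astype(int) sidesteps the issue.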

As another alternative that uses much the same elements as above, you could do:
df = pd.DataFrame(data=["8h30", "14h07", "08h30", "7h50", "8h0 ", "8h15", "6h15"],
columns=["time"])
First split your column on the "h":
hm = df["time"].str.split("h", expand=True)
Then combine the columns again, but zero-pad the hours and minutes in order to make valid time strings:
df2 = hm[0].str.strip().str.zfill(2) + hm[1].str.strip().str.zfill(2)
Then convert the string column with proper values to a date time column:
df3 = pd.to_datetime(df2, format="%H%M")
Finally, calculate the number of minutes by subtracting a zero time (to get timedeltas) and dividing by a one-minute Timedelta:
zerotime= pd.to_datetime("0000", format="%H%M")
df['minutes'] = (df3 - zerotime) / pd.Timedelta(minutes=1)
The results look like:
time minutes
0 8h30 510.0
1 14h07 847.0
2 08h30 510.0
3 7h50 470.0
4 8h0 480.0
5 8h15 495.0
6 6h15 375.0

Related

Pandas read format %D:%H:%M:%S with python

Currently I am reading in a data frame with timestamps from film in the format DD:HH:MM:SS, i.e. 00 (days):00 (hours, rolling over into days at 24):00 (min):00 (sec).
pandas reads the time formats HH:MM:SS and YYYY:MM:DD HH:MM:SS fine.
Is there a way to have pandas read a duration such as DD:HH:MM:SS?
Alternatively, using timedelta, how would I go about folding the DD into HH in the data frame, so that pandas can make it "1 day HH:MM:SS", for example?
Data sample
00:00:00:00
00:07:33:57
02:07:02:13
00:00:13:11
00:00:10:11
00:00:00:00
00:06:20:06
01:12:13:25
Expected output for last sample
36:13:25
Thanks
If you want timedelta objects, a simple way is to replace the first colon with days :
df['timedelta'] = pd.to_timedelta(df['col'].str.replace(':', 'days ', n=1))
output:
col timedelta
0 00:00:00:00 0 days 00:00:00
1 00:07:33:57 0 days 07:33:57
2 02:07:02:13 2 days 07:02:13
3 00:00:13:11 0 days 00:13:11
4 00:00:10:11 0 days 00:10:11
5 00:00:00:00 0 days 00:00:00
6 00:06:20:06 0 days 06:20:06
7 01:12:13:25 1 days 12:13:25
>>> df.dtypes
col object
timedelta timedelta64[ns]
dtype: object
From there it's also relatively easy to combine the days and hours as string:
c = df['timedelta'].dt.components
df['str_format'] = ((c['hours']+c['days']*24).astype(str)
+df['col'].str.split('(?=:)', n=2).str[-1]).str.zfill(8)
output:
col timedelta str_format
0 00:00:00:00 0 days 00:00:00 00:00:00
1 00:07:33:57 0 days 07:33:57 07:33:57
2 02:07:02:13 2 days 07:02:13 55:02:13
3 00:00:13:11 0 days 00:13:11 00:13:11
4 00:00:10:11 0 days 00:10:11 00:10:11
5 00:00:00:00 0 days 00:00:00 00:00:00
6 00:06:20:06 0 days 06:20:06 06:20:06
7 01:12:13:25 1 days 12:13:25 36:13:25
Convert the days separately, add them to the times, and finally apply a custom formatting function:
def f(x):
    # Format a Timedelta as H:MM:SS, folding whole days into the hours.
    ts = x.total_seconds()
    hours, remainder = divmod(ts, 3600)
    minutes, seconds = divmod(remainder, 60)
    return '{}:{:02d}:{:02d}'.format(int(hours), int(minutes), int(seconds))
d = pd.to_timedelta(df['col'].str[:2].astype(int), unit='d')
td = pd.to_timedelta(df['col'].str[3:])
df['col'] = d.add(td).apply(f)
print(df)
col
0 0:00:00
1 7:33:57
2 55:02:13
3 0:13:11
4 0:10:11
5 0:00:00
6 6:20:06
7 36:13:25
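The idea of handling the days separately can also be written with plain integer arithmetic. This is only a sketch on three of the sample values; the column names days, hours, minutes and seconds are chosen here for illustration:

```python
import pandas as pd

col = pd.Series(['00:07:33:57', '02:07:02:13', '01:12:13:25'])

# Split DD:HH:MM:SS into four integer columns.
parts = col.str.split(':', expand=True).astype(int)
parts.columns = ['days', 'hours', 'minutes', 'seconds']

# Fold everything into a total number of seconds.
total_seconds = (((parts['days'] * 24 + parts['hours']) * 60
                  + parts['minutes']) * 60 + parts['seconds'])

# Build a Timedelta series from the seconds.
td = pd.to_timedelta(total_seconds, unit='s')

# Total hours, with whole days folded in.
total_hours = total_seconds // 3600
print(td)
print(total_hours.tolist())
```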

Pandas and DateTime TypeError: cannot compare a TimedeltaIndex with type float

I have a pandas Series of time differences that looks like:
print(delta_t)
1 0 days 00:00:59
3 0 days 00:04:22
6 0 days 00:00:56
8 0 days 00:01:21
19 0 days 00:01:09
22 0 days 00:00:36
...
(the full DataFrame had a bunch of NaNs which I dropped).
I'd like to know which delta_t's are less than 1 day, 1 hour, 1 minute,
so I tried:
delta_t_lt1day = delta_t[np.where(delta_t < 30.)]
but then got a:
TypeError: cannot compare a TimedeltaIndex with type float
Little help?!?!
Assuming your Series is in timedelta format, you can skip the np.where, and index using something like this, where you compare your actual values to other timedeltas, using the appropriate units:
delta_t_lt1day = delta_t[delta_t < pd.Timedelta(1,'D')]
delta_t_lt1hour = delta_t[delta_t < pd.Timedelta(1,'h')]
delta_t_lt1minute = delta_t[delta_t < pd.Timedelta(1,'m')]
You'll get the following series:
>>> delta_t_lt1day
0
1 00:00:59
3 00:04:22
6 00:00:56
8 00:01:21
19 00:01:09
22 00:00:36
Name: 1, dtype: timedelta64[ns]
>>> delta_t_lt1hour
0
1 00:00:59
3 00:04:22
6 00:00:56
8 00:01:21
19 00:01:09
22 00:00:36
Name: 1, dtype: timedelta64[ns]
>>> delta_t_lt1minute
0
1 00:00:59
6 00:00:56
22 00:00:36
Name: 1, dtype: timedelta64[ns]
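If a plain number is preferred as the threshold, one option is to convert the timedeltas to float seconds first. The sketch below rebuilds the sample series from the question (the index values are copied from it) and is meant as an illustration rather than the only way to do it:

```python
import pandas as pd

# Rebuild the sample series; to_timedelta keeps the index of the input Series.
delta_t = pd.to_timedelta(pd.Series(
    ['00:00:59', '00:04:22', '00:00:56', '00:01:21', '00:01:09', '00:00:36'],
    index=[1, 3, 6, 8, 19, 22],
))

# Convert to float seconds; a bare number is then a valid threshold.
seconds = delta_t.dt.total_seconds()
lt_1_minute = delta_t[seconds < 60]
print(lt_1_minute)
```

Comparing against pd.Timedelta directly, as above, keeps the units explicit; the float route is mostly useful when the thresholds already exist as numbers.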
You could use the Timedelta class:
import pandas as pd
deltas = pd.to_timedelta(['0 days 00:00:59',
'0 days 00:04:22',
'0 days 00:00:56',
'0 days 00:01:21',
'0 days 00:01:09',
'0 days 00:31:09',
'0 days 00:00:36'])
for e in deltas[deltas < pd.Timedelta(value=30, unit='m')]:
    print(e)
Output
0 days 00:00:59
0 days 00:04:22
0 days 00:00:56
0 days 00:01:21
0 days 00:01:09
0 days 00:00:36
Note that this filters out '0 days 00:31:09', as expected. The expression pd.Timedelta(value=30, unit='m') creates a time delta of 30 minutes.

Times read in as time deltas with large number of days in front

I'm working with a data set of my sleep for the past year or so. I've read the CSV into a pandas DataFrame. In it is a column called 'Duration'. I convert it into a Timedelta as follows:
df.Duration = pd.to_timedelta(df.Duration)
df.Duration.head()
Which outputs
0 17711 days 08:27:00
1 17711 days 07:56:00
2 17711 days 04:22:00
3 17711 days 07:29:00
4 17711 days 06:46:00
Name: Duration, dtype: timedelta64[ns]
I sort of understand why I get 17711 days in front of the hours, but all I really want is the hours. To solve this, I could write
df.Duration = (df.Duration - pd.Timedelta('17711 days'))
Which gives me
0 08:27:00
1 07:56:00
2 04:22:00
3 07:29:00
4 06:46:00
Name: Duration, dtype: timedelta64[ns]
However this is a pretty brittle method. Is there a better method of getting just the hours I want?
datetime.timedelta objects store days, seconds and microseconds as attributes. We can access them in a pandas.DataFrame through the .dt accessor:
Setting up some dummy data
import datetime as dt
import pandas as pd
df = pd.DataFrame(
    data=(
        dt.timedelta(days=17711, hours=i, minutes=i, seconds=i) for i in range(0, 10)
    ),
    columns=['Duration']
)
print(df['Duration'])
Duration
0 17711 days 00:00:00
1 17711 days 01:01:01
2 17711 days 02:02:02
3 17711 days 03:03:03
4 17711 days 04:04:04
5 17711 days 05:05:05
6 17711 days 06:06:06
7 17711 days 07:07:07
8 17711 days 08:08:08
9 17711 days 09:09:09
Name: Duration, dtype: timedelta64[ns]
Accessing seconds and turning them into hours
print(df['Duration'].dt.seconds / 3600)
0 0.000000
1 1.016944
2 2.033889
3 3.050833
4 4.067778
5 5.084722
6 6.101667
7 7.118611
8 8.135556
9 9.152500
Name: Duration, dtype: float64
Only hours
print(df['Duration'].dt.seconds // 3600)
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
Name: Duration, dtype: int64
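To keep the result as a timedelta rather than a number of hours, the whole-day part can be subtracted without hard-coding 17711. A minimal sketch, using three of the values from the question and an illustrative variable name time_only:

```python
import pandas as pd

duration = pd.Series(pd.to_timedelta(['17711 days 08:27:00',
                                      '17711 days 07:56:00',
                                      '17711 days 04:22:00']))

# .dt.days holds the whole-day count of each value; turn it back into a
# timedelta and subtract it, which leaves only the hours/minutes/seconds.
whole_days = pd.to_timedelta(duration.dt.days, unit='D')
time_only = duration - whole_days
print(time_only)
```

This works row by row, so it still behaves if the day offset differs between rows.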
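If the Duration column still holds strings (the .str accessor does not work on timedelta values), using split() with a regex should do what you're looking for, I think: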
df[['Days', 'Time']] = df['Duration'].str.split('.* days', expand=True)
This will split the column into two and then you can just call it using the "Time" key.
Code:
>>> import pandas as pd
>>> d = ['17711 days 08:27:00',
... '17711 days 07:56:00',
... '17711 days 04:22:00',
... '17711 days 07:29:00',
... '17711 days 06:46:00']
>>> df = pd.DataFrame({'Duration': d})
>>> df[['Days', 'Time']] = df['Duration'].str.split('.* days', expand=True)
>>> df.Time = pd.to_timedelta(df.Time)
>>> df.Time.head()
0 08:27:00
1 07:56:00
2 04:22:00
3 07:29:00
4 06:46:00
Name: Time, dtype: timedelta64[ns]

pandas resample timed events in DataFrame to precise time-bins

Maybe I just could not find it, but with pandas '0.19.2' I have the following problem:
I have some timed events of associated groups which can be generated by:
from numpy.random import randint, seed
import pandas as pd
seed(42) # reproducibility
samp_N = 1000
# create times within 3 hours, and 15 random groups
df = pd.DataFrame({'time': randint(0,3*60*60, samp_N),
'group': randint(0, 15, samp_N)})
# make a resample-able index from the seconds time values
df.set_index(pd.TimedeltaIndex(df.time, 's'), inplace=True)
which looks like:
group time
02:01:10 10 7270
00:14:20 13 860
01:29:50 9 5390
01:26:31 13 5191
...
When I try to resample the events, I get something undesirable
df.resample('5T').count()
group time
00:00:04 28 28
00:05:04 18 18
00:10:04 32 32
...
Unfortunately the resampling periods start at an arbitrary offset (the first value in the data).
It is even more annoying if I group this (as ultimately required):
df.groupby('group').resample('5T').count()
because then I get a new offset for each group.
What I want is sampling windows that start at precise times:
00:00:00 5 ...
00:05:00 17 ...
00:10:00 11 ...
...
There was a suggestion in https://stackoverflow.com/a/23966229:
df.groupby(pd.TimeGrouper('5Min')).count()
but it does not work either, as it also ruins the grouping required above.
Thanks for any hints!
Unfortunately I didn't come up with a nice solution, but rather a workaround. I added a dummy row with time value zero and then grouped by time and group (pd.concat is used here because DataFrame.append is no longer available in recent pandas versions):
dummy = pd.DataFrame({'time': [0], 'group': [-1]}, index=pd.to_timedelta([0], unit='s'))
df = pd.concat([dummy, df])
df = df.groupby([pd.Grouper(freq='5Min'), 'group']).count().reset_index('group')
df = df.loc[df['group']!=-1]
df.head()
group time
0 days 0 2
0 days 1 4
0 days 2 3
0 days 3 1
0 days 4 2
I am not sure this is the result you want:
result = df.groupby(['group', pd.Grouper(freq='5Min')]).count().reset_index(level=0)
result.head()
>>> group time
00:05:00 0 2
00:10:00 0 1
00:15:00 0 3
00:20:00 0 2
00:30:00 0 1
result.sort_index().head()
>>> group time
0 days 10 1
0 days 14 3
0 days 2 1
0 days 13 1
0 days 4 3
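Another way to get windows that start at exact multiples of five minutes, without a dummy row, is to floor the index and group on the result. This is a sketch on a small made-up sample (the values and the bin column name are illustrative), not the data from the question:

```python
import pandas as pd

# Small illustrative sample: event times in seconds and their group labels.
df = pd.DataFrame({'time': [4, 130, 310, 599, 605, 900],
                   'group': [0, 0, 1, 1, 0, 1]})
df.index = pd.to_timedelta(df['time'].to_numpy(), unit='s')

# Floor each timestamp to the start of its 5-minute window.
df['bin'] = df.index.floor('5min')

# Count events per (group, window).
counts = df.groupby(['group', 'bin'])['time'].count()
print(counts)
```

Windows without any events do not appear in the result; if they are needed, the counts would have to be reindexed against a full range of bins.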

Dividing a series containing datetime by a series containing an integer in Pandas

I have a series s1 of type datetime whose values represent the range between a start time and an end time; typical values are 7 days, 4 hours 5 mins, etc. I have a series s2 which contains integers for the number of events that happened in that time range.
I want to calculate the event frequency by:
event_freq = s1 / s2
I get the error:
cannot operate on a series with out a rhs of a series/ndarray of type datetime64[ns] or a timedelta
What's the best way to fix this?
Thanks in advance!
EXAMPLE of s1 is:
some_id
1 2012-09-02 09:18:40
3 2012-04-02 09:36:39
4 2012-02-02 09:58:02
5 2013-02-09 14:31:52
6 2012-01-09 12:59:20
EXAMPLE of s2 is:
some_id
1 3
3 1
4 1
5 2
6 1
8 1
10 3
12 2
This might possibly be a bug but what works is to operate on the underlying numpy array like so:
import pandas as pd
from pandas import Series
startdate = Series(pd.date_range('2013-01-01', '2013-01-03'))
enddate = Series(pd.date_range('2013-03-01', '2013-03-03'))
s1 = enddate - startdate
s2 = Series([2, 3, 4])
event_freq = Series(s1.values / s2)
Here are the Series:
>>> s1
0 59 days, 00:00:00
1 59 days, 00:00:00
2 59 days, 00:00:00
dtype: timedelta64[ns]
>>> s2
0 2
1 3
2 4
dtype: int64
>>> event_freq
0 29 days, 12:00:00
1 19 days, 16:00:00
2 14 days, 18:00:00
dtype: timedelta64[ns]
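In current pandas versions the workaround through .values should no longer be needed, as a timedelta series can be divided by an integer series directly. A short sketch with the same example data:

```python
import pandas as pd

startdate = pd.Series(pd.date_range('2013-01-01', '2013-01-03'))
enddate = pd.Series(pd.date_range('2013-03-01', '2013-03-03'))

# Subtracting two datetime series gives a timedelta series.
s1 = enddate - startdate
s2 = pd.Series([2, 3, 4])

# Element-wise division, aligned on the index of both series.
event_freq = s1 / s2
print(event_freq)
```

Because the division aligns on the index, labels present in only one of the two series produce missing values in the result.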
