I have a column with timedelta and I would like to create an extra column extracting the hour and minute from the timedelta column.
df
time_delta hour_minute
02:51:21.401000 2h:51min
03:10:32.401000 3h:10min
08:46:43.401000 08h:46min
This is what I have tried so far:
df['rh'] = df.time_delta.apply(lambda x: round(pd.Timedelta(x).total_seconds() \
% 86400.0 / 3600.0) )
Unfortunately, I'm not sure how to extract the minutes without including the hours.
Use Series.dt.components to get the hours and minutes, then join them together:
td = pd.to_timedelta(df.time_delta).dt.components
df['rh'] = (td.hours.astype(str).str.zfill(2) + 'h:' +
td.minutes.astype(str).str.zfill(2) + 'min')
print (df)
time_delta hour_minute rh
0 02:51:21.401000 2h:51min 02h:51min
1 03:10:32.401000 3h:10min 03h:10min
2 08:46:43.401000 08h:46min 08h:46min
If the hours can exceed 24, the days component must be included as well:
print (df)
time_delta hour_minute
0 02:51:21.401000 2h:51min
1 03:10:32.401000 3h:10min
2 28:46:43.401000 28h:46min
td = pd.to_timedelta(df.time_delta).dt.components
print (td)
days hours minutes seconds milliseconds microseconds nanoseconds
0 0 2 51 21 401 0 0
1 0 3 10 32 401 0 0
2 1 4 46 43 401 0 0
df['rh'] = ((td.days * 24 + td.hours).astype(str).str.zfill(2) + 'h:' +
td.minutes.astype(str).str.zfill(2) + 'min')
print (df)
time_delta hour_minute rh
0 02:51:21.401000 2h:51min 02h:51min
1 03:10:32.401000 3h:10min 03h:10min
2 28:46:43.401000 28h:46min 28h:46min
See also this post which defines the function
def strfdelta(tdelta, fmt):
    d = {"days": tdelta.days}
    d["hours"], rem = divmod(tdelta.seconds, 3600)
    d["minutes"], d["seconds"] = divmod(rem, 60)
    return fmt.format(**d)
Then, e.g.
strfdelta(pd.Timedelta('02:51:21.401000'), '{hours}h:{minutes}min')
gives '2h:51min'.
For your full dataframe
df['rh'] = df.time_delta.apply(lambda x: strfdelta(pd.Timedelta(x), '{hours}h:{minutes}min'))
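Note that tdelta.seconds only holds the remainder within a day, so the {hours} value above never exceeds 23. If you need zero-padded hours that also count whole days, a minimal sketch could look like this (hours_minutes is a hypothetical helper name, not part of pandas):

```python
import pandas as pd

def hours_minutes(td):
    # Hypothetical helper: total hours (days included) and leftover minutes.
    total_minutes = int(td.total_seconds()) // 60
    hours, minutes = divmod(total_minutes, 60)
    return f'{hours:02d}h:{minutes:02d}min'

s = pd.Series(['02:51:21.401000', '28:46:43.401000'])
result = pd.to_timedelta(s).apply(hours_minutes)
print(result.tolist())
```

Seconds and fractions of a second are truncated here rather than rounded.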
I want to exclude a period from my time series: from 2 a.m. until 6 a.m.
How can I do that?
Thank you for your help!
import numpy as np
import pandas as pd
start = pd.Timestamp("2022-10-03")
end = pd.Timestamp("2022-11-13")
N = 25
t = np.random.randint(start.value, end.value, N)
t -= t % 1000000000
start = pd.to_datetime(np.array(t, dtype="datetime64[ns]"))
duration = pd.to_timedelta(np.random.randint(100, 10000, N), unit="s")
df = pd.DataFrame({"start":start, "duration":duration})
df["end"] = df.start + df.duration
start duration end
0 2022-10-06 21:17:16 0 days 00:25:55 2022-10-06 21:43:11
1 2022-10-27 08:20:47 0 days 00:04:32 2022-10-27 08:25:19
2 2022-10-09 16:34:08 0 days 01:53:24 2022-10-09 18:27:32
3 2022-10-08 16:16:26 0 days 00:16:35 2022-10-08 16:33:01
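One possible reading of this question is "drop every row whose start falls between 02:00 and 06:00". Under that assumption, a minimal sketch is a boolean mask on the hour of the start column. The sample values below are made up for illustration, and in_window and kept are hypothetical names:

```python
import pandas as pd

# Small made-up sample; only the 'start' column matters for the filter.
df = pd.DataFrame({
    'start': pd.to_datetime([
        '2022-10-06 21:17:16',
        '2022-10-27 03:20:47',
        '2022-10-09 05:59:59',
        '2022-10-08 06:00:00',
    ]),
})

# Hours 2, 3, 4 and 5 cover the half-open window [02:00, 06:00).
in_window = df['start'].dt.hour.between(2, 5)
kept = df[~in_window]
print(kept)
```

If the goal is instead to drop rows whose whole start-to-end interval overlaps the window, the mask would also need to look at the end column.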
I have a dataframe with a column of time values stored as objects, and I need to convert them to datetime. The issue is that they are not all in the same format, so when I try:
df['Total call time'] = pd.to_datetime(df['Total call time'], format='%H:%M:%S')
it gives me an error
ValueError: time data '3:22' does not match format '%H:%M:%S' (match)
or if use this code
df['Total call time'] = pd.to_datetime(df['Total call time'], format='%H:%M')
I get this error
ValueError: unconverted data remains: :58
These are the values on my data
Total call time
2:04:07
3:22:41
2:30:41
2:19:06
1:45:55
1:30:08
1:32:15
1:43:28
**45:48**
1:41:40
5:08:37
**3:22**
4:29:05
2:47:25
2:39:29
2:29:32
2:09:52
3:31:57
2:27:58
2:34:28
3:14:10
2:12:10
2:46:58
times = """\
2:04:07
3:22:41
2:30:41
2:19:06
1:45:55
1:30:08
1:32:15
1:43:28
45:48
1:41:40
5:08:37
3:22
4:29:05
2:47:25
2:39:29
2:29:32
2:09:52
3:31:57
2:27:58
2:34:28
3:14:10
2:12:10
2:46:58""".split()
import pandas as pd
df = pd.DataFrame(times, columns=['elapsed'])
def pad(s):
    if len(s) == 4:
        return '00:0'+s
    elif len(s) == 5:
        return '00:'+s
    return s
print(pd.to_timedelta(df['elapsed'].apply(pad)))
Output:
0 0 days 02:04:07
1 0 days 03:22:41
2 0 days 02:30:41
3 0 days 02:19:06
4 0 days 01:45:55
5 0 days 01:30:08
6 0 days 01:32:15
7 0 days 01:43:28
8 0 days 00:45:48
9 0 days 01:41:40
10 0 days 05:08:37
11 0 days 00:03:22
12 0 days 04:29:05
13 0 days 02:47:25
14 0 days 02:39:29
15 0 days 02:29:32
16 0 days 02:09:52
17 0 days 03:31:57
18 0 days 02:27:58
19 0 days 02:34:28
20 0 days 03:14:10
21 0 days 02:12:10
22 0 days 02:46:58
Name: elapsed, dtype: timedelta64[ns]
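The pad function above depends on the exact string length. A variant that counts the colon-separated fields instead could be sketched as follows. It keeps the same assumption that two-field values mean MM:SS, and to_seconds is a hypothetical helper name:

```python
import pandas as pd

def to_seconds(s):
    # Split into integer fields and left-pad with zeros to (hours, minutes, seconds).
    parts = [int(p) for p in s.split(':')]
    parts = [0] * (3 - len(parts)) + parts
    hours, minutes, seconds = parts
    return hours * 3600 + minutes * 60 + seconds

elapsed = pd.Series(['2:04:07', '45:48', '3:22'])
td = pd.to_timedelta(elapsed.map(to_seconds), unit='s')
print(td)
```

Because the conversion goes through plain seconds, it does not rely on how to_timedelta interprets short strings.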
As an alternative to grovina's answer: instead of using apply, you can use the dt accessor directly.
Here's a sample:
>>> data = [['2017-12-01'], ['2017-12-30'], ['2018-01-01']]
>>> df = pd.DataFrame(data=data, columns=['date'])
>>> df
date
0 2017-12-01
1 2017-12-30
2 2018-01-01
>>> df.date
0 2017-12-01
1 2017-12-30
2 2018-01-01
Name: date, dtype: object
Note how df.date is an object? Let's turn it into a date like you want
>>> df.date = pd.to_datetime(df.date)
>>> df.date
0 2017-12-01
1 2017-12-30
2 2018-01-01
Name: date, dtype: datetime64[ns]
The format you want is for string formatting. I don't think you'll be able to convert the actual datetime64 to look like that format. For now, let's make a newly formatted string version of your date in a separate column
>>> df['new_formatted_date'] = df.date.dt.strftime('%d/%m/%y %H:%M')
>>> df.new_formatted_date
0 01/12/17 00:00
1 30/12/17 00:00
2 01/01/18 00:00
Name: new_formatted_date, dtype: object
Finally, since the df.date column is now of dtype datetime64, you can use the dt accessor directly on it. There is no need to use apply.
>>> df['month'] = df.date.dt.month
>>> df['day'] = df.date.dt.day
>>> df['year'] = df.date.dt.year
>>> df['hour'] = df.date.dt.hour
>>> df['minute'] = df.date.dt.minute
>>> df
        date new_formatted_date  month  day  year  hour  minute
0 2017-12-01     01/12/17 00:00     12    1  2017     0       0
1 2017-12-30     30/12/17 00:00     12   30  2017     0       0
2 2018-01-01     01/01/18 00:00      1    1  2018     0       0
Another idea is to check whether each value contains two colons. If it does not, look at the number before the first colon: if it is 23 or less, treat the value as HH:MM and append :00; if it is greater, treat it as MM:SS and prepend 00:. Then convert everything with to_timedelta:
m1 = df['Total call time'].str.count(':').ne(2)
m2 = df['Total call time'].str.extract(r'^(\d+):', expand=False).astype(float).gt(23)
s = np.select([m1 & m2, m1 & ~m2],
['00:' + df['Total call time'], df['Total call time']+ ':00'],
df['Total call time'] )
df['Total call time'] = pd.to_timedelta(s)
print (df)
Total call time
0 0 days 02:04:07
1 0 days 03:22:41
2 0 days 02:30:41
3 0 days 02:19:06
4 0 days 01:45:55
5 0 days 01:30:08
6 0 days 01:32:15
7 0 days 01:43:28
8 0 days 00:45:48
9 0 days 01:41:40
10 0 days 05:08:37
11 0 days 03:22:00
12 0 days 04:29:05
13 0 days 02:47:25
14 0 days 02:39:29
15 0 days 02:29:32
16 0 days 02:09:52
17 0 days 03:31:57
18 0 days 02:27:58
19 0 days 02:34:28
20 0 days 03:14:10
21 0 days 02:12:10
22 0 days 02:46:58
I have a dataframe like the one below
cust_id,purchase_date
1,10/01/1998
1,10/12/1999
2,13/05/2016
3,14/02/2018
3,15/03/2019
I would like to do the following:
a) display the output in text format, such as "5 years and 9 months", instead of 5.93244 etc.
I tried the following:
from datetime import timedelta
df['purchase_date'] = pd.to_datetime(df['purchase_date'])
gb = df_new.groupby(['unique_key'])
df_cust_age = gb['purchase_date'].agg(min_date=np.min, max_date=np.max).reset_index()
df_cust_age['diff_in_days'] = df_cust_age['max_date'] - df_cust_age['min_date']
df_cust_age['years_diff'] = df_cust_age['diff_in_days']/timedelta(days=365)
but the above code gives the output in decimal numbers.
I expect my output to look like this:
cust_id,years_diff
1, 1 years and 11 months and 0 day
2, 0 years
3, 1 year and 1 month and 1 day
If it is acceptable to treat a month as a fixed 30 days (and a year as 365 days), use this custom function:
#https://stackoverflow.com/a/13756038/2901002
def td_format(td_object):
    seconds = int(td_object.total_seconds())
    # Fixed-length periods: a year is 365 days and a month is 30 days.
    periods = [
        ('year', 60*60*24*365),
        ('month', 60*60*24*30),
        ('day', 60*60*24),
        ('hour', 60*60),
        ('minute', 60),
        ('second', 1)
    ]
    strings=[]
    for period_name, period_seconds in periods:
        # >= so that an exact multiple (e.g. exactly 1 day) uses the larger unit
        if seconds >= period_seconds:
            period_value , seconds = divmod(seconds, period_seconds)
            has_s = 's' if period_value > 1 else ''
            strings.append("%s %s%s" % (period_value, period_name, has_s))
    return ", ".join(strings) if len(strings) > 0 else '0 year'
df_cust_age['years_diff'] = df_cust_age['diff_in_days'].apply(td_format)
print (df_cust_age)
cust_id min_date max_date diff_in_days years_diff
0 1 1998-10-01 1999-10-12 376 days 1 year, 11 days
1 2 2016-05-13 2016-05-13 0 days 0 year
2 3 2018-02-14 2019-03-15 394 days 1 year, 29 days
from io import StringIO
import numpy as np
import pandas as pd
from dateutil.relativedelta import relativedelta as RD
string_data = '''unique_key,purchase_date
1,10/01/1998
1,10/12/1999
2,13/05/2016
3,14/02/2018
3,15/03/2019'''
## Custom functions
diff_obj = lambda d1,d2:RD(d1, d2) if d1>d2 else RD(d2, d1)
date_tuple = lambda diff:(diff.years,diff.months,diff.days)
pipeline = lambda row:date_tuple(diff_obj(row['min_date'],row['max_date']))
def string_format(date_tuple):
    final_string = []
    for val,name in zip(date_tuple,['years','months','day']):
        if val:
            final_string.append(f'{val} {name}')
    return ' and '.join(final_string) if final_string else '0 years'
## Custom functions
df = pd.read_csv(StringIO(string_data))
df['purchase_date'] = pd.to_datetime(df['purchase_date'],format='%d/%m/%Y')
gb = df.groupby(['unique_key'])
df_cust_age = gb['purchase_date'].agg(min_date=np.min, max_date=np.max).reset_index()
df_cust_age['years_diff'] = df_cust_age.apply(pipeline,axis=1).apply(string_format)
print(df_cust_age)
unique_key min_date max_date years_diff
0 1 1998-01-10 1999-12-10 1 years and 11 months
1 2 2016-05-13 2016-05-13 0 years
2 3 2018-02-14 2019-03-15 1 years and 1 months and 1 day
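The output above always prints "years" and "months", while the question asked for "1 year and 1 month and 1 day". A minimal sketch of a pluralizing variant, assuming dateutil is installed (humanize is a hypothetical helper name):

```python
from datetime import date
from dateutil.relativedelta import relativedelta

def humanize(d1, d2):
    # Order the dates so the difference is never negative.
    start, end = sorted((d1, d2))
    diff = relativedelta(end, start)
    parts = []
    for value, unit in ((diff.years, 'year'), (diff.months, 'month'), (diff.days, 'day')):
        if value:
            # Add a plural "s" for every value other than 1.
            parts.append(f"{value} {unit}{'s' if value != 1 else ''}")
    return ' and '.join(parts) if parts else '0 years'

print(humanize(date(2018, 2, 14), date(2019, 3, 15)))
```

It can be applied row by row to the min_date and max_date columns in the same way as pipeline above.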
Currently I am reading in a data frame with timestamps from film, in the format 00 (days):00 (hours, rolling over into days at 24):00 (min):00 (sec).
pandas reads the time formats HH:MM:SS and YYYY:MM:DD HH:MM:SS fine.
Is there a way to have pandas read a duration such as DD:HH:MM:SS?
Alternatively, using timedelta, how would I go about converting the DD into HH in the data frame so that pandas can make it "1 day HH:MM:SS", for example?
Data sample
00:00:00:00
00:07:33:57
02:07:02:13
00:00:13:11
00:00:10:11
00:00:00:00
00:06:20:06
01:12:13:25
Expected output for last sample
36:13:25
Thanks
If you want timedelta objects, a simple way is to replace the first colon with 'days ':
df['timedelta'] = pd.to_timedelta(df['col'].str.replace(':', 'days ', n=1))
output:
col timedelta
0 00:00:00:00 0 days 00:00:00
1 00:07:33:57 0 days 07:33:57
2 02:07:02:13 2 days 07:02:13
3 00:00:13:11 0 days 00:13:11
4 00:00:10:11 0 days 00:10:11
5 00:00:00:00 0 days 00:00:00
6 00:06:20:06 0 days 06:20:06
7 01:12:13:25 1 days 12:13:25
>>> df.dtypes
col object
timedelta timedelta64[ns]
dtype: object
From there it's also relatively easy to combine the days and hours as string:
c = df['timedelta'].dt.components
df['str_format'] = ((c['hours']+c['days']*24).astype(str)
+df['col'].str.split('(?=:)', n=2).str[-1]).str.zfill(8)
output:
col timedelta str_format
0 00:00:00:00 0 days 00:00:00 00:00:00
1 00:07:33:57 0 days 07:33:57 07:33:57
2 02:07:02:13 2 days 07:02:13 55:02:13
3 00:00:13:11 0 days 00:13:11 00:13:11
4 00:00:10:11 0 days 00:10:11 00:10:11
5 00:00:00:00 0 days 00:00:00 00:00:00
6 00:06:20:06 0 days 06:20:06 06:20:06
7 01:12:13:25 1 days 12:13:25 36:13:25
Convert the days separately, add them to the times, and finally apply a custom function:
def f(x):
    ts = x.total_seconds()
    hours, remainder = divmod(ts, 3600)
    minutes, seconds = divmod(remainder, 60)
    return ('{}:{:02d}:{:02d}').format(int(hours), int(minutes), int(seconds))
d = pd.to_timedelta(df['col'].str[:2].astype(int), unit='d')
td = pd.to_timedelta(df['col'].str[3:])
df['col'] = d.add(td).apply(f)
print (df)
col
0 0:00:00
1 7:33:57
2 55:02:13
3 0:13:11
4 0:10:11
5 0:00:00
6 6:20:06
7 36:13:25
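If only the HH:MM:SS string is needed, the timedelta step can be skipped. A rough sketch, assuming every value has exactly four integer fields (parts, total_hours and out are illustrative names):

```python
import pandas as pd

col = pd.Series(['00:07:33:57', '02:07:02:13', '01:12:13:25'])

# Split DD:HH:MM:SS into four integer columns.
parts = col.str.split(':', expand=True).astype(int)
parts.columns = ['days', 'hours', 'minutes', 'seconds']

# Fold the days into the hours, then rebuild a zero-padded string.
total_hours = parts['days'] * 24 + parts['hours']
out = (total_hours.astype(str).str.zfill(2) + ':'
       + parts['minutes'].astype(str).str.zfill(2) + ':'
       + parts['seconds'].astype(str).str.zfill(2))
print(out.tolist())
```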
I have a Data Frame that looks like this:
df
Date Hr CO2_resp
0 5/1/02 600 0.000889
1 5/2/02 600 0.000984
2 5/4/02 900 0.000912
How would I go about creating a column Ind that represents a number index of hours elapsed since midnight 5/1/02? Such that the column would read
df
Date Hr Ind CO2_resp
0 5/1/02 600 6 0.000889
1 5/2/02 600 30 0.000984
2 5/4/02 800 80 0.000912
Thanks.
You can use to_datetime with to_timedelta. Then convert the timedelta to hours by dividing by np.timedelta64(1, 'h'), and finally, if the output is always a whole number, cast it with astype:
#convert column Date to datetime
df['Date'] = pd.to_datetime(df.Date)
df['Ind'] = ((df.Date
- pd.to_datetime('2002-05-01')
+ pd.to_timedelta(df.Hr / 100, unit='h')) / np.timedelta64(1, 'h')).astype(int)
print (df)
Date Hr CO2_resp Ind
0 2002-05-01 600 0.000889 6
1 2002-05-02 600 0.000984 30
2 2002-05-04 900 0.000912 81
Without dividing the Hr column by 100, the output is different:
df['Ind'] = ((df.Date
- pd.to_datetime('2002-05-01')
+ pd.to_timedelta(df.Hr,unit='h')) / np.timedelta64(1, 'h')).astype(int)
print (df)
Date Hr CO2_resp Ind
0 2002-05-01 600 0.000889 600
1 2002-05-02 600 0.000984 624
2 2002-05-04 900 0.000912 972
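Dividing Hr by 100 works for whole hours, but if Hr is an HHMM value that can carry minutes (for example 630 for 06:30, which is an assumption about the data), 630 / 100 would give 6.3 hours rather than 6.5. A hedged sketch that splits hours and minutes explicitly (offset and elapsed are illustrative names):

```python
import pandas as pd

df = pd.DataFrame({'Date': ['5/1/02', '5/2/02', '5/4/02'],
                   'Hr': [600, 600, 900]})

date = pd.to_datetime(df['Date'], format='%m/%d/%y')

# Assumption: Hr is HHMM, so 630 would mean 06:30.
offset = (pd.to_timedelta(df['Hr'] // 100, unit='h')
          + pd.to_timedelta(df['Hr'] % 100, unit='m'))

elapsed = date - pd.Timestamp('2002-05-01') + offset
df['Ind'] = elapsed.dt.total_seconds() / 3600  # hours as float
print(df)
```

Here Ind is left as a float so that fractional hours survive; cast it with astype(int) if only whole hours occur.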
Assuming that the Date is a string, and Hr is an integer, you could apply a function to parse the Date, get the hours (days * 24) from the timedelta with your reference date, and add the hours.
Something like this -
df.apply(lambda x:
(datetime.datetime.strptime(x['Date'], '%m/%d/%y')
- datetime.datetime.strptime('5/1/02', '%m/%d/%y')).days
* 24 + x['Hr'] / 100,
axis=1)