Convert string to timedelta in pandas - python

I have a series where the timestamp is in the format HHHHH:MM:
timestamp = pd.Series(['34:23', '125:26', '15234:52'], index=index)
I would like to convert it to a timedelta series.
For now I manage to do that on a single string:
str[:-3]
str[-2:]
timedelta(hours=int(str[:-3]),minutes=int(str[-2:]))
I would like to apply it to the whole series, if possible in a cleaner way. Is there a way to do this?

You can use column-wise Pandas methods:
s = pd.Series(['34:23','125:26','15234:52'])
v = s.str.split(':', expand=True).astype(int)
s = pd.to_timedelta(v[0], unit='h') + pd.to_timedelta(v[1], unit='m')
print(s)
0 1 days 10:23:00
1 5 days 05:26:00
2 634 days 18:52:00
dtype: timedelta64[ns]
As pointed out in the comments, this can also be done in one line, albeit less readably:
s = pd.to_timedelta((s.str.split(':', expand=True).astype(int) * (60, 1)).sum(axis=1), unit='min')
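To make the one-liner easier to follow, here is a minimal sketch that spells out the same computation step by step. The names parts, total_minutes and td are only illustrative and do not come from the answer above.

```python
import pandas as pd

s = pd.Series(['34:23', '125:26', '15234:52'])

# Column 0 holds the hours, column 1 the minutes, both as integers.
parts = s.str.split(':', expand=True).astype(int)

# Collapse both columns into a single count of minutes per row.
total_minutes = parts[0] * 60 + parts[1]

# Tell pandas that the numbers are minutes.
td = pd.to_timedelta(total_minutes, unit='min')
print(td)
```

The intermediate total_minutes series is handy for checking the arithmetic before converting.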

This is how I would do it:
timestamp = pd.Series(['34:23','125:26','15234:52'])
x = timestamp.str.split(":").apply(lambda x: int(x[0])*60 + int(x[1]))
# x holds total minutes, so the unit must be minutes, not seconds
timestamp = pd.to_timedelta(x, unit='min')

Parse the delta in seconds as an argument to pd.to_timedelta like this,
In [1]: import pandas as pd
In [2]: ts = pd.Series(['34:23','125:26','15234:52'])
In [3]: secs = 60 * ts.apply(lambda x: 60*int(x[:-3]) + int(x[-2:]))
In [4]: pd.to_timedelta(secs, 's')
Out[4]:
0 1 days 10:23:00
1 5 days 05:26:00
2 634 days 18:52:00
dtype: timedelta64[ns]
Edit: I missed erncyp's answer, which works as well as long as the unit passed to pd.to_timedelta matches the value: the lambda there yields total minutes, so it needs unit='min' (or a multiplication by 60 if you keep unit='s').

You can use pandas.Series.apply, i.e.:
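import pandas as pd
from datetime import timedelta
def convert(args):
    # everything before ':MM' is the hours; the last two characters are the minutes
    return timedelta(hours=int(args[:-3]), minutes=int(args[-2:]))
s = pd.Series(['34:23','125:26','15234:52'])
s = s.apply(convert)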

Related

Converting a Pandas series of strings in '%d:%H:%M:%S' format into datetime format

I have a Pandas series which consists of strings of '169:21:5:24', '54:9:19:29', and so on which stand for 169 days 21 hours 5 minutes 24 seconds and 54 days 9 hours 19 minutes 29 seconds, respectively.
I want to convert them to datetime object (preferable) or just integers of seconds.
The first try was
pd.to_datetime(series1, format = '%d:%H:%M:%S')
which failed with an error message
time data '169:21:5:24' does not match format '%d:%H:%M:%S' (match)
The second try
pd.to_datetime(series1)
also failed with
expected hh:mm:ss format
The first try seems to work if all the 'days' are less than 30 or 31 days, but my data includes 150 days, 250 days etc and with no month value.
Finally,
temp_list1 = [[int(subitem) for subitem in item.split(":")] for item in series1]
temp_list2 = [item[0] * 24 * 3600 + item[1] * 3600 + item[2] * 60 + item[3] for item in temp_list1]
successfully converted the Series into a list of seconds, but this is lengthy.
I wonder if there is a Pandas.Series.dt or datetime methods that can deal with such type of data.
I want to convert them to datetime object (preferable) or just integers of seconds.
It seems to me like you are rather looking for a timedelta because it's unclear what the year should be?
You could do that, for example, like this (with ser being your series):
ser = pd.Series(["169:21:5:24", "54:9:19:29"])
timedeltas = ser.str.split(":", n=1, expand=True).assign(td=lambda df:
    pd.to_timedelta(df[0].astype("int"), unit="D") + pd.to_timedelta(df[1])
)["td"]
seconds = timedeltas.dt.total_seconds().astype("int")
datetimes = pd.Timestamp("2022") + timedeltas # year has to be provided
Result:
timedeltas:
0 169 days 21:05:24
1 54 days 09:19:29
Name: td, dtype: timedelta64[ns]
seconds:
0 14677524
1 4699169
Name: td, dtype: int64
datetimes:
0 2022-06-19 21:05:24
1 2022-02-24 09:19:29
Name: td, dtype: datetime64[ns]
[PyData.Pandas]: pandas.to_datetime uses (and points to) [Python.Docs]: datetime - strftime() and strptime() Behavior which states (emphasis is mine):
%d - Day of the month as a zero-padded decimal number.
...
%j - Day of the year as a zero-padded decimal number.
So, you're using the wrong directive (correct one is %j):
>>> import pandas as pd
>>>
>>> pd.to_datetime("169:21:5:24", format="%j:%H:%M:%S")
Timestamp('1900-06-18 21:05:24')
As seen, the reference year is 1900 (as specified in the 2nd URL). If you want to use the current year, a bit of extra processing is required:
>>> import datetime
>>>
>>> cur_year_str = "{:04d}:".format(datetime.datetime.today().year)
>>> cur_year_str
'2023:'
>>>
>>> pd.to_datetime(cur_year_str + "169:21:5:24", format="%Y:%j:%H:%M:%S")
Timestamp('2023-06-18 21:05:24')
>>>
>>> # Quick leap year test
>>> pd.to_datetime("2020:169:21:5:24", format="%Y:%j:%H:%M:%S")
Timestamp('2020-06-17 21:05:24')
All in all:
>>> series = pd.Series(("169:21:5:24", "54:9:19:29"))
>>> pd.to_datetime(cur_year_str + series, format="%Y:%j:%H:%M:%S")
0 2023-06-18 21:05:24
1 2023-02-23 09:19:29
dtype: datetime64[ns]
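One thing to keep in mind when choosing between the two answers: %j is a 1-based day of the year (and cannot exceed 366), whereas a timedelta counts elapsed days. The sketch below uses only the standard library to show that the two readings of the same string differ by one day; the year 2023 and the variable names are assumptions made for illustration.

```python
from datetime import datetime, timedelta

raw = "169:21:5:24"

# Reading 1: the first field is the day of the year (day 1 is January 1st).
as_day_of_year = datetime.strptime("2023:" + raw, "%Y:%j:%H:%M:%S")

# Reading 2: the first field is a number of elapsed days since January 1st.
d, h, m, s = (int(p) for p in raw.split(":"))
as_elapsed = datetime(2023, 1, 1) + timedelta(days=d, hours=h, minutes=m, seconds=s)

gap = as_elapsed - as_day_of_year
print(as_day_of_year, as_elapsed, gap)
```

Which reading is right depends on what the data means; for durations such as "169 days 21 hours", the timedelta reading is the natural one.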

How do I find the median in the DataFrame column?

df['diff']
23:59:01
23:59:13
23:59:17
23:59:27
23:59:52
The hh-mm-ss data is obtained after calculating the difference between sessions via Timedelta.
I converted the time into seconds and found the median. How do I find the median in hh-mm-ss format?
The diff column needs to be converted to numerical seconds.
import pandas as pd
def time2sec(t):
    (h, m, s) = t.split(':')
    return int(h) * 3600 + int(m) * 60 + int(s)
df = pd.DataFrame(['23:59:01','23:59:13','23:59:17','23:59:27','23:59:52'],columns=['diff'])
df['diff_sec'] = df['diff'].map(time2sec)
print(df)
median = df['diff_sec'].median()
print('median :',median)
diff diff_sec
0 23:59:01 86341
1 23:59:13 86353
2 23:59:17 86357
3 23:59:27 86367
4 23:59:52 86392
median : 86357.0
If your data is already in Timedelta format as you mentioned, you can just use df.median() to get the median of the series.
You can try:
pd.to_timedelta(df['diff']).median()
pd.to_timedelta converts the time strings to Timedelta values. Then we can use Series.median() to get the median.
Result:
Timedelta('0 days 23:59:17')
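If the median is needed as a plain hh:mm:ss string rather than a Timedelta, it can be formatted by hand. This is a minimal sketch assuming the values are non-negative; the names total, rem and label are illustrative.

```python
import pandas as pd

diff = pd.Series(['23:59:01', '23:59:13', '23:59:17', '23:59:27', '23:59:52'])

# Median as a Timedelta.
median = pd.to_timedelta(diff).median()

# Break the total number of seconds into hours, minutes and seconds.
total = int(median.total_seconds())
hours, rem = divmod(total, 3600)
minutes, seconds = divmod(rem, 60)

label = f"{hours:02d}:{minutes:02d}:{seconds:02d}"
print(label)
```

Because the hours are computed from the total number of seconds, durations longer than a day show up as, for example, 25:10:00 instead of wrapping around.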

Dataframe - mean of string type column with time values

I have to calculate mean() of time column, but this column type is string, how can I do it?
id time
1 1h:2m
2 1h:58m
3 35m
4 2h
...
You can use a regex to extract the hours and minutes. To calculate the mean time in minutes:
h = df['time'].str.extract(r'(\d{1,2})h').fillna(0).astype(int)
m = df['time'].str.extract(r'(\d{1,2})m').fillna(0).astype(int)
(h * 60 + m).mean()
Result:
0 83.75
dtype: float64
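The same extraction can also feed pd.to_timedelta, so that the mean comes out as a duration instead of a float. A minimal sketch, assuming every value uses only the h and m suffixes shown in the question; expand=False is used so that each extract returns a Series, and the name durations is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'time': ['1h:2m', '1h:58m', '35m', '2h']})

# Missing components (no 'h' or no 'm' in the string) count as zero.
h = df['time'].str.extract(r'(\d+)h', expand=False).fillna('0').astype(int)
m = df['time'].str.extract(r'(\d+)m', expand=False).fillna('0').astype(int)

# Total minutes per row, converted to timedeltas.
durations = pd.to_timedelta(h * 60 + m, unit='min')
mean_duration = durations.mean()
print(mean_duration)
```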
It's largely inspired from How to construct a timedelta object from a simple string, but you can do as below:
import re
from datetime import timedelta
import pandas as pd
def convertToSecond(time_str):
    # hours, minutes and seconds are each optional in the input string
    regex = re.compile(r'((?P<hours>\d+?)h)?:*((?P<minutes>\d+?)m)?:*((?P<seconds>\d+?)s)?')
    parts = regex.match(time_str)
    if not parts:
        return
    parts = parts.groupdict()
    time_params = {}
    for (name, param) in parts.items():
        if param:
            time_params[name] = int(param)
    return timedelta(**time_params).total_seconds()
df = pd.DataFrame({
    'time': ['1h:2m', '1h:58m','35m','2h'],})
df['inSecond'] = df['time'].apply(convertToSecond)
mean_inSecond = df['inSecond'].mean()
print(f"Mean of Time Column: {timedelta(seconds=mean_inSecond)}")
Result:
Mean of Time Column: 1:23:45
Another possibility is to convert your string column into timedelta (since they don't seem to be times but rather durations?).
Since your strings are not all formatted the same way, you unfortunately cannot use pandas' to_timedelta function directly. However, the parser from dateutil has a fuzzy option that you can use to convert your column to datetime. If you subtract today's midnight from that, you get the value as a timedelta.
import pandas as pd
from dateutil import parser
from datetime import date
from datetime import datetime
df = pd.DataFrame([[1,'1h:2m'],[2,'1h:58m'],[3,'35m'],[4,'2h']],columns=['id','time'])
today = date.today()
midnight = datetime.combine(today, datetime.min.time())
df['time'] = df['time'].apply(lambda x: (parser.parse(x, fuzzy=True)) - midnight)
This will convert your dataframe like this (print(df)):
id time
0 1 01:02:00
1 2 01:58:00
2 3 00:35:00
3 4 02:00:00
from which you can calculate the mean using print(df['time'].mean()):
0 days 01:23:45
Full example: https://ideone.com/Aze9mR

Why can't I add or subtract two datetimes?

I have a df with the time and with the milliseconds in another column, like this:
Time ms
0 14:11:52 0
1 4:11:52 250
1 4:11:52 500
1 4:11:52 750
I want to add the milliseconds to the time like this:
Time
0 14:11:52
1 4:11:52:250
1 4:11:52:500
1 4:11:52:750
I tried converting both to datetime[ns] and [D] but I get the following error: cannot add DatetimeArray and DatetimeArray
df['Time'] = pd.to_datetime(df['Time'], format='%H:%M:%S')
df['ms'] = pd.to_datetime(df['ms'], format='%f')
df['Time'] = df['Time'] + df['ms']
I think it is possible to achieve what I want with a timedelta, but is there a cleaner way to just add one date column to another one?
IIUC, you need two to_timedelta calls:
pd.to_timedelta(df.Time)+pd.to_timedelta(df.ms,unit='ms')
Out[72]:
0 14:11:52
1 04:11:52.250000
1 04:11:52.500000
1 04:11:52.750000
dtype: timedelta64[ns]
df['Time']=pd.to_timedelta(df.Time)+pd.to_timedelta(df.ms,unit='ms')
Pandas' time arithmetic follows a few simple rules:
datetime - datetime = timedelta
datetime + timedelta = datetime
timedelta + timedelta = timedelta
Adding two datetimes is not defined, which is why the attempt in the question raises an error.
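These rules are easy to check on scalar values. The sketch below uses made-up timestamps purely for illustration and records whether adding two timestamps succeeds.

```python
import pandas as pd

start = pd.Timestamp('2022-01-01 14:11:52')
end = pd.Timestamp('2022-01-01 14:11:53')

# datetime - datetime gives a timedelta.
delta = end - start

# datetime + timedelta gives a datetime.
shifted = start + pd.Timedelta(milliseconds=250)

# datetime + datetime is not a defined operation.
try:
    start + end
    added = True
except TypeError:
    added = False

print(delta, shifted, added)
```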

datetime conversion ValueError Pandas

I have a dataset with wrong times (24:00:00 to 26:18:00), and I wanted to know the best approach to deal with this kind of data in Python.
I tried to convert the column from object to datetime using this code:
stopTimeArrDep['departure_time'] = pd.to_datetime(stopTimeArrDep['departure_time']\
,format='%H:%M:%S')
But I get this error:
ValueError: time data '24:04:00' does not match format '%H:%M:%S' (match)
So I tried adding errors='coerce' to avoid this error. But I end up with empty columns and unwanted date added to every row.
stopTimeArrDep['departure_time'] = pd.to_datetime(stopTimeArrDep['departure_time']\
,format='%H:%M:%S',errors='coerce')
output sample:
original_col converted_col
23:45:00 1/1/00 23:45:00
23:51:00 1/1/00 23:51:00
24:04:00
23:42:00 1/1/00 23:42:00
26:01:00
Any suggestion on what is the best approach to handle this issue. Thank you,
Solution
You could treat the original_col as some elapsed time interval and not time, if that makes any sense. You could use datetime.timedelta and then add this datetime.timedelta to a datetime.datetime to get some datetime object; which you could finally use to get the date and time separately.
Example
from datetime import datetime, timedelta
time_string = "20:30:20"
t = datetime.utcnow()
print('t: {}'.format(t))
HH, MM, SS = [int(x) for x in time_string.split(':')]
dt = timedelta(hours=HH, minutes=MM, seconds=SS)
print('dt: {}'.format(dt))
t2 = t + dt
print('t2: {}'.format(t2))
print('t2.date: {} | t2.time: {}'.format(str(t2.date()), str(t2.time()).split('.')[0]))
Output:
t: 2019-10-24 04:43:08.255027
dt: 20:30:20
t2: 2019-10-25 01:13:28.255027
t2.date: 2019-10-25 | t2.time: 01:13:28
For Your Usecase
import pandas as pd
# Define Custom Function
def process_row(time_string):
    HH, MM, SS = [int(x) for x in time_string.split(':')]
    dt = timedelta(hours=HH, minutes=MM, seconds=SS)
    return dt
# Make Dummy Data
original_col = ["23:45:00", "23:51:00", "24:04:00", "23:42:00", "26:01:00"]
df = pd.DataFrame({'original_col': original_col, 'dt': None})
# Process Dataframe
df['dt'] = df.apply(lambda x: process_row(x['original_col']), axis=1)
df['t'] = datetime.utcnow()
df['t2'] = df['dt'] + df['t']
# extracting date from timestamp
df['Date'] = [datetime.date(d) for d in df['t2']]
# extracting time from timestamp
df['Time'] = [datetime.time(d) for d in df['t2']]
df
Using pandas.to_datetime():
pd.to_datetime(df['t2'], format='%H:%M:%S',errors='coerce')
Output:
0 2019-10-25 09:38:39.349410
1 2019-10-25 09:44:39.349410
2 2019-10-25 09:57:39.349410
3 2019-10-25 09:35:39.349410
4 2019-10-25 11:54:39.349410
Name: t2, dtype: datetime64[ns]
References
How to construct a timedelta object from a simple string
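A shorter variant of the same idea, as a sketch: pd.to_timedelta accepts HH:MM:SS strings whose hour field is 24 or more, so the column can be converted without a custom function. This assumes the times are offsets from the start of a known day; the service_day value below is a hypothetical date chosen for illustration.

```python
import pandas as pd

departure = pd.Series(["23:45:00", "24:04:00", "26:01:00"])

# Hours of 24 and above simply roll over into days.
offsets = pd.to_timedelta(departure)

# Hypothetical reference day that the offsets are measured from.
service_day = pd.Timestamp("2019-10-24")
departure_dt = service_day + offsets

print(departure_dt)
```

The date and time parts can then be read with departure_dt.dt.date and departure_dt.dt.time.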
