I have a dataset with out-of-range times (24:00:00 to 26:18:00), and I would like to know the best approach for dealing with this kind of data in Python.
I tried to convert the column from object to datetime using this code:
stopTimeArrDep['departure_time'] = pd.to_datetime(stopTimeArrDep['departure_time']\
,format='%H:%M:%S')
But I get this error:
ValueError: time data '24:04:00' does not match format '%H:%M:%S' (match)
So I tried adding errors='coerce' to avoid this error, but I end up with empty cells for the out-of-range values and an unwanted date added to every other row.
stopTimeArrDep['departure_time'] = pd.to_datetime(stopTimeArrDep['departure_time']\
,format='%H:%M:%S',errors='coerce')
output sample:
original_col converted_col
23:45:00 1/1/00 23:45:00
23:51:00 1/1/00 23:51:00
24:04:00
23:42:00 1/1/00 23:42:00
26:01:00
Any suggestions on the best approach to handle this issue? Thank you.
Solution
You could treat original_col as an elapsed time interval rather than a time of day. Build a datetime.timedelta from each value and add it to a datetime.datetime; the resulting datetime object can then be split into its date and time parts.
Example
from datetime import datetime, timedelta
time_string = "20:30:20"
t = datetime.utcnow()
print('t: {}'.format(t))
HH, MM, SS = [int(x) for x in time_string.split(':')]
dt = timedelta(hours=HH, minutes=MM, seconds=SS)
print('dt: {}'.format(dt))
t2 = t + dt
print('t2: {}'.format(t2))
print('t2.date: {} | t2.time: {}'.format(str(t2.date()), str(t2.time()).split('.')[0]))
Output:
t: 2019-10-24 04:43:08.255027
dt: 20:30:20
t2: 2019-10-25 01:13:28.255027
t2.date: 2019-10-25 | t2.time: 01:13:28
For Your Usecase
# Imports
import pandas as pd
from datetime import datetime, timedelta
# Define Custom Function
def process_row(time_string):
    # Split "HH:MM:SS" and build a duration; hours may exceed 23
    HH, MM, SS = [int(x) for x in time_string.split(':')]
    dt = timedelta(hours=HH, minutes=MM, seconds=SS)
    return dt
# Make Dummy Data
original_col = ["23:45:00", "23:51:00", "24:04:00", "23:42:00", "26:01:00"]
df = pd.DataFrame({'original_col': original_col, 'dt': None})
# Process Dataframe
df['dt'] = df['original_col'].apply(process_row)
df['t'] = datetime.utcnow()
df['t2'] = df['dt'] + df['t']
# extracting date from timestamp
df['Date'] = [datetime.date(d) for d in df['t2']]
# extracting time from timestamp
df['Time'] = [datetime.time(d) for d in df['t2']]
df
Output:
Using pandas.to_datetime():
pd.to_datetime(df['t2'], format='%H:%M:%S',errors='coerce')
Output:
0 2019-10-25 09:38:39.349410
1 2019-10-25 09:44:39.349410
2 2019-10-25 09:57:39.349410
3 2019-10-25 09:35:39.349410
4 2019-10-25 11:54:39.349410
Name: t2, dtype: datetime64[ns]
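As a vectorized alternative to the row-wise function above, here is a minimal sketch that lets pandas parse the strings as durations directly. It assumes the values are offsets from midnight of a single day; the service_date value is hypothetical and only there for illustration.

```python
import pandas as pd

# Sample values copied from the question.
df = pd.DataFrame(
    {"departure_time": ["23:45:00", "23:51:00", "24:04:00", "23:42:00", "26:01:00"]}
)

# Parse as durations since midnight; hour values above 23 are accepted here,
# unlike with pd.to_datetime and format='%H:%M:%S'.
df["elapsed"] = pd.to_timedelta(df["departure_time"])

# Hypothetical reference date (an assumption, not taken from the question).
service_date = pd.Timestamp("2019-10-24")
df["departure_dt"] = service_date + df["elapsed"]

print(df)
```

Rows such as 24:04:00 then land on the following calendar day instead of being coerced to NaT.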
References
How to construct a timedelta object from a simple string
Related
from datetime import datetime
start='12:25:03'
format = '%H:%M:%S'
startDateTime = datetime.strptime(start, format)
end='12:30:40'
endDateTime = datetime.strptime(end, format)
diff = endDateTime - startDateTime
print(diff)
0:05:37
The code above works fine, but when I apply the same logic to an entire column with a lambda function, the result comes out in a different format. I would like the values of the Diff column in hh:mm:ss format.
t1 - Object type
t2 - Object type
Diff - timedelta64[ns] type
df["Diff"] = df.apply(lambda x: datetime.strptime(x["t1"], format) - datetime.strptime(x["t2"], format), axis = 1)
df.head()
t1 t2 Diff
0 01:27:19 01:28:58 -1 days +23:58:21
1 01:49:57 01:50:40 -1 days +23:59:17
2 03:35:24 03:36:14 -1 days +23:59:10
related: Format timedelta to string
You can write your own formatter for the timedelta objects, e.g.
def formatTimedelta(td):
    """
    format a timedelta object to string, in HH:MM:SS format (seconds floored).
    negative timedeltas will be prefixed with a minus, '-'.
    """
    total = td.total_seconds()
    prefix, total = ('-', total*-1) if total < 0 else ('', total)
    h, r = divmod(total, 3600)
    m, s = divmod(r, 60)
    return f"{prefix}{int(h):02d}:{int(m):02d}:{int(s):02d}"
which would give you for the example df
df
t1 t2
0 01:27:19 01:28:58
1 01:49:57 01:50:40
2 03:35:24 03:36:14
# to datetime
df['t1'] = pd.to_datetime(df['t1'])
df['t2'] = pd.to_datetime(df['t2'])
# calculate timedeltas and format
df['diff0'] = (df['t1']-df['t2']).apply(formatTimedelta)
df['diff1'] = (df['t2']-df['t1']).apply(formatTimedelta)
df['diff0']
0 -00:01:39
1 -00:00:43
2 -00:00:50
Name: diff0, dtype: object
df['diff1']
0 00:01:39
1 00:00:43
2 00:00:50
Name: diff1, dtype: object
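To see why the question's Diff column looked odd, the following sketch compares the default string form of a negative timedelta with a sign-prefixed HH:MM:SS form. The helper name format_timedelta is only illustrative; it follows the same logic as the formatter above and uses the standard library only.

```python
from datetime import timedelta


def format_timedelta(td):
    """Format a timedelta as [-]HH:MM:SS (seconds floored, hours not wrapped at 24)."""
    total = td.total_seconds()
    prefix, total = ("-", -total) if total < 0 else ("", total)
    h, r = divmod(total, 3600)
    m, s = divmod(r, 60)
    return f"{prefix}{int(h):02d}:{int(m):02d}:{int(s):02d}"


# 01:27:19 minus 01:28:58 is -99 seconds.
negative = timedelta(seconds=-99)
print(str(negative))               # default form: -1 day, 23:58:21
print(format_timedelta(negative))  # -00:01:39

# Durations longer than a day keep accumulating hours.
long_span = timedelta(days=1, hours=2)
print(format_timedelta(long_span))  # 26:00:00
```

The default form normalizes a negative duration to a negative day count plus a positive remainder, which is where "-1 days +23:58:21" comes from.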
I have to calculate the mean() of a time column, but the column's type is string. How can I do it?
id time
1 1h:2m
2 1h:58m
3 35m
4 2h
...
You can use regex to extract hours and minutes. To calculate the mean time in minutes:
h = df['time'].str.extract(r'(\d{1,2})h').fillna(0).astype(int)
m = df['time'].str.extract(r'(\d{1,2})m').fillna(0).astype(int)
(h * 60 + m).mean()
Result:
0 83.75
dtype: float64
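If the mean is wanted as a duration rather than a number of minutes, the same idea can be sketched with a single extract call using named groups. The group names h and m and the variable names are illustrative choices, and the pattern assumes the strings only ever contain an optional hours part followed by an optional minutes part.

```python
import pandas as pd

df = pd.DataFrame({"time": ["1h:2m", "1h:58m", "35m", "2h"]})

# One pass with named groups; a part that is absent comes back as NaN.
parts = df["time"].str.extract(r"(?:(?P<h>\d+)h)?:?(?:(?P<m>\d+)m)?")
parts = parts.apply(pd.to_numeric).fillna(0)

# Total minutes per row, then converted to timedeltas.
total_minutes = parts["h"] * 60 + parts["m"]
durations = pd.to_timedelta(total_minutes, unit="min")

mean_duration = durations.mean()
print(mean_duration)
```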
It is largely inspired by How to construct a timedelta object from a simple string, but you can do it as below:
import re
from datetime import timedelta
import pandas as pd
def convertToSecond(time_str):
    # Each of the hours, minutes and seconds parts is optional
    regex = re.compile(r'((?P<hours>\d+?)h)?:*((?P<minutes>\d+?)m)?:*((?P<seconds>\d+?)s)?')
    parts = regex.match(time_str)
    if not parts:
        return
    parts = parts.groupdict()
    time_params = {}
    for (name, param) in parts.items():
        if param:
            time_params[name] = int(param)
    return timedelta(**time_params).total_seconds()
df = pd.DataFrame({
    'time': ['1h:2m', '1h:58m','35m','2h'],})
df['inSecond']=df['time'].apply(convertToSecond)
mean_inSecond=df['inSecond'].mean()
print(f"Mean of Time Column: {timedelta(seconds=mean_inSecond)}")
Result:
Mean of Time Column: 1:23:45
Another possibility is to convert your string column into timedelta, since the values look like durations rather than times of day.
Since your strings are not all formatted the same way, you unfortunately cannot use pandas' to_timedelta function on them as they are. However, parser from dateutil has a fuzzy option that you can use to convert your column to datetime. If you subtract today's midnight from that, you get the value as a timedelta.
import pandas as pd
from dateutil import parser
from datetime import date
from datetime import datetime
df = pd.DataFrame([[1,'1h:2m'],[2,'1h:58m'],[3,'35m'],[4,'2h']],columns=['id','time'])
today = date.today()
midnight = datetime.combine(today, datetime.min.time())
df['time'] = df['time'].apply(lambda x: (parser.parse(x, fuzzy=True)) - midnight)
This will convert your dataframe like this (print(df)):
id time
0 1 01:02:00
1 2 01:58:00
2 3 00:35:00
3 4 02:00:00
from which you can calculate the mean using print(df['time'].mean()):
0 days 01:23:45
Full example: https://ideone.com/Aze9mR
I have a large dataframe containing a Timestamp column like the one shown below:
Timestamp
16T122109960
16T122109965
16T122109970
16T122109975
[73853 rows x 1 columns]
I need to convert this into seconds (formatted like 12.523) elapsed since the first entry of the Timestamp column, using something like this:
start_time = log_file['Timestamp'][0]
log_file['Timestamp'] = log_file.Timestamp.apply(lambda x: x - start_time)
But first I need to parse the timestamps into seconds as quickly as possible. I've tried using regex to split the timestamp into hours, minutes, seconds, and milliseconds and then multiplying and dividing appropriately, but I was given a memory error. Is there a function within datetime or dateutil that would help?
The method I have used at the moment is below:
import re
def regex_time(time):
    # Split e.g. "16T122109960" into day, "T", hours, minutes, seconds, milliseconds
    parts = re.split(r"(\d*)(T)(\d{2})(\d{2})(\d{2})(\d{3})", time)
    date, delim, hours, minutes, seconds, mills = parts[1:-1]
    seconds = int(seconds)
    seconds += int(mills) / 1000
    seconds += int(minutes) * 60
    seconds += int(hours) * 3600
    return seconds
df['Timestamp'] = df.Timestamp.apply(regex_time)
You could try to convert the timestamp to a datetime object and then extract the seconds in the format you want.
Here is a code sample showing how it works:
from datetime import datetime
timestamp = 1545730073
dt_object = datetime.fromtimestamp(timestamp)
seconds = dt_object.strftime("%S.%f")
print(seconds)
Output:
53.000000
You can also apply it to the dataframe you are using, for instance:
from datetime import datetime
df = pd.DataFrame({'timestamp':[1545730073]})
df['datetime'] = df['timestamp'].apply(lambda x: datetime.fromtimestamp(x))
df['seconds'] = df['datetime'] .apply(lambda x : x.strftime("%S.%f"))
And it will return a dataFrame containing:
timestamp datetime seconds
0 1545730073 2018-12-25 10:27:53 53.000000
You could parse the strings with a strptime-style format, subtract start_time as a pd.Timestamp, and call total_seconds() on the resulting timedelta:
import pandas as pd
df = pd.DataFrame({'Timestamp': ['16T122109960','16T122109965','16T122109970','16T122109975']})
start_time = pd.Timestamp('1900-01-01')
df['totalseconds'] = (pd.to_datetime(df['Timestamp'], format='%dT%H%M%S%f')-start_time).dt.total_seconds()
df['totalseconds']
# 0 1340469.960
# 1 1340469.965
# 2 1340469.970
# 3 1340469.975
# Name: totalseconds, dtype: float64
To use the first entry of the 'Timestamp' column as reference time start_time, use
start_time = pd.to_datetime(df['Timestamp'].iloc[0], format='%dT%H%M%S%f')
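Putting the two pieces together, a minimal sketch of "seconds since the first timestamp" could look like this. The column name elapsed_s is an illustrative choice, and the format string assumes every entry has the same day, hours, minutes, seconds, and milliseconds layout as the sample rows.

```python
import pandas as pd

df = pd.DataFrame(
    {"Timestamp": ["16T122109960", "16T122109965", "16T122109970", "16T122109975"]}
)

# Parse once, vectorized, instead of calling a Python function per row.
parsed = pd.to_datetime(df["Timestamp"], format="%dT%H%M%S%f")

# Seconds elapsed since the first entry of the column.
df["elapsed_s"] = (parsed - parsed.iloc[0]).dt.total_seconds()

print(df)
```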
I have a dataframe that looks like the following:
arrival departure
0 23:55:00 23:57:00
1 23:57:00 23:59:00
2 23:59:00 24:01:00
3 24:01:00 24:03:00
4 24:03:00 24:05:00
I am working with data that cover a whole day and part of the following day. The values are (most of the time) in the HH:MM:SS format. However, some time values are higher than 23:59:59 and go up to 27:00:00.
I would like to get the time difference between departure and arrival columns.
I tried using datetime to do that but I guess something went wrong:
FMT = '%H:%M:%S'
delta = datetime.strptime(df['departure'], FMT) - datetime.strptime(df['arrival'], FMT)
Which raises the following error:
ValueError: time data '24:01:00' does not match format '%H:%M:%S'
Is there a way to get the time difference between these two columns even though their values do not always match the HH:MM:SS format?
You could use timedelta from datetime:
import datetime
delta1 = datetime.timedelta(hours=23, minutes=59, seconds=0)
delta2 = datetime.timedelta(hours=24, minutes=1, seconds=0)
delta = delta2 - delta1
>>> delta  # or delta.total_seconds() for the value in seconds
datetime.timedelta(seconds=120)
This gives you the difference as a timedelta; total_seconds() returns it in seconds. Full example:
import datetime
arrival = "24:01:00"
departure = "24:03:00"
def get_time_from_string(t):
    # Map "HH:MM:SS" to keyword arguments for datetime.timedelta
    return dict(zip(["hours", "minutes", "seconds"], map(int, t.split(":"))))
delta1 = datetime.timedelta(**get_time_from_string(arrival))
delta2 = datetime.timedelta(**get_time_from_string(departure))
delta = delta2 - delta1
print(delta.total_seconds())
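For the whole dataframe rather than a single pair of strings, a minimal sketch under the same "treat the values as durations" idea is to let pandas parse both columns with pd.to_timedelta. The column names delta and delta_s are illustrative.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame(
    {
        "arrival": ["23:55:00", "23:57:00", "23:59:00", "24:01:00", "24:03:00"],
        "departure": ["23:57:00", "23:59:00", "24:01:00", "24:03:00", "24:05:00"],
    }
)

# Parse both columns as durations since midnight; hours above 23 are fine here.
arrival = pd.to_timedelta(df["arrival"])
departure = pd.to_timedelta(df["departure"])

df["delta"] = departure - arrival
df["delta_s"] = df["delta"].dt.total_seconds()

print(df)
```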
I have a series where the timestamp is in the format HHHHH:MM:
timestamp = pd.Series(['34:23', '125:26', '15234:52'], index=index)
I would like to convert it to a timedelta series.
For now I manage to do that on a single string:
str[:-3]
str[-2:]
timedelta(hours=int(str[:-3]),minutes=int(str[-2:]))
I would like to apply it to the whole series, if possible in a cleaner way. Is there a way to do this?
You can use column-wise Pandas methods:
s = pd.Series(['34:23','125:26','15234:52'])
v = s.str.split(':', expand=True).astype(int)
s = pd.to_timedelta(v[0], unit='h') + pd.to_timedelta(v[1], unit='m')
print(s)
0 1 days 10:23:00
1 5 days 05:26:00
2 634 days 18:52:00
dtype: timedelta64[ns]
As pointed out in the comments, this can also be done in one line, although it is less readable:
s = pd.to_timedelta((s.str.split(':', expand=True).astype(int) * (60, 1)).sum(axis=1), unit='min')
This is how I would do it:
timestamp = pd.Series(['34:23','125:26','15234:52'])
x = timestamp.str.split(":").apply(lambda x: int(x[0])*60 + int(x[1]))
# x holds total minutes, so the unit passed to to_timedelta must be minutes
timestamp = pd.to_timedelta(x, unit='min')
Parse the delta in seconds as an argument to pd.to_timedelta like this,
In [1]: import pandas as pd
In [2]: ts = pd.Series(['34:23','125:26','15234:52'])
In [3]: secs = 60 * ts.apply(lambda x: 60*int(x[:-3]) + int(x[-2:]))
In [4]: pd.to_timedelta(secs, 's')
Out[4]:
0 1 days 10:23:00
1 5 days 05:26:00
2 634 days 18:52:00
dtype: timedelta64[ns]
Edit: I missed erncyp's answer, which takes the same approach. Just make sure the number passed to pd.to_timedelta matches its unit: a total in minutes needs unit='min', or has to be multiplied by 60 before being passed with unit='s'.
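As a quick check of how the unit argument affects the result, here is a small sketch using the first sample value, '34:23'. The variable names are illustrative.

```python
import pandas as pd

# '34:23' expressed as a total number of minutes.
total_minutes = 34 * 60 + 23

# Same quantity, passed with a matching unit each time.
as_minutes = pd.to_timedelta(total_minutes, unit="min")
as_seconds = pd.to_timedelta(total_minutes * 60, unit="s")

# Minutes mislabeled as seconds give a much shorter duration.
wrong_unit = pd.to_timedelta(total_minutes, unit="s")

print(as_minutes, as_seconds, wrong_unit)
```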
You can use pandas.Series.apply, i.e.:
from datetime import timedelta
def convert(args):
    return timedelta(hours=int(args[:-3]), minutes=int(args[-2:]))
s = pd.Series(['34:23','125:26','15234:52'])
s = s.apply(convert)