I have a df with the time in one column and the milliseconds in another, like this:
Time ms
0 14:11:52 0
1 4:11:52 250
1 4:11:52 500
1 4:11:52 750
I want to add the milliseconds to the time like this:
Time
0 14:11:52
1 4:11:52:250
1 4:11:52:500
1 4:11:52:750
I tried converting both to datetime[ns] and [D], but I get the following error: cannot add DatetimeArray and DatetimeArray
df['Time'] = pd.to_datetime(df['Time'], format='%H:%M:%S')
df['ms'] = pd.to_datetime(df['ms'], format='%f')
df['Time'] = df['Time'] + df['ms']
I think it is possible to achieve what I want by using a timedelta, but is there a cleaner way to just add one date column to another?
IIUC, you need two to_timedelta calls:
pd.to_timedelta(df.Time)+pd.to_timedelta(df.ms,unit='ms')
Out[72]:
0 14:11:52
1 04:11:52.250000
1 04:11:52.500000
1 04:11:52.750000
dtype: timedelta64[ns]
df['Time']=pd.to_timedelta(df.Time)+pd.to_timedelta(df.ms,unit='ms')
Pandas' time mangling principle is simple:
datetime - datetime = timedelta
datetime + timedelta = datetime
The rest of the combinations will not work at all, or at least not as expected.
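A minimal sketch of these two rules applied to data shaped like the question's. It assumes Time holds strings such as '14:11:52' and ms holds integers; the sample values and variable names are illustrative. When only a time is parsed, pandas fills in 1900-01-01 as the date.

```python
import pandas as pd

df = pd.DataFrame({"Time": ["14:11:52", "14:11:52"], "ms": [0, 250]})

# Parsing a bare time yields a datetime on the default date 1900-01-01.
t = pd.to_datetime(df["Time"], format="%H:%M:%S")
# The milliseconds are a duration, so they become a timedelta.
delta = pd.to_timedelta(df["ms"], unit="ms")

stamped = t + delta            # datetime + timedelta = datetime
elapsed = stamped - t.iloc[0]  # datetime - datetime = timedelta

# Render as text with millisecond precision (%f is microseconds, so trim 3 digits).
as_text = stamped.dt.strftime("%H:%M:%S.%f").str[:-3]
```

If only the time of day matters, the text form or a pure timedelta column avoids carrying the placeholder date around.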
Related
Hello,
I am trying to extract the date and time column from my Excel data. I get the column as a DataFrame with float values, and after using pandas.to_datetime I get a different date than the actual date in Excel. For example, in Excel the starting date is 01.01.1901 00:00:00, but in Python I get 1971-01-03 00:00:00.000000.
How can I solve this problem?
I need the final output in total seconds as a DataFrame: the first cell starts at 0 seconds and each following cell is one timestep later (the time difference between cells is 15 min).
Thank you.
Your input is fractional days, so there's actually no need to convert to datetime if you want the duration in seconds relative to the first entry. Subtract the first entry from the whole column and multiply by the number of seconds in a day:
import pandas as pd
df = pd.DataFrame({"Datum/Zeit": [367.0, 367.010417, 367.020833]})
df["totalseconds"] = (df["Datum/Zeit"] - df["Datum/Zeit"].iloc[0]) * 86400
df["totalseconds"]
0 0.0000
1 900.0288
2 1799.9712
Name: totalseconds, dtype: float64
If you have to use datetime, you'll need to convert to timedelta (duration) to do the same, e.g. like
df["datetime"] = pd.to_datetime(df["Datum/Zeit"], unit="d")
# df["datetime"]
# 0 1971-01-03 00:00:00.000000
# 1 1971-01-03 00:15:00.028800
# 2 1971-01-03 00:29:59.971200
# Name: datetime, dtype: datetime64[ns]
# subtraction of datetime from datetime gives timedelta, which has total_seconds:
df["totalseconds"] = (df["datetime"] - df["datetime"].iloc[0]).dt.total_seconds()
# df["totalseconds"]
# 0 0.0000
# 1 900.0288
# 2 1799.9712
# Name: totalseconds, dtype: float64
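If the calendar date itself should match Excel, the origin argument of pd.to_datetime can shift the epoch. The sketch below assumes the floats are Excel serial dates in the 1900 date system, for which 1899-12-30 is the commonly used day zero; that assumption is not stated in the question, although it is consistent with 367 corresponding to 01.01.1901.

```python
import pandas as pd

df = pd.DataFrame({"Datum/Zeit": [367.0, 367.010417, 367.020833]})

# Assumption: Excel serial dates (1900 date system), day 0 taken as 1899-12-30.
df["datetime"] = pd.to_datetime(df["Datum/Zeit"], unit="D", origin="1899-12-30")

# Durations relative to the first row do not depend on the chosen origin.
df["totalseconds"] = (df["datetime"] - df["datetime"].iloc[0]).dt.total_seconds()
```

The total seconds come out the same as with the default Unix epoch; only the displayed dates change.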
I have to calculate the mean() of a time column, but the column's type is string. How can I do it?
id time
1 1h:2m
2 1h:58m
3 35m
4 2h
...
You can use a regex to extract hours and minutes. To calculate the mean time in minutes:
h = df['time'].str.extract(r'(\d{1,2})h').fillna(0).astype(int)
m = df['time'].str.extract(r'(\d{1,2})m').fillna(0).astype(int)
(h * 60 + m).mean()
Result:
0 83.75
dtype: float64
It's largely inspired by "How to construct a timedelta object from a simple string", but you can do it as below:
import re
import datetime

import pandas as pd

def convertToSecond(time_str):
    # Optional hours, minutes and seconds parts, separated by optional colons
    regex = re.compile(r'((?P<hours>\d+?)h)?:*((?P<minutes>\d+?)m)?:*((?P<seconds>\d+?)s)?')
    parts = regex.match(time_str)
    if not parts:
        return
    parts = parts.groupdict()
    # Keep only the parts that were actually present in the string
    time_params = {}
    for (name, param) in parts.items():
        if param:
            time_params[name] = int(param)
    return datetime.timedelta(**time_params).total_seconds()

df = pd.DataFrame({
    'time': ['1h:2m', '1h:58m', '35m', '2h'],})
df['inSecond'] = df['time'].apply(convertToSecond)
mean_inSecond = df['inSecond'].mean()
print(f"Mean of Time Column: {datetime.timedelta(seconds=mean_inSecond)}")
Result:
Mean of Time Column: 1:23:45
Another possibility is to convert your string column into timedelta (since they don't seem to be times but rather durations?).
Since your strings are not all formatted equally, you unfortunately cannot use pandas' to_timedelta function directly. However, the parser from dateutil has a fuzzy option that you can use to convert your column to datetime. If you subtract today's midnight from that, you get the value as a timedelta.
import pandas as pd
from dateutil import parser
from datetime import date
from datetime import datetime
df = pd.DataFrame([[1,'1h:2m'],[2,'1h:58m'],[3,'35m'],[4,'2h']],columns=['id','time'])
today = date.today()
midnight = datetime.combine(today, datetime.min.time())
df['time'] = df['time'].apply(lambda x: (parser.parse(x, fuzzy=True)) - midnight)
This will convert your dataframe like this (print(df)):
id time
0 1 01:02:00
1 2 01:58:00
2 3 00:35:00
3 4 02:00:00
from which you can calculate the mean using print(df['time'].mean()):
0 days 01:23:45
Full example: https://ideone.com/Aze9mR
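A vectorized alternative, as a sketch only: extract the optional hour and minute parts with a regex and build the timedelta from them, which avoids both dateutil and the dependence on today's date. The group names hours and minutes and the column name duration are illustrative, and the pattern assumes the strings only ever contain an h part, an m part, or both.

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3, 4], "time": ["1h:2m", "1h:58m", "35m", "2h"]})

# Pull out the optional hour and minute parts; a missing part becomes "0".
parts = (
    df["time"]
    .str.extract(r"(?:(?P<hours>\d+)h)?:?(?:(?P<minutes>\d+)m)?")
    .fillna("0")
    .astype(int)
)

# Build a real duration column from the two integer columns.
df["duration"] = pd.to_timedelta(parts["hours"], unit="h") + pd.to_timedelta(
    parts["minutes"], unit="min"
)
mean_duration = df["duration"].mean()
```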
I have dataframe in following format:
> buyer_id purch_id timestamp
> buyer_2 purch_2 1330767282
> buyer_3 purch_3 1330771685
> buyer_3 purch_4 1330778269
> buyer_4 purch_5 1330780256
> buyer_5 purch_6 1330813517
I would like your advice on how to convert the timestamp column (in the dataframe) into datetime and then extract only the time of the event into a new column.
Thanks!
Assuming 'timestamp' is Unix time (seconds since the epoch), you can convert it with to_datetime, passing the right unit ('s'), and take the time part:
df['time'] = pd.to_datetime(df['timestamp'], unit='s').dt.time
df
Out[9]:
buyer_id purch_id timestamp time
0 buyer_2 purch_2 1330767282 09:34:42
1 buyer_3 purch_3 1330771685 10:48:05
2 buyer_3 purch_4 1330778269 12:37:49
3 buyer_4 purch_5 1330780256 13:10:56
4 buyer_5 purch_6 1330813517 22:25:17
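Note that Unix time converts to UTC wall-clock times. If the events should be shown in a local zone, a sketch along these lines may help; "Europe/Berlin" is a hypothetical zone picked for illustration, and the column names time_utc and time_local are not from the question.

```python
import pandas as pd

df = pd.DataFrame({"timestamp": [1330767282, 1330771685]})

# utc=True marks the parsed values as UTC instead of leaving them naive.
utc = pd.to_datetime(df["timestamp"], unit="s", utc=True)

# "Europe/Berlin" is a hypothetical target zone chosen for illustration.
local = utc.dt.tz_convert("Europe/Berlin")

df["time_utc"] = utc.dt.time
df["time_local"] = local.dt.time
```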
I have a series where the timestamp is in the format HHHHH:MM:
timestamp = pd.Series(['34:23', '125:26', '15234:52'], index=index)
I would like to convert it to a timedelta series.
For now I manage to do that on a single string:
str[:-3]
str[-2:]
timedelta(hours=int(str[:-3]),minutes=int(str[-2:]))
I would like to apply it to the whole series, if possible in a cleaner way. Is there a way to do this?
You can use column-wise Pandas methods:
s = pd.Series(['34:23','125:26','15234:52'])
v = s.str.split(':', expand=True).astype(int)
s = pd.to_timedelta(v[0], unit='h') + pd.to_timedelta(v[1], unit='m')
print(s)
0 1 days 10:23:00
1 5 days 05:26:00
2 634 days 18:52:00
dtype: timedelta64[ns]
As pointed out in the comments, this can also be achieved in one line, albeit less clearly:
s = pd.to_timedelta((s.str.split(':', expand=True).astype(int) * (60, 1)).sum(axis=1), unit='min')
This is how I would do it:
timestamp = pd.Series(['34:23','125:26','15234:52'])
x = timestamp.str.split(":").apply(lambda x: int(x[0])*60 + int(x[1]))
timestamp = pd.to_timedelta(x, unit='min')  # x holds total minutes; unit='s' would need x * 60
Pass the delta in seconds as an argument to pd.to_timedelta, like this:
In [1]: import pandas as pd
In [2]: ts = pd.Series(['34:23','125:26','15234:52'])
In [3]: secs = 60 * ts.apply(lambda x: 60*int(x[:-3]) + int(x[-2:]))
In [4]: pd.to_timedelta(secs, 's')
Out[4]:
0 1 days 10:23:00
1 5 days 05:26:00
2 634 days 18:52:00
dtype: timedelta64[ns]
Edit: I missed erncyp's answer, which works as well as long as the number passed to pd.to_timedelta matches the unit: a minute count needs unit='min', or has to be multiplied by 60 first if unit='s' is used.
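A quick check, as a sketch: pd.to_timedelta accepts a minutes unit, so passing the minute count with unit='min' should agree with multiplying by 60 and using unit='s'. The variable names here are illustrative.

```python
import pandas as pd

ts = pd.Series(["34:23", "125:26", "15234:52"])

# Total minutes per entry: hours * 60 + minutes.
minutes = ts.str.split(":").apply(lambda p: int(p[0]) * 60 + int(p[1]))

from_minutes = pd.to_timedelta(minutes, unit="min")
from_seconds = pd.to_timedelta(minutes * 60, unit="s")
```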
You can use pandas.Series.apply, i.e.:
from datetime import timedelta

def convert(args):
    return timedelta(hours=int(args[:-3]), minutes=int(args[-2:]))

s = pd.Series(['34:23','125:26','15234:52'])
s = s.apply(convert)
I have a pandas dataframe with a TEXT column called Used which contains the duration of phone calls in minutes:seconds. I would like to convert this to a duration format. The problem is that some of the minute values are greater than 59, which gives an error:
time data '67:01' does not match format '%M:%S'
The code to convert this is:
df.Used.apply(lambda x: datetime.datetime.strptime(x, '%M:%S'))
Is there a simple way to convert this to a decimal minutes format? Something like 67.01666 for 67:01?
Based on the documentation for the datetime object, minutes can only take values in the [0, 60) range:
The year, month and day arguments are required. tzinfo may be None, or an instance of a tzinfo subclass. The remaining arguments may be ints or longs, in the following ranges:
0 <= hour < 24
0 <= minute < 60
0 <= second < 60
0 <= microsecond < 1000000
So that error is unavoidable with strptime. If you want a decimal in minutes.seconds form (so 67:01 becomes 67.01, which is not the same as fractional minutes), you'll need to do it manually, like so:
# Split the string, join it and cast it to float
df.Used.apply(lambda x : float(".".join(x.split(":"))))
Which outputs:
In [5]: df = pd.DataFrame([['87:01'],['911:11']],columns=['Used'])
In [6]: df.Used.apply(lambda x : float(".".join(x.split(":"))))
Out[6]:
0 87.01
1 911.11
Name: Used, dtype: float64
I used the following which seems similar to some of the answers above. Using split I made two dataframes, one for minutes and another for seconds which I converted to float and then combined them to form a decimal column in the original dataframe.
test_df = home_df.Used.str.split(':')
minutes_df = test_df.str[0]
seconds_df = test_df.str[1]
minutes_df = minutes_df.astype(float)
seconds_df = seconds_df.astype(float)
decmin_df = minutes_df + seconds_df / 60.
home_df['Duration'] = decmin_df
If you are storing a duration, I would suggest that the correct type is a timedelta, not a datetime (a datetime always requires a year, month, day, etc.; it is meant to denote exact dates and times).
A quick and easy way to do that is to split the string on : and pass the two parts to the minutes and seconds arguments of datetime.timedelta. Example -
df.Used.apply(lambda x: datetime.timedelta(minutes=int(x.split(':')[0]), seconds=int(x.split(':')[1])))
Demo -
In [15]: import pandas as pd
In [16]: df = pd.DataFrame([['67:01'],['11:11'],['59:59'],['09:08']],columns=['Used'])
In [17]: df
Out[17]:
Used
0 67:01
1 11:11
2 59:59
3 09:08
In [18]: import datetime
In [19]: df.Used.apply(lambda x: datetime.timedelta(minutes=int(x.split(':')[0]), seconds=int(x.split(':')[1])))
Out[19]:
0 01:07:01
1 00:11:11
2 00:59:59
3 00:09:08
Name: Used, dtype: timedelta64[ns]
If you want it as float, you can also do it with a simple change -
df.Used.apply(lambda x: float(x.split(':')[0]) + float(x.split(':')[1])/60)
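Both results can also be obtained without apply, as a rough sketch: split the column once, build a timedelta from the parts, and divide the total seconds by 60 for decimal minutes. The column names Duration and DecimalMinutes are illustrative, and the sketch assumes every value has exactly one colon.

```python
import pandas as pd

df = pd.DataFrame({"Used": ["67:01", "11:11", "59:59", "09:08"]})

# Split once into two integer columns: 0 = minutes, 1 = seconds.
parts = df["Used"].str.split(":", expand=True).astype(int)

# Durations as timedelta; minutes above 59 simply roll over into hours.
df["Duration"] = pd.to_timedelta(parts[0], unit="min") + pd.to_timedelta(parts[1], unit="s")

# Decimal minutes derived from the duration.
df["DecimalMinutes"] = df["Duration"].dt.total_seconds() / 60
```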