I have a CSV file that contains start-time and end-time for sessions.
I would like to understand how I can do End-time - Start-time to get the duration of a session.
So far I have this and it works
import datetime as dt
start_time = "2016-11-09 18:06:17"
end_time = "2016-11-09 18:21:07"
start_dt = dt.datetime.strptime(start_time, '%Y-%m-%d %H:%M:%S')
end_dt = dt.datetime.strptime(end_time, '%Y-%m-%d %H:%M:%S')
diff = (end_dt - start_dt)
duration = diff.seconds/60
print (duration)
but I want to do it for the whole column at once.
To import from a csv and then manipulate the date, pandas is the way to go. Since the only info you gave about your data was start and end time, I will show that.
Code:
import pandas as pd
df = pd.read_csv(data, parse_dates=['start_time', 'end_time'],
infer_datetime_format=True)
print(df)
df['time_delta'] = df.end_time.values - df.start_time.values
print(df.time_delta)
Test Data:
from io import StringIO
data = StringIO(u'\n'.join([x.strip() for x in """
start_time,end_time,a_number
2013-09-19 03:00:00,2013-09-19 04:00:00,221.0797
2013-09-19 04:00:00,2013-09-19 05:00:00,220.5083
2013-09-24 03:00:00,2013-09-24 05:00:00,221.7733
2013-09-24 04:00:00,2013-09-24 06:00:00,221.2493
""".split('\n')[1:-1]]))
Results:
start_time end_time a_number
0 2013-09-19 03:00:00 2013-09-19 04:00:00 221.0797
1 2013-09-19 04:00:00 2013-09-19 05:00:00 220.5083
2 2013-09-24 03:00:00 2013-09-24 05:00:00 221.7733
3 2013-09-24 04:00:00 2013-09-24 06:00:00 221.2493
0 01:00:00
1 01:00:00
2 02:00:00
3 02:00:00
Name: time_delta, dtype: timedelta64[ns]
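Since the question asks for the duration in minutes, the timedelta column can be converted with `.dt.total_seconds()`. This is a minimal sketch, assuming the CSV has `start_time` and `end_time` columns as above; the inline CSV text and the `duration_min` column name are made up for illustration.

```python
from io import StringIO

import pandas as pd

# Hypothetical two-row CSV standing in for the real file.
csv_text = StringIO(
    "start_time,end_time\n"
    "2016-11-09 18:06:17,2016-11-09 18:21:07\n"
    "2013-09-24 03:00:00,2013-09-24 05:00:00\n"
)
df = pd.read_csv(csv_text, parse_dates=['start_time', 'end_time'])

# Subtracting two datetime columns gives a timedelta column;
# total_seconds() keeps whole days, unlike the .seconds attribute.
df['duration_min'] = (df['end_time'] - df['start_time']).dt.total_seconds() / 60
print(df)
```

The whole column is computed in one vectorized step, so no loop over rows is needed.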
It seems you are trying to run diff against strings, instead of datetime values.
How about something like this?
from datetime import datetime
start_time = datetime(2016, 11, 9, 18, 6, 17)
end_time = datetime(2016, 11, 9, 18, 21, 7)
diff = end_time - start_time
print(diff.seconds / 60)
I think this should work.
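One caveat with `diff.seconds`: it only holds the seconds part of the difference and ignores whole days, so it is only safe when a session starts and ends within 24 hours. A small sketch of the difference, using a made-up end time one day later than in the question:

```python
from datetime import datetime

start = datetime(2016, 11, 9, 18, 6, 17)
end = datetime(2016, 11, 10, 18, 21, 7)  # hypothetical: one day later

diff = end - start  # 1 day, 0:14:50

minutes_from_seconds = diff.seconds / 60        # drops the whole day
minutes_from_total = diff.total_seconds() / 60  # includes the whole day

print(minutes_from_seconds, minutes_from_total)
```

If sessions can span more than a day, `total_seconds()` is the safer choice.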
I have a dataframe:
data = {'time':['08:45:00', '09:30:00', '18:00:00', '15:00:00']}
df = pd.DataFrame(data)
I would like to convert the time based on conditions: if the hour is less than 9, I want to set it to 9 and if the hour is more than 17, I need to set it to 17.
I tried this approach:
df['time'] = np.where(((df['time'].dt.hour < 9) & (df['time'].dt.hour != 0)), dt.time(9, 00))
I am getting an error: Can only use .dt accessor with datetimelike values.
Can anyone please help me with this? Thanks.
Here's a way to do what your question asks:
df.time = pd.to_datetime(df.time)
df.loc[df.time.dt.hour < 9, 'time'] = (df.time.astype('int64') + (9 - df.time.dt.hour)*3600*1000000000).astype('datetime64[ns]')
df.loc[df.time.dt.hour > 17, 'time'] = (df.time.astype('int64') + (17 - df.time.dt.hour)*3600*1000000000).astype('datetime64[ns]')
Input:
time
0 2022-06-06 08:45:00
1 2022-06-06 09:30:00
2 2022-06-06 18:00:00
3 2022-06-06 15:00:00
Output:
time
0 2022-06-06 09:45:00
1 2022-06-06 09:30:00
2 2022-06-06 17:00:00
3 2022-06-06 15:00:00
UPDATE:
Here's alternative code to try to address OP's error as described in the comments:
import pandas as pd
import datetime
data = {'time':['08:45:00', '09:30:00', '18:00:00', '15:00:00']}
df = pd.DataFrame(data)
print('', 'df loaded as strings:', df, sep='\n')
df.time = pd.to_datetime(df.time, format='%H:%M:%S')
print('', 'df converted to datetime by pd.to_datetime():', df, sep='\n')
df.loc[df.time.dt.hour < 9, 'time'] = (df.time.astype('int64') + (9 - df.time.dt.hour)*3600*1000000000).astype('datetime64[ns]')
df.loc[df.time.dt.hour > 17, 'time'] = (df.time.astype('int64') + (17 - df.time.dt.hour)*3600*1000000000).astype('datetime64[ns]')
df.time = [time.time() for time in pd.to_datetime(df.time)]
print('', 'df with time column adjusted to have hour between 9 and 17, converted to type "time":', df, sep='\n')
Output:
df loaded as strings:
time
0 08:45:00
1 09:30:00
2 18:00:00
3 15:00:00
df converted to datetime by pd.to_datetime():
time
0 1900-01-01 08:45:00
1 1900-01-01 09:30:00
2 1900-01-01 18:00:00
3 1900-01-01 15:00:00
df with time column adjusted to have hour between 9 and 17, converted to type "time":
time
0 09:45:00
1 09:30:00
2 17:00:00
3 15:00:00
UPDATE #2:
To not just change the hour for out-of-window times, but to simply apply 9:00 and 17:00 as min and max times, respectively (see OP's comment on this), you can do this:
df.loc[df['time'].dt.hour < 9, 'time'] = pd.to_datetime(pd.DataFrame({
'year':df['time'].dt.year, 'month':df['time'].dt.month, 'day':df['time'].dt.day,
'hour':[9]*len(df.index)}))
df.loc[df['time'].dt.hour > 17, 'time'] = pd.to_datetime(pd.DataFrame({
'year':df['time'].dt.year, 'month':df['time'].dt.month, 'day':df['time'].dt.day,
'hour':[17]*len(df.index)}))
df['time'] = [time.time() for time in pd.to_datetime(df['time'])]
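A possibly simpler variant of the same idea is to build the 9:00 and 17:00 bounds for each row's day and clamp with `Series.mask`. This is only a sketch: the names `earliest`, `latest` and `clamped` are made up here, and unlike the `hour > 17` test above it treats 17:00 as a hard maximum, so 17:30 would also become 17:00.

```python
import pandas as pd

df = pd.DataFrame({'time': ['08:45:00', '09:30:00', '18:00:00', '15:00:00']})
t = pd.to_datetime(df['time'], format='%H:%M:%S')

# Per-row bounds: midnight of the same day plus 9 or 17 hours.
day_start = t.dt.normalize()
earliest = day_start + pd.Timedelta(hours=9)
latest = day_start + pd.Timedelta(hours=17)

# Replace values below the lower bound, then values above the upper bound.
clamped = t.mask(t < earliest, earliest).mask(t > latest, latest)

# Back to datetime.time objects, as in the code above.
df['time'] = clamped.dt.time
print(df)
```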
Since your 'time' column contains strings, they can be kept as strings, with new string values assigned where appropriate. To filter by your criteria it is convenient to: create a datetime Series from the 'time' column; create boolean Series by comparing the datetime Series with your criteria; use the boolean Series to select the rows that need to be changed.
Your data:
import numpy as np
import pandas as pd
data = {'time':['08:45:00', '09:30:00', '18:00:00', '15:00:00']}
df = pd.DataFrame(data)
print(df.to_string())
>>>
time
0 08:45:00
1 09:30:00
2 18:00:00
3 15:00:00
Convert to datetime, make boolean Series with your criteria
dts = pd.to_datetime(df['time'])
lt_nine = dts.dt.hour < 9
gt_seventeen = (dts.dt.hour >= 17)
print(lt_nine)
print(gt_seventeen)
>>>
0 True
1 False
2 False
3 False
Name: time, dtype: bool
0 False
1 False
2 True
3 False
Name: time, dtype: bool
Use the boolean series to assign a new value:
df.loc[lt_nine,'time'] = '09:00:00'
df.loc[gt_seventeen,'time'] = '17:00:00'
print(df.to_string())
>>>
time
0 09:00:00
1 09:30:00
2 17:00:00
3 15:00:00
Or just stick with strings altogether and create the boolean Series using regex patterns and .str.match.
data = {'time':['08:45:00', '09:30:00', '18:00:00', '15:00:00','07:22:00','22:02:06']}
dg = pd.DataFrame(data)
print(dg.to_string())
>>>
time
0 08:45:00
1 09:30:00
2 18:00:00
3 15:00:00
4 07:22:00
5 22:02:06
# regex patterns
pattern_lt_nine = '^00|01|02|03|04|05|06|07|08'
pattern_gt_seventeen = '^17|18|19|20|21|22|23'
Make boolean Series and assign new values
gt_seventeen = dg['time'].str.match(pattern_gt_seventeen)
lt_nine = dg['time'].str.match(pattern_lt_nine)
dg.loc[lt_nine,'time'] = '09:00:00'
dg.loc[gt_seventeen,'time'] = '17:00:00'
print(dg.to_string())
>>>
time
0 09:00:00
1 09:30:00
2 17:00:00
3 15:00:00
4 09:00:00
5 17:00:00
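If every value is a zero-padded HH:MM:SS string, plain string comparison may be enough, because such strings sort in the same order as the times they represent. A minimal sketch under that assumption (it would not hold for values like '9:30:00'):

```python
import pandas as pd

data = {'time': ['08:45:00', '09:30:00', '18:00:00', '15:00:00', '07:22:00', '22:02:06']}
dg = pd.DataFrame(data)

# Zero-padded time strings compare lexicographically like times.
dg.loc[dg['time'] < '09:00:00', 'time'] = '09:00:00'
dg.loc[dg['time'] > '17:00:00', 'time'] = '17:00:00'
print(dg.to_string())
```

This gives the same result as the regex version for the sample data, without having to list every hour.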
See the pandas documentation: Time series / date functionality; Working with text data.
This is my dataframe.
Start_hour End_date
23:58:00 00:26:00
23:56:00 00:01:00
23:18:00 23:36:00
How can I get in a new column the difference (in minutes) between these two columns?
>>> from datetime import datetime
>>>
>>> before = datetime.now()
>>> print('wait for more than 1 minute')
wait for more than 1 minute
>>> after = datetime.now()
>>> td = after - before
>>>
>>> td
datetime.timedelta(seconds=98, microseconds=389121)
>>> td.total_seconds()
98.389121
>>> td.total_seconds() / 60
1.6398186833333335
Then you can round it or use it as-is.
You can do something like this:
import pandas as pd
df = pd.DataFrame({
'Start_hour': ['23:58:00', '23:56:00', '23:18:00'],
'End_date': ['00:26:00', '00:01:00', '23:36:00']}
)
df['Start_hour'] = pd.to_datetime(df['Start_hour'])
df['End_date'] = pd.to_datetime(df['End_date'])
df['diff'] = df.apply(
lambda row: (row['End_date']-row['Start_hour']).seconds / 60,
axis=1
)
print(df)
Start_hour End_date diff
0 2021-03-29 23:58:00 2021-03-29 00:26:00 28.0
1 2021-03-29 23:56:00 2021-03-29 00:01:00 5.0
2 2021-03-29 23:18:00 2021-03-29 23:36:00 18.0
You can also convert your dates back to strings if you like:
df['Start_hour'] = df['Start_hour'].apply(lambda x: x.strftime('%H:%M:%S'))
df['End_date'] = df['End_date'].apply(lambda x: x.strftime('%H:%M:%S'))
print(df)
Output:
Start_hour End_date diff
0 23:58:00 00:26:00 28.0
1 23:56:00 00:01:00 5.0
2 23:18:00 23:36:00 18.0
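The same result can probably be had without `apply`, by parsing the strings as timedeltas and wrapping negative differences around midnight with a modulo. This is a sketch that assumes no interval is longer than 24 hours:

```python
import pandas as pd

df = pd.DataFrame({
    'Start_hour': ['23:58:00', '23:56:00', '23:18:00'],
    'End_date': ['00:26:00', '00:01:00', '23:36:00'],
})

# Parse the HH:MM:SS strings as offsets from midnight.
start = pd.to_timedelta(df['Start_hour'])
end = pd.to_timedelta(df['End_date'])

# Difference in minutes; the modulo maps negative values
# (end time on the next day) into the range [0, 1440).
df['diff'] = ((end - start).dt.total_seconds() / 60) % (24 * 60)
print(df)
```

The original columns stay as strings, so no conversion back is needed afterwards.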
Short answer:
df['interval'] = df['End_date'] - df['Start_hour']
df.loc[df['End_date'] < df['Start_hour'], 'interval'] += timedelta(hours=24)
Why so:
You are probably trying to handle the case where your Start_hour and End_date values sometimes belong to different days, which is why you can't simply subtract one from the other.
If your time window never exceeds 24 hours, you can use some modular arithmetic to deal with the 23:59:59 - 00:00:00 boundary:
if End_date < Start_hour, this always means End_date belongs to the next day
this implies that if End_date - Start_hour < 0, we should add 24 hours to End_date to find the actual difference
The final formula is:
if rec['Start_hour'] < rec['End_date']:
    offset = timedelta(0)
else:
    offset = timedelta(hours=24)
rec['delta'] = offset + rec['End_date'] - rec['Start_hour']
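The rule can be checked on plain datetime values first. In this sketch `interval_across_midnight` is a hypothetical helper name and the dates are placeholders; only the time of day matters.

```python
from datetime import datetime, timedelta


def interval_across_midnight(start, end):
    """Return end - start, assuming end is less than 24 hours after start."""
    # If end is not later than start, treat it as belonging to the next day.
    offset = timedelta(0) if start < end else timedelta(hours=24)
    return offset + end - start


crossing = interval_across_midnight(
    datetime(2021, 1, 1, 23, 58), datetime(2021, 1, 1, 0, 26))
same_day = interval_across_midnight(
    datetime(2021, 1, 1, 23, 58), datetime(2021, 1, 1, 23, 59))
print(crossing, same_day)
```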
To do the same with a pandas.DataFrame we need to adapt the code accordingly,
which is how we get the snippet from the beginning of the answer.
import pandas as pd
from datetime import datetime, timedelta
df = pd.DataFrame([
{'Start_hour': datetime(1, 1, 1, 23, 58, 0), 'End_date': datetime(1, 1, 1, 0, 26, 0)},
{'Start_hour': datetime(1, 1, 1, 23, 58, 0), 'End_date': datetime(1, 1, 1, 23, 59, 0)},
])
# ...
df['interval'] = df['End_date'] - df['Start_hour']
df.loc[df['End_date'] < df['Start_hour'], 'interval'] += timedelta(hours=24)
> df
Start_hour End_date interval
0 0001-01-01 23:58:00 0001-01-01 00:26:00 0 days 00:28:00
1 0001-01-01 23:58:00 0001-01-01 23:59:00 0 days 00:01:00
I'm using pandas 1.0.3 with Python 3.7.7.
filelist.txt contains a list of filenames, sample below:
['2017_12_31_06_05_22_0568015522_2.40E_54.03N_VV_C11_GFS025CDF_wind_level2.nc\n']
['2017_12_30_06_14_22_0567929662_0.27E_53.81N_VV_C11_GFS025CDF_wind_level2.nc\n']
['2017_12_29_06_21_46_0567843706_1.64W_54.27N_VV_C11_GFS025CDF_wind_level2.nc\n']
['2017_12_28_17_42_04_0567798124_0.95E_54.10N_VV_C11_GFS025CDF_wind_level2.nc\n']
I use the code below to extract the date and time from this list to try and find the closest datetime in df_lidar, sample below:
datetime_copy x datetime
2017-12-30 00:00:00 290.0 2017-12-30 00:00:00
2017-12-31 00:10:00 290.0 2017-12-31 00:10:00
2017-12-31 00:20:00 290.0 2017-12-31 00:20:00
2017-12-31 00:30:00 290.0 2017-12-31 00:30:00
2017-12-31 00:40:00 290.0 2017-12-31 00:40:00
2017-12-31 00:50:00 290.0 2017-12-31 00:50:00
2017-12-31 01:00:00 290.0 2017-12-31 01:00:00
Both datetimes are added to df_events so I can then compare the difference between the dates using df_events['time_diff']=df_events['closest_lidar']-df_events['SAR_time'].
This fails with
TypeError: unsupported operand type(s) for -: 'str' and 'str'.
df_events['time_diff']=df_events['closest_lidar'].astype(datetime.timedelta)-df_events['SAR_time'].astype(datetime.timedelta) gives the following error TypeError: dtype '<class 'datetime.timedelta'>' not understood.
I would like your help to get these into the same format so I can calculate the time difference to the nearest time in df_lidar
Looking at the df_events :
SAR_time closest_lidar
0 SAR_time closest_lidar
1 2017-12-30 06:14:22 "2017-12-10 13:50:00 2017-12-10 13:50:00
Name: datetime, dtype: datetime64[ns]"
2 2017-12-29 06:21:46 "2017-12-10 13:50:00 2017-12-10 13:50:00
Name: datetime, dtype: datetime64[ns]"
The datetimes are formatted differently, despite using pd.to_datetime() for both columns.
print( type(df_x['datetime']))
<class 'pandas.core.series.Series'>
print (type(SAR_time))
<class 'pandas._libs.tslibs.timestamps.Timestamp'>
Full script below:
import os
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
#from windrose import WindroseAxes
import re
import datetime
user_name = os.getenv("USERNAME")
path = 'C:\\Users\\' + user_name + '\\' + 'data\\'
df_lidar = pd.read_pickle(path +'\lidar.pkl')
#filelist from satellite winds website
f = open(path + 'filelist.txt', 'r')
filelist = [line.split(',') for line in f.readlines()]
df_events= pd.DataFrame(index = range(len(filelist)))
#create empty columns to populate in loop
df_events['SAR_time']= np.nan
df_events['closest_lidar'] = np.nan
#go through each filename in filelist - extract date and find closest date in df_lidar
for i, j in enumerate(filelist):
    event_raw = filelist[i]
    event = str(event_raw).strip('['']')
    event_time = re.findall(r'\d\d\d\d_\d\d_\d\d\_\d\d_\d\d\_\d\d', event)
    event_string = str(event_time).strip('['']')
    event_string= re.sub(r"[^0-9]", "", event_string)
    event_string= re.sub(r"\s+", "", event_string)
    event_timestamp = pd.to_datetime(event_string, infer_datetime_format=True)
    idx = df_lidar.iloc[df_lidar.index.get_loc((event_timestamp), method ='nearest')]
    df_x = idx.to_frame()
    df_x = df_x.transpose()
    df_x['datetime'] = pd.to_datetime(df_x['datetime'])
    SAR_time = pd.to_datetime(event_timestamp)
    #single date here but double date in df_events below
    df_events.iloc[i] = {'SAR_time':event_timestamp, 'closest_lidar':df_x['datetime']}
df_events['time_diff']=df_events['closest_lidar']-df_events['SAR_time']
You need to convert the columns to datetime format before subtracting them.
For example, given:
a b
0 2017-12-30 00:00:00 2017-12-30 00:00:00
1 2017-12-31 00:10:00 2017-12-30 00:00:00
Convert to datetime format:
df["a"] = pd.to_datetime(df["a"], format="%Y-%m-%d %H:%M:%S")
df["b"] = pd.to_datetime(df["b"], format="%Y-%m-%d %H:%M:%S")
Then you can find diff:
df["diff"] = df["a"]-df["b"]
Output:
a b diff
0 2017-12-30 00:00:00 2017-12-30 0 days 00:00:00
1 2017-12-31 00:10:00 2017-12-30 1 days 00:10:00
Hope this is what you are looking for?
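Once both sides are real datetime columns, the nearest-time lookup itself could also be done without a loop. This is a sketch of a different technique from the script in the question, using `pd.merge_asof` with `direction='nearest'`; the `events` and `lidar` frames are small made-up stand-ins for `df_events` and `df_lidar`.

```python
import pandas as pd

# Hypothetical stand-ins for the SAR event times and the lidar timestamps.
events = pd.DataFrame({'SAR_time': pd.to_datetime(
    ['2017-12-31 06:05:22', '2017-12-30 06:14:22'])})
lidar = pd.DataFrame({'datetime': pd.to_datetime(
    ['2017-12-30 00:00:00', '2017-12-31 00:10:00', '2017-12-31 00:20:00'])})

# merge_asof requires both frames to be sorted on their key columns.
events = events.sort_values('SAR_time')
lidar = lidar.sort_values('datetime')

# For each event, pick the lidar row whose timestamp is closest.
matched = pd.merge_asof(events, lidar, left_on='SAR_time',
                        right_on='datetime', direction='nearest')
matched = matched.rename(columns={'datetime': 'closest_lidar'})

# Both columns are datetime64, so the subtraction yields timedeltas.
matched['time_diff'] = matched['closest_lidar'] - matched['SAR_time']
print(matched)
```

Because each cell holds a single Timestamp rather than a whole Series, the subtraction no longer fails with the `str` and `str` TypeError.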
I am working in a dataframe in Pandas that looks like this.
Identifier datetime
0 AL011851 00:00:00
1 AL011851 06:00:00
2 Al011851 12:00:00
This is my code so far:
import pandas as pd
hurricane_df = pd.read_csv("hurdat2.csv",parse_dates=['datetime'])
hurricane_df['datetime'] = pd.to_timedelta(hurricane_df['datetime'].dt.strftime('%H:%M:%S'))
hurricane_df
grouped = hurricane_df.groupby('datetime').size()
grouped
What I did was convert the datetime column to a timedelta to get the hours. I want to get the size of the datetime column but I want just hours like 1:00, 2:00, 3:00, etc. but I get minute intervals as well like 1:15 and 2:45.
Any way to just display the hour?
Thank you.
You can use pandas.Timestamp.round with Series.dt shortcut:
df['datetime'] = df['datetime'].dt.round('h')
So
... datetime
01:15:00
02:45:00
becomes
... datetime
01:00:00
03:00:00
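Note that `round` goes to the nearest hour, so 02:45 lands in the 03:00 bucket. If the goal is to count rows by the hour they fall in, `floor` may be the better fit. A small sketch with made-up timedelta values (`times`, `rounded`, `floored` and `counts` are illustrative names):

```python
import pandas as pd

# Hypothetical times of day stored as timedeltas, as in the question.
times = pd.Series(pd.to_timedelta(['01:15:00', '02:45:00', '01:00:00', '02:10:00']))

rounded = times.dt.round('h')  # nearest hour: 02:45 -> 03:00
floored = times.dt.floor('h')  # hour the time falls in: 02:45 -> 02:00

# Count rows per hour bucket.
counts = floored.groupby(floored).size()
print(counts)
```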
df = pd.DataFrame({'Identifier':['AL011851','AL011851','AL011851'],'datetime': ["2018-12-08 16:35:23","2018-12-08 14:20:45", "2018-12-08 11:45:00"]})
df['datetime'] = pd.to_datetime(df['datetime'])
df
Identifier datetime
0 AL011851 2018-12-08 16:35:23
1 AL011851 2018-12-08 14:20:45
2 AL011851 2018-12-08 11:45:00
# Rounds to nearest hour
from datetime import timedelta
def roundHour(t):
    return (t.replace(second=0, microsecond=0, minute=0, hour=t.hour)
            + timedelta(hours=t.minute//30))
df.datetime=df.datetime.map(lambda t: roundHour(t)) # Step 1: Round to nearest hour
df.datetime=df.datetime.map(lambda t: t.strftime('%H:%M')) # Step 2: Remove seconds
df
Identifier datetime
0 AL011851 17:00
1 AL011851 14:00
2 AL011851 12:00
I have two columns, one of type datetime64 and the other of type
datetime.time. The first column has the day and the second one the hour
and minutes. I am having trouble combining them into a single datetime:
Leistung_0011
ActStartDateExecution ActStartTimeExecution
0 2016-02-17 11:00:00
10 2016-04-15 07:15:00
20 2016-06-10 10:30:00
Leistung_0011['Start_datetime'] = pd.to_datetime(Leistung_0011['ActStartDateExecution'].astype(str) + ' ' + Leistung_0011['ActStartTimeExecution'].astype(str))
ValueError: ('Unknown string format:', 'NaT 00:00:00')
You can convert to str and join with whitespace before passing to pd.to_datetime:
df['datetime'] = pd.to_datetime(df['day'].astype(str) + ' ' + df['time'].astype(str))
print(df, df.dtypes, sep='\n')
# day time datetime
# 0 2018-01-01 15:00:00 2018-01-01 15:00:00
# 1 2015-12-30 05:00:00 2015-12-30 05:00:00
# day datetime64[ns]
# time object
# datetime datetime64[ns]
# dtype: object
Setup
from datetime import datetime
df = pd.DataFrame({'day': ['2018-01-01', '2015-12-30'],
'time': ['15:00', '05:00']})
df['day'] = pd.to_datetime(df['day'])
df['time'] = df['time'].apply(lambda x: datetime.strptime(x, '%H:%M').time())
print(df['day'].dtype, type(df['time'].iloc[0]), sep='\n')
# datetime64[ns]
# <class 'datetime.time'>
Complete example including seconds:
import pandas as pd
from io import StringIO
from datetime import datetime
x = StringIO(""" ActStartDateExecution ActStartTimeExecution
0 2016-02-17 11:00:00
10 2016-04-15 07:15:00
20 2016-06-10 10:30:00""")
df = pd.read_csv(x, sep=r'\s+')
df['ActStartDateExecution'] = pd.to_datetime(df['ActStartDateExecution'])
df['ActStartTimeExecution'] = df['ActStartTimeExecution'].apply(lambda x: datetime.strptime(x, '%H:%M:%S').time())
df['datetime'] = pd.to_datetime(df['ActStartDateExecution'].astype(str) + ' ' + df['ActStartTimeExecution'].astype(str))
print(df.dtypes)
ActStartDateExecution datetime64[ns]
ActStartTimeExecution object
datetime datetime64[ns]
dtype: object
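The ValueError in the question mentions 'NaT 00:00:00', which suggests that some rows have a missing date. One way to tolerate that, sketched here with made-up data, is to add the time of day to the date as a timedelta instead of joining strings; rows with a missing date then simply stay NaT.

```python
import datetime

import pandas as pd

# Hypothetical frame with one missing date (NaT) in the middle row.
df = pd.DataFrame({
    'day': pd.to_datetime(['2016-02-17', None, '2016-06-10']),
    'time': [datetime.time(11, 0), datetime.time(0, 0), datetime.time(10, 30)],
})

# str(datetime.time) gives 'HH:MM:SS', which to_timedelta can parse.
time_of_day = pd.to_timedelta(df['time'].astype(str))

# datetime64 + timedelta64 is vectorized; NaT + timedelta stays NaT.
df['datetime'] = df['day'] + time_of_day
print(df)
```

This assumes the time column itself has no missing values; those would need separate handling.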