I have raw data like this and want to find the difference between the two times in minutes. The problem is that the data is in a data frame.
source:
start time end time
0 08:30:00 17:30:00
1 11:00:00 17:30:00
2 08:00:00 21:30:00
3 19:30:00 22:00:00
4 19:00:00 00:00:00
5 08:30:00 15:30:00
I need output like this:
duration
540mint
798mint
162mint
1140mint
420mint
Your expected output seems to be incorrect. That aside, we can use base R's difftime:
transform(
df,
duration = difftime(
strptime(end.time, format = "%H:%M:%S"),
strptime(start.time, format = "%H:%M:%S"),
units = "mins"))
# start.time end.time duration
#0 08:30:00 17:30:00 540 mins
#1 11:00:00 17:30:00 390 mins
#2 08:00:00 21:30:00 810 mins
#3 19:30:00 22:00:00 150 mins
#4 19:00:00 00:00:00 -1140 mins
#5 08:30:00 15:30:00 420 mins
or as a difftime vector
with(df, difftime(
strptime(end.time, format = "%H:%M:%S"),
strptime(start.time, format = "%H:%M:%S"),
units = "mins"))
#Time differences in mins
#[1] 540 390 810 150 -1140 420
Sample data
df <- read.table(text =
" 'start time' 'end time'
0 08:30:00 17:30:00
1 11:00:00 17:30:00
2 08:00:00 21:30:00
3 19:30:00 22:00:00
4 19:00:00 00:00:00
5 08:30:00 15:30:00", header = T, row.names = 1)
import pandas as pd
df = pd.DataFrame({'start time':['08:30:00','11:00:00','08:00:00','19:30:00','19:00:00','08:30:00'],'end time':['17:30:00','17:30:00','21:30:00','22:00:00','00:00:00','15:30:00']},columns=['start time','end time'])
df
Out[355]:
start time end time
0 08:30:00 17:30:00
1 11:00:00 17:30:00
2 08:00:00 21:30:00
3 19:30:00 22:00:00
4 19:00:00 00:00:00
5 08:30:00 15:30:00
(pd.to_datetime(df['end time']) - pd.to_datetime(df['start time'])).dt.seconds/60
Out[356]:
0 540.0
1 390.0
2 810.0
3 150.0
4 300.0
5 420.0
dtype: float64
Yes, definitely datetime is what you need here. Specifically, the strptime function, which parses a string into a time object.
from datetime import datetime
s1 = '10:33:26'
s2 = '11:15:49' # for example
FMT = '%H:%M:%S'
tdelta = datetime.strptime(s2, FMT) - datetime.strptime(s1, FMT)
That gets you a timedelta object that contains the difference between the two times. You can do whatever you want with that, e.g. converting it to seconds or adding it to another datetime.
This will return a negative result if the end time is earlier than the start time, for example s1 = 12:00:00 and s2 = 05:00:00. If you want the code to assume the interval crosses midnight in this case (i.e. it should assume the end time is never earlier than the start time), you can add the following lines to the above code:
if tdelta.days < 0:
    tdelta = timedelta(days=0,
                seconds=tdelta.seconds, microseconds=tdelta.microseconds)
(of course you need to include from datetime import timedelta somewhere). Thanks to J.F. Sebastian for pointing out this use case.
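For the data frame in the original question, the same midnight adjustment can be sketched in pandas. This is a minimal example rather than part of the answer above: it assumes the times are HH:MM:SS strings, that an end time earlier than the start time means the interval crosses midnight, and the duration column name is only illustrative.

```python
import pandas as pd

# Two rows from the question: a same-day interval and one ending at midnight.
df = pd.DataFrame({
    "start time": ["08:30:00", "19:00:00"],
    "end time": ["17:30:00", "00:00:00"],
})

# Parse the strings as offsets from midnight.
start = pd.to_timedelta(df["start time"])
end = pd.to_timedelta(df["end time"])

# Raw difference in minutes; negative when the end time is "earlier".
minutes = (end - start).dt.total_seconds() / 60

# Assumption: a negative difference means the interval crossed midnight,
# so add one day (24 * 60 minutes) to those rows only.
df["duration"] = minutes.where(minutes >= 0, minutes + 24 * 60)

print(df)  # duration is 540.0 for the first row and 300.0 for the second
```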
I have a dataframe with a column 'queue_ist_dt'. This column contains pandas._libs.tslibs.timestamps.Timestamp values. My requirement is :
if time = 10:13:00 then round_off_time = 10:00:00
if time = 23:29:00 then round_off_time = 23:00:00
and so on.
if time = 10:31:00 then round_off_time = 10:30:00
if time = 23:53:00 then round_off_time = 23:30:00
and so on.
if time = 10:30:00 then round_off_time = 10:30:00
These are the 3 conditions.
I tried to write the following logic :
for r in range(df.shape[0]):
    try:
        if df.loc[r,'queue_ist_dt'].minute<30:
            timedelta = pd.Timedelta(minutes=df.loc[r,'queue_ist_dt'].minute)
            df.loc[r,'queue_placed_interval'] = df.loc[r,'queue_ist_dt']- timedelta
        elif df.loc[r,'queue_ist_dt'].minute>30:
            ******NEED HELP TO BUILD THIS LOGIC******
    except:
        pass
I need help building the logic for times where the minutes are greater than 30 and have to be rounded down to 30.
Use Series.dt.floor:
#if necessary convert to datetimes
df['queue_ist_dt'] = pd.to_datetime(df['queue_ist_dt'].astype(str))
df['queue_ist_dt1'] = df['queue_ist_dt'].dt.floor('30Min').dt.time
print (df)
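As a quick check of the floor approach, here is a small sketch that runs it on the example times from the question. The dates and the round_off_time column name are made up for illustration.

```python
import pandas as pd

# The five example times from the question, on an arbitrary date.
df = pd.DataFrame({
    "queue_ist_dt": pd.to_datetime([
        "2021-01-01 10:13:00",
        "2021-01-01 23:29:00",
        "2021-01-01 10:31:00",
        "2021-01-01 23:53:00",
        "2021-01-01 10:30:00",
    ])
})

# floor always rounds down to the previous 30-minute boundary.
df["round_off_time"] = df["queue_ist_dt"].dt.floor("30min")

print(df["round_off_time"].dt.strftime("%H:%M:%S").tolist())
# ['10:00:00', '23:00:00', '10:30:00', '23:30:00', '10:30:00']
```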
The logic is to subtract only the minutes past the half hour (minute - 30) from the timestamp.
The code is as below:
for r in range(df.shape[0]):
    try:
        if df.loc[r,'queue_ist_dt'].minute<30:
            timedelta = pd.Timedelta(minutes=df.loc[r,'queue_ist_dt'].minute)
            df.loc[r,'queue_placed_interval'] = df.loc[r,'queue_ist_dt']- timedelta
        elif df.loc[r,'queue_ist_dt'].minute>=30:
            # ******THIS LOGIC******
            # subtract only the minutes past the half hour (also covers exactly :30)
            timedelta = pd.Timedelta(minutes=df.loc[r,'queue_ist_dt'].minute - 30)
            df.loc[r,'queue_placed_interval'] = df.loc[r,'queue_ist_dt']- timedelta
    except:
        pass
Let me know if this helps you😊
Consider this dataframe df as an example
df = pd.DataFrame({'queue_ist_dt': [pd.Timestamp('2021-01-01 10:01:00'),
                                    pd.Timestamp('2021-01-01 10:35:00'),
                                    pd.Timestamp('2021-01-01 11:19:00'),
                                    pd.Timestamp('2021-01-01 11:33:00'),
                                    pd.Timestamp('2021-01-01 23:23:00'),
                                    pd.Timestamp('2021-01-01 23:22:00'),
                                    pd.Timestamp('2021-01-01 23:55:00')]
                   })
[Out]:
queue_ist_dt
0 2021-01-01 10:01:00
1 2021-01-01 10:35:00
2 2021-01-01 11:19:00
3 2021-01-01 11:33:00
4 2021-01-01 23:23:00
5 2021-01-01 23:22:00
6 2021-01-01 23:55:00
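One way would be to use pandas.Series.dt.round, which rounds to the nearest 30 minutes (so some values are rounded up; use dt.floor instead if values must always round down), as follows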
df['round_off_time'] = df['queue_ist_dt'].dt.round('30min')
[Out]:
queue_ist_dt round_off_time
0 2021-01-01 10:01:00 2021-01-01 10:00:00
1 2021-01-01 10:35:00 2021-01-01 10:30:00
2 2021-01-01 11:19:00 2021-01-01 11:30:00
3 2021-01-01 11:33:00 2021-01-01 11:30:00
4 2021-01-01 23:23:00 2021-01-01 23:30:00
5 2021-01-01 23:22:00 2021-01-01 23:30:00
6 2021-01-01 23:55:00 2021-01-02 00:00:00
If the goal is to change the values in the column queue_ist_dt, do the following
df['queue_ist_dt'] = df['queue_ist_dt'].dt.round('30min')
[Out]:
queue_ist_dt
0 2021-01-01 10:00:00
1 2021-01-01 10:30:00
2 2021-01-01 11:30:00
3 2021-01-01 11:30:00
4 2021-01-01 23:30:00
5 2021-01-01 23:30:00
6 2021-01-02 00:00:00
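Since the question asks for times to be rounded down, it may help to compare round and floor on the same sample. The sketch below uses the sample times shown in the output above; the rounded and floored column names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "queue_ist_dt": pd.to_datetime([
        "2021-01-01 10:01:00", "2021-01-01 10:35:00", "2021-01-01 11:19:00",
        "2021-01-01 11:33:00", "2021-01-01 23:23:00", "2021-01-01 23:22:00",
        "2021-01-01 23:55:00",
    ])
})

# round goes to the nearest 30-minute mark, floor always goes down.
df["rounded"] = df["queue_ist_dt"].dt.round("30min")
df["floored"] = df["queue_ist_dt"].dt.floor("30min")

# Rows where the two disagree are the ones that round would push upward.
differs = df[df["rounded"] != df["floored"]]
print(differs)
```

On this sample the two methods disagree for 11:19, 23:23, 23:22 and 23:55, which round moves up to the next half hour (and, for 23:55, into the next day).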
I have the following table:
Hora_Retiro count_uses
0 00:00:18 1
1 00:00:34 1
2 00:02:27 1
3 00:03:13 1
4 00:06:45 1
... ... ...
748700 23:58:47 1
748701 23:58:49 1
748702 23:59:11 1
748703 23:59:47 1
748704 23:59:56 1
And I want to group all values within each hour, so I can see the total number of uses per hour (00:00:00 - 23:00:00)
I have the following code:
hora_pico_aug= hora_pico.groupby(pd.Grouper(key="Hora_Retiro",freq='H')).count()
Hora_Retiro column is of timedelta64[ns] type
Which gives the following output:
count_uses
Hora_Retiro
00:00:02 2566
01:00:02 602
02:00:02 295
03:00:02 5
04:00:02 10
05:00:02 4002
06:00:02 16075
07:00:02 39410
08:00:02 76272
09:00:02 56721
10:00:02 36036
11:00:02 32011
12:00:02 33725
13:00:02 41032
14:00:02 50747
15:00:02 50338
16:00:02 42347
17:00:02 54674
18:00:02 76056
19:00:02 57958
20:00:02 34286
21:00:02 22509
22:00:02 13894
23:00:02 7134
However, the index column starts at 00:00:02, and I want it to start at 00:00:00 and then continue in one-hour intervals. Something like this:
count_uses
Hora_Retiro
00:00:00 2565
01:00:00 603
02:00:00 295
03:00:00 5
04:00:00 10
05:00:00 4002
06:00:00 16075
07:00:00 39410
08:00:00 76272
09:00:00 56721
10:00:00 36036
11:00:00 32011
12:00:00 33725
13:00:00 41032
14:00:00 50747
15:00:00 50338
16:00:00 42347
17:00:00 54674
18:00:00 76056
19:00:00 57958
20:00:00 34286
21:00:00 22509
22:00:00 13894
23:00:00 7134
How can I make it start at 00:00:00?
Thanks for the help!
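You can create an hour column from the Hora_Retiro column (this works when the column has a datetime dtype; for a timedelta64[ns] column use df['Hora_Retiro'].dt.components.hours instead).
df['hour'] = df['Hora_Retiro'].dt.hour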
And then groupby on the basis of hour
gpby_df = df.groupby('hour')['count_uses'].sum().reset_index()
gpby_df['hour'] = pd.to_datetime(gpby_df['hour'], format='%H').dt.time
gpby_df.columns = ['Hora_Retiro', 'sum_count_uses']
gpby_df
gives
Hora_Retiro sum_count_uses
0 00:00:00 14
1 09:00:00 1
2 10:00:00 2
3 20:00:00 2
I assume that the Hora_Retiro column in your DataFrame is of
Timedelta type. It is not datetime, as in that case the date part
would be printed as well.
Indeed, your code creates groups starting at the minute / second
taken from the first row.
To group by "full hours":
round each element in this column down to the full hour,
then group (just by this rounded value).
The code to do it is:
hora_pico.groupby(hora_pico.Hora_Retiro.dt.floor('h')).count_uses.count()
However, I advise you to decide what you want to count:
rows or values in the count_uses column.
In the second case replace the count function with sum.
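A small self-contained sketch of grouping a timedelta column by full hours follows. The six rows are made-up sample data in the shape of the question's table, and it uses dt.floor so that a value such as 00:45:27 stays in the 00:00:00 bucket instead of being rounded up to 01:00:00.

```python
import pandas as pd

# Made-up sample in the same shape as the question's table.
hora_pico = pd.DataFrame({
    "Hora_Retiro": pd.to_timedelta([
        "00:00:18", "00:00:34", "00:45:27",
        "01:03:13", "23:58:47", "23:59:56",
    ]),
    "count_uses": [1, 1, 1, 1, 1, 1],
})

# Floor every timedelta to the start of its hour, then group on that key.
hour_start = hora_pico["Hora_Retiro"].dt.floor("h")
per_hour = hora_pico.groupby(hour_start)["count_uses"].sum()

print(per_hour)
```

The resulting index holds the hour boundaries (0 hours, 1 hour, 23 hours) rather than the time of the first row in each group.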
I need to essentially measure how much each employee gets paid during each hour of work. There was some data cleaning to do and so I'm trying to make the formatting consistent.
It is a homework problem and it's proving tough. I am new to Python so please feel free to compress the code. I'm trying to use the pandas library.
csv file in pandas
break_notes end_time pay_rate start_time
0 15-18 23:00 10.0 10:00
1 18.30-19.00 23:00 12.0 18:00
2 4PM-5PM 22:30 14.0 12:00
3 3-4 18:00 10.0 09:00
4 4-4.10PM 23:00 20.0 09:00
5 15 - 17 23:00 10.0 11:00
6 11 - 13 16:00 10.0 10:00
import pandas as pd
import datetime
import numpy as np
work_shifts = pd.read_csv('work_shifts.csv')
# raw string so that \d, \. and \D are treated as regex escapes
break_shifts = work_shifts['break_notes'].str.extract(r'(?P<start>[\d\.]+)?\D*(?P<end>[\d\.]+)?')
print(work_shifts)
for i in range(len(break_shifts['start'])):
    if '.' not in break_shifts['start'][i]:
        break_shifts['start'][i] = break_shifts['start'][i] + ':00'
    else:
        break_shifts['start'][i] = break_shifts['start'][i].replace('.',':')
for i in range(len(break_shifts['end'])):
    if '.' in str(break_shifts['end'][i]):
        break_shifts['end'][i] = break_shifts['end'][i].replace('.',':')
    elif '.' not in str(break_shifts['end'][i]):
        break_shifts['end'][i] = break_shifts['end'][i] + ':00'
for i in range(len(break_shifts['end'])):
    break_shifts['end'][i] = datetime.datetime.strptime(break_shifts['end'][i], '%H:%M').time()
    break_shifts['start'][i] = datetime.datetime.strptime(break_shifts['start'][i], '%H:%M').time()
work_shifts[['start_break','end_break']] = break_shifts[['start', 'end']]
for i in range(len(work_shifts['end_time'])):
    work_shifts['end_time'][i] = datetime.datetime.strptime(work_shifts['end_time'][i], '%H:%M').time()
for i in range(len(work_shifts['start_time'])):
    work_shifts['start_time'][i] = datetime.datetime.strptime(work_shifts['start_time'][i], '%H:%M').time()
print(work_shifts)
this is the result
break_notes end_time pay_rate start_time start_break end_break
0 15-18 23:00:00 10.0 10:00:00 15:00:00 18:00:00
1 18.30-19.00 23:00:00 12.0 18:00:00 18:30:00 19:00:00
2 4PM-5PM 22:30:00 14.0 12:00:00 04:00:00 05:00:00
3 3-4 18:00:00 10.0 09:00:00 03:00:00 04:00:00
4 4-4.10PM 23:00:00 20.0 09:00:00 04:00:00 04:10:00
5 15 - 17 23:00:00 10.0 11:00:00 15:00:00 17:00:00
6 11 - 13 16:00:00 10.0 10:00:00 11:00:00 13:00:00
I tried adding the times but they are inconsistent types. If there's a different approach then please provide guidance. I need to calculate how many employees are working at what time and then calculate how much pay is given to the employees per hour.
My approach was to convert the formatting of the break notes into times, then convert the 12-hour times to 24-hour, provided both end_break and start_break were before noon, i.e. datetime.time(12, 0, 0).
I'm not sure how to calculate the money per hour. Maybe using if statements?
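One possible way to make the break times consistent is sketched below. It is only an illustration under a stated assumption: a break time that falls before the shift start is taken to be written in 12-hour form and is shifted by 12 hours. The three rows are taken from the table above; the to_offset helper and the _24h and paid_hours column names are hypothetical.

```python
import pandas as pd

# Three rows from the table above, already parsed to "HH:MM" strings.
shifts = pd.DataFrame({
    "start_time": ["10:00", "09:00", "12:00"],
    "end_time": ["23:00", "18:00", "22:30"],
    "start_break": ["15:00", "03:00", "04:00"],
    "end_break": ["18:00", "04:00", "05:00"],
})

def to_offset(col):
    """Turn a column of 'HH:MM' strings into timedeltas since midnight."""
    return pd.to_timedelta(col + ":00")

shift_start = to_offset(shifts["start_time"])
shift_end = to_offset(shifts["end_time"])

for col in ["start_break", "end_break"]:
    t = to_offset(shifts[col])
    # Assumption: a break before the shift start was meant as a PM time.
    shifts[col + "_24h"] = t.where(t >= shift_start, t + pd.Timedelta(hours=12))

# Working time is the shift length minus the break length.
break_len = shifts["end_break_24h"] - shifts["start_break_24h"]
shifts["paid_hours"] = ((shift_end - shift_start) - break_len).dt.total_seconds() / 3600

print(shifts[["start_break_24h", "end_break_24h", "paid_hours"]])
```

Keeping everything as timedeltas since midnight avoids the mixed types that come from datetime.time objects, which cannot be added or subtracted directly.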
Before everyone down votes, this is a tricky question to phrase in a single title. For a given timestamp, I want to round it to the previous 15 min when it's more than 10 mins away (i.e. 11-15 mins). If it's 10 mins or less away I want to round it to the previous, previous 15 min.
This may be easier to display:
1st timestamp = 08:12:00. More than 10 mins so round to the previous 15 min = 08:00:00
2nd timestamp = 08:07:00. Less than 10 mins so round to the previous, previous 15 min = 07:45:00
I can round values greater than 10 mins easily enough. The ones less than 10 mins are where I'm struggling. I attempted to convert the timestamps to total seconds to determine if the difference is less than 600 seconds (10 mins). If it is less than 600 seconds I would take another 15 mins off. If it is more than 600 seconds I would leave it as is. Below is my attempt.
import pandas as pd
from datetime import datetime, timedelta
d = ({
'Time' : ['8:10:00'],
})
df = pd.DataFrame(data=d)
df['Time'] = pd.to_datetime(df['Time'])
def hour_rounder(t):
    return t.replace(second=0, microsecond=0, minute=(t.minute // 15 * 15), hour=t.hour)
FirstTime = df['Time'].iloc[0]
StartTime = hour_rounder(FirstTime)
#Strip date
FirstTime = datetime.time(FirstTime)
StartTime = datetime.time(StartTime)
#Convert timestamps to total seconds
def get_sec(time_str):
    h, m, s = time_str.split(':')
    return int(h) * 3600 + int(m) * 60 + int(s)
FirstTime = str(FirstTime)
FirstTime_secs = get_sec(FirstTime)
StartTime = str(StartTime)
StartTime_secs = get_sec(StartTime)
#Determine difference
diff = FirstTime_secs - StartTime_secs
If working with timedeltas is possible, first use to_timedelta, then Series.dt.floor, and if the minutes modulo 15 are less than or equal to 10, subtract 15 minutes:
import numpy as np
import pandas as pd
d = {'Time': ['08:00:00', '08:01:00', '08:02:00', '08:03:00', '08:04:00',
'08:05:00', '08:06:00', '08:07:00', '08:08:00', '08:09:00',
'08:10:00', '08:11:00', '08:12:00', '08:13:00', '08:14:00',
'08:15:00', '08:16:00', '08:17:00', '08:18:00', '08:19:00',
'08:20:00', '08:21:00', '08:22:00', '08:23:00', '08:24:00',
'08:25:00', '08:26:00', '08:27:00', '08:28:00', '08:29:00',
'08:30:00', '08:31:00', '08:32:00', '08:33:00', '08:34:00',
'08:35:00', '08:36:00', '08:37:00', '08:38:00', '08:39:00']}
df = pd.DataFrame(d)
df['Time'] = pd.to_timedelta(df['Time'])
s = df['Time'].dt.floor(freq='15min')
#https://stackoverflow.com/a/14190143 for converting timedeltas to minutes
df['new'] = np.where(((df['Time'].dt.total_seconds() % 3600) // 60) % 15 <= 10,
                     s - pd.Timedelta(15 * 60, 's'), s)
print (df)
Time new
0 08:00:00 07:45:00
1 08:01:00 07:45:00
...
9 08:09:00 07:45:00
10 08:10:00 07:45:00
11 08:11:00 08:00:00
12 08:12:00 08:00:00
...
24 08:24:00 08:00:00
25 08:25:00 08:00:00
26 08:26:00 08:15:00
27 08:27:00 08:15:00
...
38 08:38:00 08:15:00
39 08:39:00 08:15:00
If you need to work with datetimes, the solution is similar, using Series.dt.minute:
df = pd.DataFrame({'Time':pd.date_range('2015-01-01 08:00:00', freq='min', periods=40)})
s = df['Time'].dt.floor(freq='15min')
df['new'] = np.where(df['Time'].dt.minute % 15 <= 10, s - pd.Timedelta(15*60, 's'), s)
print (df)
Time new
0 2015-01-01 08:00:00 2015-01-01 07:45:00
1 2015-01-01 08:01:00 2015-01-01 07:45:00
...
9 2015-01-01 08:09:00 2015-01-01 07:45:00
10 2015-01-01 08:10:00 2015-01-01 07:45:00
11 2015-01-01 08:11:00 2015-01-01 08:00:00
12 2015-01-01 08:12:00 2015-01-01 08:00:00
13 2015-01-01 08:13:00 2015-01-01 08:00:00
...
24 2015-01-01 08:24:00 2015-01-01 08:00:00
25 2015-01-01 08:25:00 2015-01-01 08:00:00
26 2015-01-01 08:26:00 2015-01-01 08:15:00
27 2015-01-01 08:27:00 2015-01-01 08:15:00
...
38 2015-01-01 08:38:00 2015-01-01 08:15:00
39 2015-01-01 08:39:00 2015-01-01 08:15:00
Alternative solution from comment:
df['new1'] = df['Time'].sub(pd.Timedelta(11*60, 's')).dt.floor(freq='15min')
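To see that the two approaches give the same result, the following sketch runs both on the same 40 one-minute timestamps and compares the columns. It assumes whole-minute timestamps, as in the examples above, and uses the "min" and "15min" frequency aliases.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Time": pd.date_range("2015-01-01 08:00:00", freq="min", periods=40)
})

# Approach 1: floor to 15 minutes, then step back another 15 minutes
# when the time is 10 minutes or less past the boundary.
floored = df["Time"].dt.floor("15min")
df["new"] = np.where(df["Time"].dt.minute % 15 <= 10,
                     floored - pd.Timedelta(minutes=15), floored)

# Approach 2: shift back by 11 minutes first, then floor once.
df["new1"] = (df["Time"] - pd.Timedelta(minutes=11)).dt.floor("15min")

print((df["new"] == df["new1"]).all())  # True for whole-minute timestamps
```

Subtracting 11 minutes works because any time up to 10 minutes past a boundary lands in the previous 15-minute slot after the shift, while 11 to 14 minutes past stays in the current one.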
Here I want to calculate the time interval between consecutive rows in the time column imported from a csv file. My csv file includes date and time. I want to display the time difference between the times in consecutive rows, as in my expected output below.
My code is:-
def time_diff(start, end):
    start.append(pd.to_datetime(data['time'][0], format='%H:%M:%S').time())
    end.append(pd.to_datetime(len(data['time']), format='%H:%M:%S').time())
    if isinstance(start, datetime_time): # convert to datetime
        assert isinstance(end, datetime_time)
        start, end = [datetime.combine(datetime.min, t) for t in [start, end]]
    if start <= end:
        return end - start
    else:
        end += timedelta(1) # +day
        assert end > start
        return end - start
for index, row in data.iterrows():
    start = [datetime.strptime(t,'%H:%M:%S') for t in index]
    end = [datetime.strptime(t,'%H:%M:%S') for t in index]
    print(time_diff(s, e))
    assert time_diff(s, e) == time_diff(s.time(), e.time())
my csv file is:-
date time
10/3/2018 6:00:00
10/3/2018 7:00:00
10/3/2018 8:00:00
10/3/2018 9:00:00
10/3/2018 10:00:00
10/3/2018 11:00:00
10/3/2018 12:00:00
10/3/2018 13:45:00
10/3/2018 15:00:00
10/3/2018 16:00:00
10/3/2018 17:00:00
10/3/2018 18:00:00
10/3/2018 19:00:00
10/3/2018 20:00:00
10/3/2018 21:30:00
10/4/2018 6:00:00
My expected output (time difference) is:-
time_diff
0
1
1
1
1
1
1
1:45
1:15
1
1
1
1
1
1:30
8:30
This is the output that I want to display by using this code. But I don't know how to iterate through the rows to take the time difference between two times. My time difference is displayed in hours.
IIUC:
from io import StringIO
import pandas as pd
txtFile = StringIO("""date time
10/3/2018 6:00:00
10/3/2018 7:00:00
10/3/2018 8:00:00
10/3/2018 9:00:00
10/3/2018 10:00:00
10/3/2018 11:00:00
10/3/2018 12:00:00
10/3/2018 13:45:00
10/3/2018 15:00:00
10/3/2018 16:00:00
10/3/2018 17:00:00
10/3/2018 18:00:00
10/3/2018 19:00:00
10/3/2018 20:00:00
10/3/2018 21:30:00
10/4/2018 6:00:00""")
df = pd.read_csv(txtFile, sep=r'\s+')
pd.to_datetime(df['date'] + ' ' + df['time']).diff().fillna(pd.Timedelta(0))
Output:
0 00:00:00
1 01:00:00
2 01:00:00
3 01:00:00
4 01:00:00
5 01:00:00
6 01:00:00
7 01:45:00
8 01:15:00
9 01:00:00
10 01:00:00
11 01:00:00
12 01:00:00
13 01:00:00
14 01:30:00
15 08:30:00
dtype: timedelta64[ns]
1) Read your csv (with header and tab-separated?) into a pandas dataframe:
import pandas as pd
df = pd.read_csv('your_file.csv', header=0, sep='\t')
2) If done correctly, you would now have a dataframe with a date column and a time column. Create a pandas datetime column out of these two:
df['date_time'] = pd.to_datetime(df['date'] + ' ' + df['time'])
3) Get the date_time of the row above with shift() and calculate the difference between the date_time value of this row and its row above:
df['time_diff'] = df['date_time'] - df['date_time'].shift()
4) The first value is a NaT (not a time value) since it has no row above. Fill this value with a zero Timedelta.
df['time_diff'] = df['time_diff'].fillna(pd.Timedelta(0))
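Steps 1 to 4 can be put together as in the sketch below. It reads the last three rows of the question's data from an in-memory string instead of a file, assumes the dates are in month/day/year order, and adds an illustrative hours column to express the difference as a number of hours.

```python
from io import StringIO

import pandas as pd

# Last three rows of the question's csv, tab-separated, read from memory.
csv_text = "date\ttime\n10/3/2018\t20:00:00\n10/3/2018\t21:30:00\n10/4/2018\t6:00:00\n"
df = pd.read_csv(StringIO(csv_text), sep="\t")

# Combine date and time; the month/day/year order is an assumption.
df["date_time"] = pd.to_datetime(df["date"] + " " + df["time"],
                                 format="%m/%d/%Y %H:%M:%S")

# diff() subtracts the previous row; the first row has none, so fill with zero.
df["time_diff"] = df["date_time"].diff().fillna(pd.Timedelta(0))

# Express the difference in hours (1.5 means 1:30).
df["hours"] = df["time_diff"].dt.total_seconds() / 3600

print(df[["date_time", "time_diff", "hours"]])
```

Because the date is part of the timestamp, the difference across midnight (21:30 to 6:00 the next day) comes out as 8.5 hours without any special handling.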