I need to measure how much each employee is paid during each hour of work. The data needed some cleaning, so I'm trying to make the formatting consistent.
It is a homework problem and it's proving tough. I am new to Python, so please feel free to compress the code. I'm trying to use the pandas library.
The CSV file loaded in pandas:
break_notes end_time pay_rate start_time
0 15-18 23:00 10.0 10:00
1 18.30-19.00 23:00 12.0 18:00
2 4PM-5PM 22:30 14.0 12:00
3 3-4 18:00 10.0 09:00
4 4-4.10PM 23:00 20.0 09:00
5 15 - 17 23:00 10.0 11:00
6 11 - 13 16:00 10.0 10:00
'''
import pandas as pd
import datetime
import numpy as np
work_shifts = pd.read_csv('work_shifts.csv')
break_shifts = work_shifts['break_notes'].str.extract(r'(?P<start>[\d\.]+)?\D*(?P<end>[\d\.]+)?')
print(work_shifts)
for i in range(len(break_shifts['start'])):
    if '.' not in break_shifts['start'][i]:
        break_shifts['start'][i] = break_shifts['start'][i] + ':00'
    else:
        break_shifts['start'][i] = break_shifts['start'][i].replace('.',':')
for i in range(len(break_shifts['end'])):
    if '.' in str(break_shifts['end'][i]):
        break_shifts['end'][i] = break_shifts['end'][i].replace('.',':')
    elif '.' not in str(break_shifts['end'][i]):
        break_shifts['end'][i] = break_shifts['end'][i] + ':00'
for i in range(len(break_shifts['end'])):
    break_shifts['end'][i] = datetime.datetime.strptime(break_shifts['end'][i], '%H:%M').time()
    break_shifts['start'][i] = datetime.datetime.strptime(break_shifts['start'][i], '%H:%M').time()
work_shifts[['start_break','end_break']] = break_shifts[['start', 'end']]
for i in range(len(work_shifts['end_time'])):
    work_shifts['end_time'][i] = datetime.datetime.strptime(work_shifts['end_time'][i], '%H:%M').time()
for i in range(len(work_shifts['start_time'])):
    work_shifts['start_time'][i] = datetime.datetime.strptime(work_shifts['start_time'][i], '%H:%M').time()
print(work_shifts)
This is the result:
break_notes end_time pay_rate start_time start_break end_break
0 15-18 23:00:00 10.0 10:00:00 15:00:00 18:00:00
1 18.30-19.00 23:00:00 12.0 18:00:00 18:30:00 19:00:00
2 4PM-5PM 22:30:00 14.0 12:00:00 04:00:00 05:00:00
3 3-4 18:00:00 10.0 09:00:00 03:00:00 04:00:00
4 4-4.10PM 23:00:00 20.0 09:00:00 04:00:00 04:10:00
5 15 - 17 23:00:00 10.0 11:00:00 15:00:00 17:00:00
6 11 - 13 16:00:00 10.0 10:00:00 11:00:00 13:00:00
I tried adding the times, but they are of inconsistent types. If there's a different approach, please provide guidance. I need to calculate how many employees are working at what time, and then calculate how much pay is given to the employees per hour.
My approach was to convert the break notes into times, and then convert 12-hour times to 24-hour times when both start_break and end_break fall before noon (12:00).
I'm not sure how to calculate the money per hour. Maybe using if statements?
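One possible way to get the money per hour, sketched under stated assumptions rather than as a definitive answer: convert every time to fractional hours, treat a break that starts before the shift as a 12-hour time (add 12), assume breaks are unpaid and lie inside the shift, and then sum pay_rate times the fraction of each clock hour actually worked. The names to_hours, overlap and hourly are made up for this illustration; the sample rows are the ones shown above.

```python
import pandas as pd

shifts = pd.DataFrame({
    "break_notes": ["15-18", "18.30-19.00", "4PM-5PM", "3-4", "4-4.10PM", "15 - 17", "11 - 13"],
    "end_time": ["23:00", "23:00", "22:30", "18:00", "23:00", "23:00", "16:00"],
    "pay_rate": [10.0, 12.0, 14.0, 10.0, 20.0, 10.0, 10.0],
    "start_time": ["10:00", "18:00", "12:00", "09:00", "09:00", "11:00", "10:00"],
})

def to_hours(text):
    """Convert '18.30', '4' or '23:00' to fractional hours (18.5, 4.0, 23.0)."""
    hours, _, minutes = str(text).replace(":", ".").partition(".")
    return int(hours) + (int(minutes) if minutes else 0) / 60

# Pull the two numbers out of each break note, ignoring 'PM', '-' and spaces.
breaks = shifts["break_notes"].str.extract(r"(?P<start>[\d.]+)\D*(?P<end>[\d.]+)")

shifts["start"] = shifts["start_time"].map(to_hours)
shifts["end"] = shifts["end_time"].map(to_hours)
shifts["break_start"] = breaks["start"].map(to_hours)
shifts["break_end"] = breaks["end"].map(to_hours)

# Assumption: a break time earlier than the shift start was written in 12-hour form.
for col in ["break_start", "break_end"]:
    shifts[col] = shifts[col].where(shifts[col] >= shifts["start"], shifts[col] + 12)

def overlap(start, end, lo, hi):
    """Hours of [start, end) that fall inside the clock hour [lo, hi)."""
    return (end.clip(upper=hi) - start.clip(lower=lo)).clip(lower=0)

rows = []
for hour in range(24):
    # Assumption: breaks are unpaid and lie inside the shift.
    worked = (overlap(shifts["start"], shifts["end"], hour, hour + 1)
              - overlap(shifts["break_start"], shifts["break_end"], hour, hour + 1))
    rows.append({"hour": hour,
                 "staff_hours": worked.sum(),
                 "cost": (worked * shifts["pay_rate"]).sum()})

hourly = pd.DataFrame(rows).set_index("hour")
print(hourly[hourly["cost"] > 0])
```

Working in fractional hours sidesteps the mixed str / datetime.time types from the loops above; if paid breaks or overnight shifts are possible, the two assumptions marked in the comments would need to change.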
Related
In my dataframe I have a column that is timestamp formatted as 2021-11-18 00:58:22.705
I wish to create a column that displays the time elapsed from the initial time (the first timestamp) to each row.
There are 2 ways in which I can think of doing this, but I don't know how to make either of them work.
Method 1:
Subtract from each timestamp the one in the row above.
df["difference"]= df["timestamp"].diff()
Now that this time difference has been calculated, I would like to create another column that keeps a running sum of these differences, so that each row shows the elapsed time from the start of the process.
Method 2:
I guess another way would be to subtract the initial timestamp (the first one) from the timestamp of each row.
I do not know how I would do that.
Thanks in advance.
I have not completely understood the type of difference needed, so I am adding both that I think are reasonable:
import pandas as pd
times = pd.date_range('2022-05-23', periods=20, freq='30min')
df = pd.DataFrame({'Timestamp': times})
# elapsed minutes since the first timestamp
df['difference_in_min'] = (df.Timestamp - df.Timestamp.min()).dt.total_seconds() / 60
df['cumulative_dif_in_min'] = df.difference_in_min.cumsum()
print(df)
Timestamp difference_in_min cumulative_dif_in_min
0 2022-05-23 00:00:00 0.0 0.0
1 2022-05-23 00:30:00 30.0 30.0
2 2022-05-23 01:00:00 60.0 90.0
3 2022-05-23 01:30:00 90.0 180.0
4 2022-05-23 02:00:00 120.0 300.0
5 2022-05-23 02:30:00 150.0 450.0
6 2022-05-23 03:00:00 180.0 630.0
7 2022-05-23 03:30:00 210.0 840.0
8 2022-05-23 04:00:00 240.0 1080.0
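For comparison, here is a small sketch of the two methods described in the question, showing that they should give the same elapsed time. Only the first timestamp comes from the question; the other three rows and the column names elapsed_from_diff, elapsed and elapsed_seconds are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"timestamp": pd.to_datetime([
    "2021-11-18 00:58:22.705",   # from the question
    "2021-11-18 00:58:25.205",   # hypothetical
    "2021-11-18 00:59:00.705",   # hypothetical
    "2021-11-18 01:02:22.705",   # hypothetical
])})

# Method 1: row-to-row differences, then a running total of them.
# The first difference is NaT, so it is replaced by a zero timedelta.
df["difference"] = df["timestamp"].diff()
df["elapsed_from_diff"] = df["difference"].fillna(pd.Timedelta(0)).cumsum()

# Method 2: subtract the first timestamp from every row.
df["elapsed"] = df["timestamp"] - df["timestamp"].iloc[0]

# A plain number of seconds is often easier to plot or aggregate.
df["elapsed_seconds"] = df["elapsed"].dt.total_seconds()
print(df)
```

Method 2 is the shorter of the two and does not depend on the rows being consecutive; if the frame is not sorted, df["timestamp"].min() could replace iloc[0].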
I am trying to visualise rain events using data contained in a dataframe.
The idea seems very simple, but the execution seems to be impossible!
Here is a part of the dataframe:
start_time end_time duration br_open_total
0 2022-01-01 10:00:00 2022-01-01 19:00:00 9.0 0.2540000563879943
1 2022-01-02 00:00:00 2022-01-02 10:00:00 10.0 1.0160002255520624
2 2022-01-02 17:00:00 2022-01-03 02:00:00 9.0 0.7620001691640113
3 2022-01-03 02:00:00 2022-01-04 12:00:00 34.0 10.668002368296513
4 2022-01-07 21:00:00 2022-01-08 06:00:00 9.0 0.2540000563879943
5 2022-01-16 05:00:00 2022-01-16 20:00:00 15.0 0.5080001127760454
6 2022-01-19 04:00:00 2022-01-19 17:00:00 13.0 0.7620001691640255
7 2022-01-21 14:00:00 2022-01-22 00:00:00 10.0 1.5240003383280751
8 2022-01-27 02:00:00 2022-01-27 16:00:00 14.0 3.0480006766561503
9 2022-02-01 12:00:00 2022-02-01 21:00:00 9.0 0.2540000563880126
10 2022-02-03 05:00:00 2022-02-03 15:00:00 10.0 0.5080001127760251
What I want to do is have a plot with time on the x axis, and the value of the 'br_open_total' on the y axis.
I can draw what I want it to look like, see below:
I apologise for the simplicity of the drawing, but I think it explains what I want to do.
How do I do this, and then repeat it for other dataframes on the same plot?
I have tried staircase, matplotlib.pyplot.stair and others with no success.
It seems such a simple concept!
Edit 1:
Tried Joswin K J's answer with the actual data, and got this:
The event at 02-12 11:00 should be 112 hours duration, but the bar is the same width as all the others.
Edit2:
Tried Mozway's answer and got this:
It still doesn't show the width of each event, and doesn't discretise the events either.
Edit 3:
Using Mozway's amended answer I get this plot for the actual data:
I have added the cursor position using Paint. At the top right of the plot you can see that the cursor is at 2022-02-09 and 20.34, which is actually the value for 2022-02-01, so it seems that the plot is shifted to the left by one data point. Also, the large block between 2022-03-01 and 2022-04-03 doesn't seem to be in the data.
edit 4: as requested by Mozway
Reshaped Data
duration br_open_total variable date
0 10.0 1.0160002255520624 start_time 2022-01-02 00:00:00
19 10.0 0.0 end_time 2022-01-02 10:00:00
1 9.0 0.7620001691640113 start_time 2022-01-02 17:00:00
2 34.0 10.668002368296513 start_time 2022-01-03 02:00:00
21 34.0 0.0 end_time 2022-01-04 12:00:00
3 15.0 0.5080001127760454 start_time 2022-01-16 05:00:00
22 15.0 0.0 end_time 2022-01-16 20:00:00
4 13.0 0.7620001691640255 start_time 2022-01-19 04:00:00
23 13.0 0.0 end_time 2022-01-19 17:00:00
5 10.0 1.5240003383280751 start_time 2022-01-21 14:00:00
24 10.0 0.0 end_time 2022-01-22 00:00:00
6 14.0 3.0480006766561503 start_time 2022-01-27 02:00:00
25 14.0 0.0 end_time 2022-01-27 16:00:00
7 10.0 0.5080001127760251 start_time 2022-02-03 05:00:00
26 10.0 0.0 end_time 2022-02-03 15:00:00
8 18.0 7.366001635252363 start_time 2022-02-03 23:00:00
27 18.0 0.0 end_time 2022-02-04 17:00:00
9 13.0 2.28600050749211 start_time 2022-02-05 11:00:00
28 13.0 0.0 end_time 2022-02-06 00:00:00
10 19.0 2.2860005074921173 start_time 2022-02-06 04:00:00
29 19.0 0.0 end_time 2022-02-06 23:00:00
11 13.0 1.2700002819400584 start_time 2022-02-07 11:00:00
30 13.0 0.0 end_time 2022-02-08 00:00:00
12 12.0 2.79400062026814 start_time 2022-02-09 01:00:00
31 12.0 0.0 end_time 2022-02-09 13:00:00
13 112.0 20.320004511041 start_time 2022-02-12 11:00:00
32 112.0 0.0 end_time 2022-02-17 03:00:00
14 28.0 2.0320004511041034 start_time 2022-02-18 14:00:00
33 28.0 0.0 end_time 2022-02-19 18:00:00
15 17.0 17.272003834384847 start_time 2022-02-23 17:00:00
34 17.0 0.0 end_time 2022-02-24 10:00:00
16 9.0 0.7620001691640397 start_time 2022-02-27 13:00:00
35 9.0 0.0 end_time 2022-02-27 22:00:00
17 18.0 4.0640009022082 start_time 2022-04-04 00:00:00
36 18.0 0.0 end_time 2022-04-04 18:00:00
18 15.0 1.0160002255520482 start_time 2022-04-06 05:00:00
37 15.0 0.0 end_time 2022-04-06 20:00:00
when plotted using
plt.step(bdf2['date'], bdf2['br_open_total'])
plt.gcf().set_size_inches(10, 4)
plt.xticks(rotation=90)
produces the plot shown above, in which the top left corner of a block corresponds to the previous data point.
edit 5: further info
When I plot all my dataframes (different sensors) I get the same differential on the event start and end times?
You can use a step plot:
# ensure datetime
df['start_time'] = pd.to_datetime(df['start_time'])
df['end_time'] = pd.to_datetime(df['end_time'])
# reshape the data
df2 = (df
    .melt(id_vars=['duration', 'br_open_total'], value_name='date')
    .sort_values(by='date')
    .drop_duplicates(subset='date')
    .assign(br_open_total=lambda d: d['br_open_total'].mask(d['variable'].eq('end_time'), 0))
)
# plot
import matplotlib.pyplot as plt
plt.step(df2['date'], df2['br_open_total'])
plt.gcf().set_size_inches(10, 4)
output:
reshaped data:
duration br_open_total variable date
0 9.0 0.254000 start_time 2022-01-01 10:00:00
11 9.0 0.000000 end_time 2022-01-01 19:00:00
1 10.0 1.016000 start_time 2022-01-02 00:00:00
12 10.0 0.000000 end_time 2022-01-02 10:00:00
2 9.0 0.762000 start_time 2022-01-02 17:00:00
3 34.0 10.668002 start_time 2022-01-03 02:00:00
14 34.0 0.000000 end_time 2022-01-04 12:00:00
4 9.0 0.254000 start_time 2022-01-07 21:00:00
15 9.0 0.000000 end_time 2022-01-08 06:00:00
5 15.0 0.508000 start_time 2022-01-16 05:00:00
16 15.0 0.000000 end_time 2022-01-16 20:00:00
6 13.0 0.762000 start_time 2022-01-19 04:00:00
17 13.0 0.000000 end_time 2022-01-19 17:00:00
7 10.0 1.524000 start_time 2022-01-21 14:00:00
18 10.0 0.000000 end_time 2022-01-22 00:00:00
8 14.0 3.048001 start_time 2022-01-27 02:00:00
19 14.0 0.000000 end_time 2022-01-27 16:00:00
9 9.0 0.254000 start_time 2022-02-01 12:00:00
20 9.0 0.000000 end_time 2022-02-01 21:00:00
10 10.0 0.508000 start_time 2022-02-03 05:00:00
21 10.0 0.000000 end_time 2022-02-03 15:00:00
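The left shift described in Edit 3 of the question is consistent with the default where='pre' of matplotlib's step, which draws each value backwards to the previous x position. A minimal sketch of the same reshaping with where='post' follows, so that each value holds from its own date until the next one. The helper names to_steps and plot_events are hypothetical, and the sample rows are the first four events of the question with rounded totals.

```python
import pandas as pd

events = pd.DataFrame({
    "start_time": pd.to_datetime(["2022-01-01 10:00", "2022-01-02 00:00",
                                  "2022-01-02 17:00", "2022-01-03 02:00"]),
    "end_time": pd.to_datetime(["2022-01-01 19:00", "2022-01-02 10:00",
                                "2022-01-03 02:00", "2022-01-04 12:00"]),
    "br_open_total": [0.254, 1.016, 0.762, 10.668],
})

def to_steps(df, value_col):
    """One row per start/end time; the value drops to 0 at each end time."""
    long = (
        df.melt(id_vars=[value_col], value_vars=["start_time", "end_time"],
                value_name="date")
          # Stable sort keeps a start row ahead of an end row with the same date,
          # so drop_duplicates keeps the start when two events touch.
          .sort_values("date", kind="stable")
          .drop_duplicates(subset="date")
    )
    long[value_col] = long[value_col].mask(long["variable"].eq("end_time"), 0)
    return long.reset_index(drop=True)

def plot_events(ax, steps, value_col, **kwargs):
    """Draw the events on a matplotlib Axes passed in by the caller."""
    # where="post": the interval [x[i], x[i+1]) takes the value y[i].
    return ax.step(steps["date"], steps[value_col], where="post", **kwargs)

steps = to_steps(events, "br_open_total")
print(steps)
```

With matplotlib installed this could be used as fig, ax = plt.subplots() followed by plot_events(ax, steps, 'br_open_total', label='Bresser/Open'), repeating the call with other reshaped dataframes to put several sensors on the same axes.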
Try this:
import matplotlib.pyplot as plt
for ind,row in df.iterrows():
    plt.plot(pd.Series([row['start_time'],row['end_time']]),pd.Series([row['br_open_total'],row['br_open_total']]),color='b')
    plt.plot(pd.Series([row['start_time'],row['start_time']]),pd.Series([0,row['br_open_total']]),color='b')
    plt.plot(pd.Series([row['end_time'],row['end_time']]),pd.Series([0,row['br_open_total']]),color='b')
plt.xticks(rotation=90)
Result:
I believe I have now cracked it, with a great debt of thanks to @Mozway.
The code to restructure the dataframe for plotting:
#create dataframes of each open gauge events removing any event with an open total of less than 0.254mm
#bresser/open
bdftdf=bdf.loc[bdf['br_open_total'] > 0.255]
bdftdf=bdftdf.copy()
bdftdf['start_time'] = pd.to_datetime(bdftdf['start_time'])
bdftdf['end_time'] = pd.to_datetime(bdftdf['end_time'])
bdf2 = (bdftdf
    .melt(id_vars=['duration', 'ic_total','mc_total','md_total','imd_total','oak_total','highpoint_total','school_total','br_open_total',
                   'fr_gauge_total','open_mean_total','br_open_ic_%_int','br_open_mc_%_int','br_open_md_%_int','br_open_imd_%_int',
                   'br_open_oak_%_int'], value_name='date')
    .sort_values(by='date')
    #.drop_duplicates(subset='date')
    .assign(br_open_total=lambda d: d['br_open_total'].mask(d['variable'].eq('end_time'), 0))
)
#create array for stairs plot
bdfarr=np.array(bdf2['date'])
bl=len(bdf2)
bdfarr=np.append(bdfarr,[bdfarr[bl-1]+np.timedelta64(1,'h')])
Rather than use plt.step as suggested by Mozway, I have used plt.stairs, after creating an array from the 'date' column of the dataframe and appending an extra element equal to the last element + 1 hour (plt.stairs needs one more edge than there are values).
This means that the data now plots as I had intended it to:
code for plot:
fig1=plt.figure()
plt.stairs(bdf2['br_open_total'], bdfarr, label='Bresser\Open')
plt.stairs(frdf2['fr_gauge_total'], frdfarr, label='FR Gauge')
plt.stairs(hpdf2['highpoint_total'], hpdfarr, label='Highpoint')
plt.stairs(schdf2['school_total'], schdfarr, label='School')
plt.stairs(opmedf2['open_mean_total'], opmedfarr, label='Open mean')
plt.xticks(rotation=90)
plt.legend(title='Rain events', loc='best')
plt.show()
I have the following table:
Hora_Retiro count_uses
0 00:00:18 1
1 00:00:34 1
2 00:02:27 1
3 00:03:13 1
4 00:06:45 1
... ... ...
748700 23:58:47 1
748701 23:58:49 1
748702 23:59:11 1
748703 23:59:47 1
748704 23:59:56 1
And I want to group all values within each hour, so I can see the total number of uses per hour (00:00:00 - 23:00:00)
I have the following code:
hora_pico_aug= hora_pico.groupby(pd.Grouper(key="Hora_Retiro",freq='H')).count()
Hora_Retiro column is of timedelta64[ns] type
Which gives the following output:
count_uses
Hora_Retiro
00:00:02 2566
01:00:02 602
02:00:02 295
03:00:02 5
04:00:02 10
05:00:02 4002
06:00:02 16075
07:00:02 39410
08:00:02 76272
09:00:02 56721
10:00:02 36036
11:00:02 32011
12:00:02 33725
13:00:02 41032
14:00:02 50747
15:00:02 50338
16:00:02 42347
17:00:02 54674
18:00:02 76056
19:00:02 57958
20:00:02 34286
21:00:02 22509
22:00:02 13894
23:00:02 7134
However, the index column starts at 00:00:02, and I want it to start at 00:00:00 and then continue in one-hour intervals. Something like this:
count_uses
Hora_Retiro
00:00:00 2565
01:00:00 603
02:00:00 295
03:00:00 5
04:00:00 10
05:00:00 4002
06:00:00 16075
07:00:00 39410
08:00:00 76272
09:00:00 56721
10:00:00 36036
11:00:00 32011
12:00:00 33725
13:00:00 41032
14:00:00 50747
15:00:00 50338
16:00:00 42347
17:00:00 54674
18:00:00 76056
19:00:00 57958
20:00:00 34286
21:00:00 22509
22:00:00 13894
23:00:00 7134
How can I make it start at 00:00:00?
Thanks for the help!
You can create an hour column from the Hora_Retiro column (this assumes a datetime dtype; .dt.hour is not available on a timedelta column).
df['hour'] = df['Hora_Retiro'].dt.hour
Then group by hour:
gpby_df = df.groupby('hour')['count_uses'].sum().reset_index()
gpby_df['hour'] = pd.to_datetime(gpby_df['hour'], format='%H').dt.time
gpby_df.columns = ['Hora_Retiro', 'sum_count_uses']
gpby_df
gives
Hora_Retiro sum_count_uses
0 00:00:00 14
1 09:00:00 1
2 10:00:00 2
3 20:00:00 2
I assume that the Hora_Retiro column in your DataFrame is of
Timedelta type. It is not datetime, as in that case the date part
would also be printed.
Indeed, your code creates groups starting at the minute / second
taken from the first row.
To group by "full hours":
floor each element in this column to the hour (rounding to the nearest
hour would move e.g. 00:45:00 into the 01:00:00 group),
then group (just by this floored value).
The code to do it is:
hora_pico.groupby(hora_pico.Hora_Retiro.dt.floor('h')).count_uses.count()
However, I advise you to decide what you want to count:
rows or values in the count_uses column.
In the second case, replace the count function with sum.
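A minimal sketch of grouping a timedelta column into full hours with Series.dt.floor, on made-up sample rows (only the column names come from the question). It also shows how this differs from rounding to the nearest hour, which moves times past the half hour into the next group.

```python
import pandas as pd

# Hypothetical sample in the shape of the question's table.
hora_pico = pd.DataFrame({
    "Hora_Retiro": pd.to_timedelta(
        ["00:00:18", "00:45:34", "01:02:27", "01:59:59", "23:59:56"]),
    "count_uses": [1, 1, 1, 1, 1],
})

# floor() always goes down to the start of the hour: 00:45:34 -> 00:00:00.
floored = hora_pico["Hora_Retiro"].dt.floor("h")
per_hour = hora_pico.groupby(floored)["count_uses"].sum()

# round() goes to the nearest hour instead: 00:45:34 -> 01:00:00,
# and 23:59:56 -> 1 day 00:00:00, which creates an unwanted extra group.
rounded = hora_pico["Hora_Retiro"].dt.round("h")

print(per_hour)
```

The index of per_hour keeps the timedelta dtype, so it prints as 0 days 00:00:00, 0 days 01:00:00 and so on, starting exactly on the hour.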
I have two data frames like following, data frame A has datetime even with minutes, data frame B only has hour.
df:A
dataDate original
2018-09-30 11:20:00 3
2018-10-01 12:40:00 10
2018-10-02 07:00:00 5
2018-10-27 12:50:00 5
2018-11-28 19:45:00 7
df:B
dataDate count
2018-09-30 10:00:00 300
2018-10-01 12:00:00 50
2018-10-02 07:00:00 120
2018-10-27 12:00:00 234
2018-11-28 19:05:00 714
I would like to merge the two on the basis of date and hour, so that each row of dataframe A is filled with the matching value from B.
I can try to do it via
A['date'] = A.dataDate.dt.date
B['date'] = B.dataDate.dt.date
A['hour'] = A.dataDate.dt.hour
B['hour'] = B.dataDate.dt.hour
and then merge
merge_df = pd.merge(A, B, how='left', left_on=['date', 'hour'],
                    right_on=['date', 'hour'])
but it's a very long process. Is there an efficient way to perform the same operation with the help of pandas time series or date functionality?
Use map if you only need to append one column from B to A, with floor to set the minutes and seconds (if present) to 0:
d = dict(zip(B.dataDate.dt.floor('H'), B['count']))
A['count'] = A.dataDate.dt.floor('H').map(d)
print (A)
dataDate original count
0 2018-09-30 11:20:00 3 NaN
1 2018-10-01 12:40:00 10 50.0
2 2018-10-02 07:00:00 5 120.0
3 2018-10-27 12:50:00 5 234.0
4 2018-11-28 19:45:00 7 714.0
For a general solution, use DataFrame.join:
A.index = A.dataDate.dt.floor('H')
B.index = B.dataDate.dt.floor('H')
A = A.join(B, lsuffix='_left')
print (A)
dataDate_left original dataDate count
dataDate
2018-09-30 11:00:00 2018-09-30 11:20:00 3 NaT NaN
2018-10-01 12:00:00 2018-10-01 12:40:00 10 2018-10-01 12:00:00 50.0
2018-10-02 07:00:00 2018-10-02 07:00:00 5 2018-10-02 07:00:00 120.0
2018-10-27 12:00:00 2018-10-27 12:50:00 5 2018-10-27 12:00:00 234.0
2018-11-28 19:00:00 2018-11-28 19:45:00 7 2018-11-28 19:05:00 714.0
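The same idea can also be written as a single left merge on a temporary floored key, which leaves A and B themselves unmodified. This is only a sketch: the key name hour is made up, and the frames are rebuilt here from the sample data in the question.

```python
import pandas as pd

A = pd.DataFrame({
    "dataDate": pd.to_datetime(["2018-09-30 11:20", "2018-10-01 12:40",
                                "2018-10-02 07:00", "2018-10-27 12:50",
                                "2018-11-28 19:45"]),
    "original": [3, 10, 5, 5, 7],
})
B = pd.DataFrame({
    "dataDate": pd.to_datetime(["2018-09-30 10:00", "2018-10-01 12:00",
                                "2018-10-02 07:00", "2018-10-27 12:00",
                                "2018-11-28 19:05"]),
    "count": [300, 50, 120, 234, 714],
})

# Build the hour key on the fly with assign(), merge on it, then drop it.
merged = (
    A.assign(hour=A["dataDate"].dt.floor("h"))
     .merge(B.assign(hour=B["dataDate"].dt.floor("h")).drop(columns="dataDate"),
            on="hour", how="left")
     .drop(columns="hour")
)
print(merged)
```

If B can contain more than one row within the same hour, a left merge repeats the matching row of A for each of them, so B may need to be aggregated per hour first.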
I have raw data like this and want to find the difference between the two times in minutes. The problem is that the data is in a data frame.
source:
start time end time
0 08:30:00 17:30:00
1 11:00:00 17:30:00
2 08:00:00 21:30:00
3 19:30:00 22:00:00
4 19:00:00 00:00:00
5 08:30:00 15:30:00
I need an output like this:
duration
540mint
798mint
162mint
1140mint
420mint
Your expected output seems to be incorrect. That aside, we can use base R's difftime:
transform(
    df,
    duration = difftime(
        strptime(end.time, format = "%H:%M:%S"),
        strptime(start.time, format = "%H:%M:%S"),
        units = "mins"))
# start.time end.time duration
#0 08:30:00 17:30:00 540 mins
#1 11:00:00 17:30:00 390 mins
#2 08:00:00 21:30:00 810 mins
#3 19:30:00 22:00:00 150 mins
#4 19:00:00 00:00:00 -1140 mins
#5 08:30:00 15:30:00 420 mins
or as a difftime vector
with(df, difftime(
    strptime(end.time, format = "%H:%M:%S"),
    strptime(start.time, format = "%H:%M:%S"),
    units = "mins"))
#Time differences in mins
#[1] 540 390 810 150 -1140 420
Sample data
df <- read.table(text =
" 'start time' 'end time'
0 08:30:00 17:30:00
1 11:00:00 17:30:00
2 08:00:00 21:30:00
3 19:30:00 22:00:00
4 19:00:00 00:00:00
5 08:30:00 15:30:00", header = T, row.names = 1)
import pandas as pd
df = pd.DataFrame({'start time':['08:30:00','11:00:00','08:00:00','19:30:00','19:00:00','08:30:00'],'end time':['17:30:00','17:30:00','21:30:00','22:00:00','00:00:00','15:30:00']},columns=['start time','end time'])
df
Out[355]:
start time end time
0 08:30:00 17:30:00
1 11:00:00 17:30:00
2 08:00:00 21:30:00
3 19:30:00 22:00:00
4 19:00:00 00:00:00
5 08:30:00 15:30:00
(pd.to_datetime(df['end time']) - pd.to_datetime(df['start time'])).dt.seconds/60
Out[356]:
0 540.0
1 390.0
2 810.0
3 150.0
4 300.0
5 420.0
dtype: float64
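Row 4 comes out as 300 here because a negative timedelta is stored as days=-1 plus a positive seconds part, and .dt.seconds returns only that seconds part. A sketch that makes the midnight handling explicit is below; it assumes that an end time earlier than the start time means the interval ends on the next day, and the column name duration_min is made up.

```python
import pandas as pd

df = pd.DataFrame({
    "start time": ["08:30:00", "11:00:00", "08:00:00", "19:30:00", "19:00:00", "08:30:00"],
    "end time": ["17:30:00", "17:30:00", "21:30:00", "22:00:00", "00:00:00", "15:30:00"],
})

# Parse the clock times as offsets from midnight.
start = pd.to_timedelta(df["start time"])
end = pd.to_timedelta(df["end time"])
delta = end - start

# Assumption: a negative difference means the interval crosses midnight,
# so one day is added to those rows only.
delta = delta.where(delta >= pd.Timedelta(0), delta + pd.Timedelta(days=1))

df["duration_min"] = delta.dt.total_seconds() / 60
print(df)
```

Using total_seconds() rather than .dt.seconds keeps the result correct if a duration ever reaches a full day or more.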
Yes, definitely datetime is what you need here. Specifically, the strptime function, which parses a string into a datetime object.
from datetime import datetime
s1 = '10:33:26'
s2 = '11:15:49' # for example
FMT = '%H:%M:%S'
tdelta = datetime.strptime(s2, FMT) - datetime.strptime(s1, FMT)
That gets you a timedelta object that contains the difference between the two times. You can do whatever you want with that, e.g. converting it to seconds or adding it to another datetime.
This will return a negative result if the end time is earlier than the start time, for example s1 = 12:00:00 and s2 = 05:00:00. If you want the code to assume the interval crosses midnight in this case (i.e. it should assume the end time is never earlier than the start time), you can add the following lines to the above code:
if tdelta.days < 0:
    tdelta = timedelta(days=0,
                seconds=tdelta.seconds, microseconds=tdelta.microseconds)
(of course you need to include from datetime import timedelta somewhere). Thanks to J.F. Sebastian for pointing out this use case.
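Put together, the pieces above might look like the following sketch. The function name time_between is made up, and it keeps the same assumption that an end time earlier than the start time means the interval crosses midnight.

```python
from datetime import datetime, timedelta

FMT = "%H:%M:%S"

def time_between(start, end, fmt=FMT):
    """Return end - start as a timedelta for two clock-time strings."""
    tdelta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    if tdelta.days < 0:
        # Assumption: the end is on the next day, so wrap around midnight.
        tdelta += timedelta(days=1)
    return tdelta

same_day = time_between("10:33:26", "11:15:49")
overnight = time_between("12:00:00", "05:00:00")

print(same_day, same_day.total_seconds())
print(overnight, overnight.total_seconds() / 60)
```

Adding one day has the same effect as rebuilding the timedelta with days=0, since a negative difference between two times on the same date always has days equal to -1.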