I am trying to calculate the total cost of staffing requirements over a day. My attempt is to group the People required throughout the day and multiply by the cost per person. I then try to group this cost per hour, but my output isn't correct.
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as dates
d = ({
'Time' : ['0/1/1900 8:00:00','0/1/1900 9:59:00','0/1/1900 10:00:00','0/1/1900 12:29:00','0/1/1900 12:30:00','0/1/1900 13:00:00','0/1/1900 13:02:00','0/1/1900 13:15:00','0/1/1900 13:20:00','0/1/1900 18:10:00','0/1/1900 18:15:00','0/1/1900 18:20:00','0/1/1900 18:25:00','0/1/1900 18:45:00','0/1/1900 18:50:00','0/1/1900 19:05:00','0/1/1900 19:07:00','0/1/1900 21:57:00','0/1/1900 22:00:00','0/1/1900 22:30:00','0/1/1900 22:35:00','1/1/1900 3:00:00','1/1/1900 3:05:00','1/1/1900 3:20:00','1/1/1900 3:25:00'],
'People' : [1,1,2,2,3,3,2,2,3,3,4,4,3,3,2,2,3,3,4,4,3,3,2,2,1],
})
df = pd.DataFrame(data = d)
df['Time'] = ['/'.join([str(int(x.split('/')[0])+1)] + x.split('/')[1:]) for x in df['Time']]
df['Time'] = pd.to_datetime(df['Time'], format='%d/%m/%Y %H:%M:%S')
formatter = dates.DateFormatter('%Y-%m-%d %H:%M:%S')
df = df.groupby(pd.Grouper(freq='15T',key='Time'))['People'].max().ffill()
df = df.reset_index(level=['Time'])
df['Cost'] = df['People'] * 26
cost = df.groupby([df['Time'].dt.hour])['Cost'].sum()
#For reference. This plot displays people required throughout the day
fig, ax = plt.subplots(figsize = (10,5))
plt.plot(df['Time'], df['People'], color = 'blue')
plt.locator_params(axis='y', nbins=6)
ax.xaxis.set_major_formatter(formatter)
ax.xaxis.set_major_formatter(dates.DateFormatter('%H:%M:%S'))
plt.ylabel('People Required', labelpad = 10)
plt.xlabel('Time', labelpad = 10)
print(cost)
Out:
0 416.0
1 416.0
2 416.0
3 130.0
8 104.0
9 104.0
10 208.0
11 208.0
12 260.0
13 312.0
14 312.0
15 312.0
16 312.0
17 312.0
18 364.0
19 312.0
20 312.0
21 312.0
22 416.0
23 416.0
I have done the calculations manually and the total cost output should be:
$1456
I think the wrong numbers in your question are most likely caused by the incorrect datetime values that you have. Once you have fixed those, you should get the correct numbers. Here's an attempt from my end, with a small tweak to the Time column.
import pandas as pd
df = pd.DataFrame({
'Time' : ['1/1/1900 8:00:00','1/1/1900 9:59:00','1/1/1900 10:00:00','1/1/1900 12:29:00','1/1/1900 12:30:00','1/1/1900 13:00:00','1/1/1900 13:02:00','1/1/1900 13:15:00','1/1/1900 13:20:00','1/1/1900 18:10:00','1/1/1900 18:15:00','1/1/1900 18:20:00','1/1/1900 18:25:00','1/1/1900 18:45:00','1/1/1900 18:50:00','1/1/1900 19:05:00','1/1/1900 19:07:00','1/1/1900 21:57:00','1/1/1900 22:00:00','1/1/1900 22:30:00','1/1/1900 22:35:00','1/2/1900 3:00:00','1/2/1900 3:05:00','1/2/1900 3:20:00','1/2/1900 3:25:00'],
'People' : [1,1,2,2,3,3,2,2,3,3,4,4,3,3,2,2,3,3,4,4,3,3,2,2,1],
})
>>>df
Time People
0 1/1/1900 8:00:00 1
1 1/1/1900 9:59:00 1
2 1/1/1900 10:00:00 2
3 1/1/1900 12:29:00 2
4 1/1/1900 12:30:00 3
5 1/1/1900 13:00:00 3
6 1/1/1900 13:02:00 2
7 1/1/1900 13:15:00 2
8 1/1/1900 13:20:00 3
9 1/1/1900 18:10:00 3
10 1/1/1900 18:15:00 4
11 1/1/1900 18:20:00 4
12 1/1/1900 18:25:00 3
13 1/1/1900 18:45:00 3
14 1/1/1900 18:50:00 2
15 1/1/1900 19:05:00 2
16 1/1/1900 19:07:00 3
17 1/1/1900 21:57:00 3
18 1/1/1900 22:00:00 4
19 1/1/1900 22:30:00 4
20 1/1/1900 22:35:00 3
21 1/2/1900 3:00:00 3
22 1/2/1900 3:05:00 2
23 1/2/1900 3:20:00 2
24 1/2/1900 3:25:00 1
df.Time = pd.to_datetime(df.Time)
df.set_index('Time', inplace=True)
df_group = df.resample('15T').max().ffill()
df_hour = df_group.resample('1h').max()
df_hour['Cost'] = df_hour['People'] * 26
>>>df_hour
People Cost
Time
1900-01-01 08:00:00 1.0 26.0
1900-01-01 09:00:00 1.0 26.0
1900-01-01 10:00:00 2.0 52.0
1900-01-01 11:00:00 2.0 52.0
1900-01-01 12:00:00 3.0 78.0
1900-01-01 13:00:00 3.0 78.0
1900-01-01 14:00:00 3.0 78.0
1900-01-01 15:00:00 3.0 78.0
1900-01-01 16:00:00 3.0 78.0
1900-01-01 17:00:00 3.0 78.0
1900-01-01 18:00:00 4.0 104.0
1900-01-01 19:00:00 3.0 78.0
1900-01-01 20:00:00 3.0 78.0
1900-01-01 21:00:00 3.0 78.0
1900-01-01 22:00:00 4.0 104.0
1900-01-01 23:00:00 4.0 104.0
1900-01-02 00:00:00 4.0 104.0
1900-01-02 01:00:00 4.0 104.0
1900-01-02 02:00:00 4.0 104.0
1900-01-02 03:00:00 3.0 78.0
>>>df_hour.sum()
People 60.0
Cost 1560.0
dtype: float64
Edit: It took me a second read to understand the methodology that you're using. The incorrect number that you got is likely due to grouping by sum() after you performed a ffill() on your aggregated People column. Since ffill() fills the holes with the last valid value, every 15-minute bin in an hour carries a headcount, and summing those bins counts the hourly cost once per bin, so you overestimate the cost for these periods. You should be using max() again, to find the maximum headcount required for that hour.
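To make the overcounting concrete, here is a minimal sketch on a small made-up sample. The three timestamps and the variable names are hypothetical; only the rate of 26 comes from the question. It compares summing the 15-minute bins, taking the hourly peak, and a time-weighted variant that charges a quarter of the hourly rate per bin. Which of the last two matches the intended definition of "cost" is an assumption you would need to confirm.

```python
import pandas as pd

HOURLY_RATE = 26  # rate taken from the question

# Hypothetical sample: headcount observed at three points in time
df = pd.DataFrame({
    "Time": pd.to_datetime(["1900-01-01 08:00", "1900-01-01 09:30", "1900-01-01 10:59"]),
    "People": [1, 2, 2],
})

# 15-minute bins, empty bins filled with the last known headcount
bins = df.set_index("Time")["People"].resample("15min").max().ffill()

# Summing the bins charges the full hourly rate four times per hour
summed = (bins * HOURLY_RATE).resample("60min").sum()

# Peak headcount per hour, charged once at the hourly rate
peak = bins.resample("60min").max() * HOURLY_RATE

# Time-weighted: each 15-minute bin costs a quarter of the hourly rate
weighted = (bins * HOURLY_RATE * 0.25).resample("60min").sum()

print(pd.DataFrame({"summed": summed, "peak": peak, "weighted": weighted}))
```

The summed column is four times too large whenever the headcount is constant across the hour, which is the pattern visible in the question's output.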
I have a large dataset with a date column that starts in 2019. I want to generate week numbers from those dates in a separate column.
Here is what the date column looks like:
import pandas as pd
data = {'date': ['2019-09-10', 'NaN', '2019-10-07', '2019-11-04', '2019-11-28',
'2019-12-02', '2020-01-24', '2020-01-29', '2020-02-05',
'2020-02-12', '2020-02-14', '2020-02-24', '2020-03-11',
'2020-03-16', '2020-03-17', '2020-03-18', '2021-09-14',
'2021-09-30', '2021-10-07', '2021-10-08', '2021-10-12',
'2021-10-14', '2021-10-15', '2021-10-19', '2021-10-21',
'2021-10-26', '2021-10-28', '2021-10-29', '2021-11-02',
'2021-11-15', '2021-11-16', '2021-12-01', '2021-12-07',
'2021-12-09', '2021-12-10', '2021-12-14', '2021-12-15',
'2022-01-13', '2022-01-14', '2022-01-21', '2022-01-24',
'2022-01-25', '2022-01-27', '2022-01-31', '2022-02-01',
'2022-02-10', '2022-02-11', '2022-02-16', '2022-02-24']}
df = pd.DataFrame(data)
Starting from the first day the data was collected, I want to count 7 days using the date column and make a week out of them. For example, the first 7 days form week one, and I create a column that labels them as such. I want to repeat this process until the last week in which data was collected.
It might be a good idea to sort the dates first, from the first date to the most recent one.
I have tried this, but it does not generate weeks in order; it produces repeated week numbers.
pd.to_datetime(df['date'], errors='coerce').dt.week
My intention is to start from the first date the data was collected, count 7 days and store that as week one, then continue incrementally until the last week, say week number 66.
Here is the expected column of weeks created from the date column
import pandas as pd
week_df = {'weeks': ['1', '2', "3", "5", '6']}
df_weeks = pd.DataFrame(week_df)
IIUC use:
df['date'] = pd.to_datetime(df['date'])
df['week'] = df['date'].sub(df['date'].iat[0]).dt.days // 7 + 1
print (df.head(10))
date week
0 2019-09-10 1.0
1 NaT NaN
2 2019-10-07 4.0
3 2019-11-04 8.0
4 2019-11-28 12.0
5 2019-12-02 12.0
6 2020-01-24 20.0
7 2020-01-29 21.0
8 2020-02-05 22.0
9 2020-02-12 23.0
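The line above measures weeks from the first row, so it relies on that row holding the earliest date. A small sketch of the same idea that uses the minimum date instead, which should also work on unsorted data (the sample dates and the names `dates`, `start` and `week` are made up for illustration):

```python
import pandas as pd

# Hypothetical, deliberately unsorted sample with one missing value
dates = pd.to_datetime(
    pd.Series(["2019-10-07", "2019-09-10", None, "2019-09-16", "2019-09-17"])
)

start = dates.min()  # earliest valid date; NaT is skipped

# Whole days since the start, cut into blocks of 7, numbered from 1
week = (dates - start).dt.days // 7 + 1

print(pd.DataFrame({"date": dates, "week": week}))
```

Rows with a missing date stay NaN, as in the output above.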
You have more than 66 weeks here, so either you want the real week count since the beginning or you want a dummy week rank. See below for both solutions:
# convert to week period
s = pd.to_datetime(df['date']).dt.to_period('W')
# get real week number
df['week'] = s.sub(s.iloc[0]).dropna().apply(lambda x: x.n).add(1)
# get dummy week rank
df['week2'] = s.rank(method='dense')
output:
date week week2
0 2019-09-10 1.0 1.0
1 NaN NaN NaN
2 2019-10-07 5.0 2.0
3 2019-11-04 9.0 3.0
4 2019-11-28 12.0 4.0
5 2019-12-02 13.0 5.0
6 2020-01-24 20.0 6.0
7 2020-01-29 21.0 7.0
8 2020-02-05 22.0 8.0
9 2020-02-12 23.0 9.0
10 2020-02-14 23.0 9.0
11 2020-02-24 25.0 10.0
12 2020-03-11 27.0 11.0
13 2020-03-16 28.0 12.0
14 2020-03-17 28.0 12.0
15 2020-03-18 28.0 12.0
16 2021-09-14 106.0 13.0
17 2021-09-30 108.0 14.0
18 2021-10-07 109.0 15.0
19 2021-10-08 109.0 15.0
...
42 2022-01-27 125.0 26.0
43 2022-01-31 126.0 27.0
44 2022-02-01 126.0 27.0
45 2022-02-10 127.0 28.0
46 2022-02-11 127.0 28.0
47 2022-02-16 128.0 29.0
48 2022-02-24 129.0 30.0
I am trying to visualise rain events using data contained in a dataframe.
The idea seems very simple, but the execution seems to be impossible!
Here is part of the dataframe:
start_time end_time duration br_open_total
0 2022-01-01 10:00:00 2022-01-01 19:00:00 9.0 0.2540000563879943
1 2022-01-02 00:00:00 2022-01-02 10:00:00 10.0 1.0160002255520624
2 2022-01-02 17:00:00 2022-01-03 02:00:00 9.0 0.7620001691640113
3 2022-01-03 02:00:00 2022-01-04 12:00:00 34.0 10.668002368296513
4 2022-01-07 21:00:00 2022-01-08 06:00:00 9.0 0.2540000563879943
5 2022-01-16 05:00:00 2022-01-16 20:00:00 15.0 0.5080001127760454
6 2022-01-19 04:00:00 2022-01-19 17:00:00 13.0 0.7620001691640255
7 2022-01-21 14:00:00 2022-01-22 00:00:00 10.0 1.5240003383280751
8 2022-01-27 02:00:00 2022-01-27 16:00:00 14.0 3.0480006766561503
9 2022-02-01 12:00:00 2022-02-01 21:00:00 9.0 0.2540000563880126
10 2022-02-03 05:00:00 2022-02-03 15:00:00 10.0 0.5080001127760251
What I want to do is have a plot with time on the x axis, and the value of the 'br_open_total' on the y axis.
I can draw what I want it to look like, see below:
I apologise for the simplicity of the drawing, but I think it explains what I want to do.
How do I do this, and then repeat for other dataframes on the same plot.
I have tried staircase, matplotlib.pyplot.stair and others with no success.
It seems such a simple concept!
Edit 1:
Tried Joswin K J's answer with the actual data, and got this:
The event at 02-12 11:00 should be 112 hours duration, but the bar is the same width as all the others.
Edit2:
Tried Mozway's answer and got this:
Still doesn't show width of each event, and doesn't discretise the events either
Edit 3:
Using Mozway's amended answer I get this plot for the actual data:
I have added the cursor position using Paint. At the top right of the plot you can see that the cursor is at 2022-02-09 and 20.34, which is actually the value for 2022-02-01, so it seems that the plot is shifted to the left by one data point. Also, the large block between 2022-03-01 and 2022-04-03 doesn't seem to be in the data.
edit 4: as requested by Mozway
Reshaped Data
duration br_open_total variable date
0 10.0 1.0160002255520624 start_time 2022-01-02 00:00:00
19 10.0 0.0 end_time 2022-01-02 10:00:00
1 9.0 0.7620001691640113 start_time 2022-01-02 17:00:00
2 34.0 10.668002368296513 start_time 2022-01-03 02:00:00
21 34.0 0.0 end_time 2022-01-04 12:00:00
3 15.0 0.5080001127760454 start_time 2022-01-16 05:00:00
22 15.0 0.0 end_time 2022-01-16 20:00:00
4 13.0 0.7620001691640255 start_time 2022-01-19 04:00:00
23 13.0 0.0 end_time 2022-01-19 17:00:00
5 10.0 1.5240003383280751 start_time 2022-01-21 14:00:00
24 10.0 0.0 end_time 2022-01-22 00:00:00
6 14.0 3.0480006766561503 start_time 2022-01-27 02:00:00
25 14.0 0.0 end_time 2022-01-27 16:00:00
7 10.0 0.5080001127760251 start_time 2022-02-03 05:00:00
26 10.0 0.0 end_time 2022-02-03 15:00:00
8 18.0 7.366001635252363 start_time 2022-02-03 23:00:00
27 18.0 0.0 end_time 2022-02-04 17:00:00
9 13.0 2.28600050749211 start_time 2022-02-05 11:00:00
28 13.0 0.0 end_time 2022-02-06 00:00:00
10 19.0 2.2860005074921173 start_time 2022-02-06 04:00:00
29 19.0 0.0 end_time 2022-02-06 23:00:00
11 13.0 1.2700002819400584 start_time 2022-02-07 11:00:00
30 13.0 0.0 end_time 2022-02-08 00:00:00
12 12.0 2.79400062026814 start_time 2022-02-09 01:00:00
31 12.0 0.0 end_time 2022-02-09 13:00:00
13 112.0 20.320004511041 start_time 2022-02-12 11:00:00
32 112.0 0.0 end_time 2022-02-17 03:00:00
14 28.0 2.0320004511041034 start_time 2022-02-18 14:00:00
33 28.0 0.0 end_time 2022-02-19 18:00:00
15 17.0 17.272003834384847 start_time 2022-02-23 17:00:00
34 17.0 0.0 end_time 2022-02-24 10:00:00
16 9.0 0.7620001691640397 start_time 2022-02-27 13:00:00
35 9.0 0.0 end_time 2022-02-27 22:00:00
17 18.0 4.0640009022082 start_time 2022-04-04 00:00:00
36 18.0 0.0 end_time 2022-04-04 18:00:00
18 15.0 1.0160002255520482 start_time 2022-04-06 05:00:00
37 15.0 0.0 end_time 2022-04-06 20:00:00
when plotted using
plt.step(bdf2['date'], bdf2['br_open_total'])
plt.gcf().set_size_inches(10, 4)
plt.xticks(rotation=90)
produces the plot shown above, in which the top left corner of a block corresponds to the previous data point.
edit 5: further info
When I plot all my dataframes (different sensors) I get the same differential on the event start and end times?
You can use a step plot:
# ensure datetime
df['start_time'] = pd.to_datetime(df['start_time'])
df['end_time'] = pd.to_datetime(df['end_time'])
# reshape the data
df2 = (df
.melt(id_vars=['duration', 'br_open_total'], value_name='date')
.sort_values(by='date')
.drop_duplicates(subset='date')
.assign(br_open_total=lambda d: d['br_open_total'].mask(d['variable'].eq('end_time'), 0))
)
# plot
import matplotlib.pyplot as plt
plt.step(df2['date'], df2['br_open_total'])
plt.gcf().set_size_inches(10, 4)
output:
reshaped data:
duration br_open_total variable date
0 9.0 0.254000 start_time 2022-01-01 10:00:00
11 9.0 0.000000 end_time 2022-01-01 19:00:00
1 10.0 1.016000 start_time 2022-01-02 00:00:00
12 10.0 0.000000 end_time 2022-01-02 10:00:00
2 9.0 0.762000 start_time 2022-01-02 17:00:00
3 34.0 10.668002 start_time 2022-01-03 02:00:00
14 34.0 0.000000 end_time 2022-01-04 12:00:00
4 9.0 0.254000 start_time 2022-01-07 21:00:00
15 9.0 0.000000 end_time 2022-01-08 06:00:00
5 15.0 0.508000 start_time 2022-01-16 05:00:00
16 15.0 0.000000 end_time 2022-01-16 20:00:00
6 13.0 0.762000 start_time 2022-01-19 04:00:00
17 13.0 0.000000 end_time 2022-01-19 17:00:00
7 10.0 1.524000 start_time 2022-01-21 14:00:00
18 10.0 0.000000 end_time 2022-01-22 00:00:00
8 14.0 3.048001 start_time 2022-01-27 02:00:00
19 14.0 0.000000 end_time 2022-01-27 16:00:00
9 9.0 0.254000 start_time 2022-02-01 12:00:00
20 9.0 0.000000 end_time 2022-02-01 21:00:00
10 10.0 0.508000 start_time 2022-02-03 05:00:00
21 10.0 0.000000 end_time 2022-02-03 15:00:00
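The left shift reported in the question's edits is consistent with the default of plt.step, which is where='pre': each value is drawn on the interval ending at its x position. A minimal sketch with where='post', so that each value is held from its own timestamp until the next one (the two events are copied from the sample data; the names `events` and `steps` are only for illustration, and the Agg backend is selected just so the sketch runs without a display):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, only needed for headless runs
import matplotlib.pyplot as plt
import pandas as pd

events = pd.DataFrame({
    "start_time": pd.to_datetime(["2022-01-01 10:00", "2022-01-02 00:00"]),
    "end_time": pd.to_datetime(["2022-01-01 19:00", "2022-01-02 10:00"]),
    "br_open_total": [0.254, 1.016],
})

# One row per start/end timestamp; the value drops to 0 at each end_time
steps = (
    events
    .melt(id_vars=["br_open_total"], value_name="date")
    .sort_values(by="date")
    .assign(br_open_total=lambda d: d["br_open_total"].mask(d["variable"].eq("end_time"), 0))
)

fig, ax = plt.subplots(figsize=(10, 4))
# where='post': the value at x[i] is held over [x[i], x[i+1])
(line,) = ax.step(steps["date"], steps["br_open_total"], where="post")
ax.tick_params(axis="x", rotation=90)
```

With where='post' the last point has no interval after it, which is harmless here because the final value is the 0 written at the last end_time.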
Try this:
import matplotlib.pyplot as plt
for ind,row in df.iterrows():
    plt.plot(pd.Series([row['start_time'],row['end_time']]),pd.Series([row['br_open_total'],row['br_open_total']]),color='b')
    plt.plot(pd.Series([row['start_time'],row['start_time']]),pd.Series([0,row['br_open_total']]),color='b')
    plt.plot(pd.Series([row['end_time'],row['end_time']]),pd.Series([0,row['br_open_total']]),color='b')
plt.xticks(rotation=90)
Result:
I believe I have now cracked it, with a great debt of thanks to @Mozway.
The code to restructure the dataframe for plotting:
#create dataframes of each open gauge events removing any event with an open total of less than 0.254mm
#bresser/open
bdftdf=bdf.loc[bdf['br_open_total'] > 0.255]
bdftdf=bdftdf.copy()
bdftdf['start_time'] = pd.to_datetime(bdftdf['start_time'])
bdftdf['end_time'] = pd.to_datetime(bdftdf['end_time'])
bdf2 = (bdftdf
.melt(id_vars=['duration', 'ic_total','mc_total','md_total','imd_total','oak_total','highpoint_total','school_total','br_open_total',
'fr_gauge_total','open_mean_total','br_open_ic_%_int','br_open_mc_%_int','br_open_md_%_int','br_open_imd_%_int',
'br_open_oak_%_int'], value_name='date')
.sort_values(by='date')
#.drop_duplicates(subset='date')
.assign(br_open_total=lambda d: d['br_open_total'].mask(d['variable'].eq('end_time'), 0))
)
#create array for stairs plot
bdfarr=np.array(bdf2['date'])
bl=len(bdf2)
bdfarr=np.append(bdfarr,[bdfarr[bl-1]+np.timedelta64(1,'h')])
Rather than use the plt.step plot as suggested by Mozway, I have used plt.stairs. plt.stairs needs one more edge than it has values, so I created an array from the 'date' column of the dataframe and appended an extra element equal to the last element plus 1 hour.
This means that the data now plots as I had intended it to:
code for plot:
fig1=plt.figure()
plt.stairs(bdf2['br_open_total'], bdfarr, label=r'Bresser\Open')
plt.stairs(frdf2['fr_gauge_total'], frdfarr, label='FR Gauge')
plt.stairs(hpdf2['highpoint_total'], hpdfarr, label='Highpoint')
plt.stairs(schdf2['school_total'], schdfarr, label='School')
plt.stairs(opmedf2['open_mean_total'], opmedfarr, label='Open mean')
plt.xticks(rotation=90)
plt.legend(title='Rain events', loc='best')
plt.show()
I'm trying to apply different conditions that depend on the date, specifically the month. For example, for January replace the values in TEMP that are above 45, for February those above 30, and so on. I already did that with the code below, but the problem is that the data from the previous month is replaced with NaN.
This is my code:
meses = ["01", "02"]
for i in var_vars:
    if i in dataframes2.columns.values:
        for j in range(len(meses)):
            test_prueba_mes = dataframes2[i].loc[dataframes2['fecha'].dt.month == int(meses[j])]
            test_prueba = test_prueba_mes[dataframes2[i]<dataframes.loc[i]["X"+meses[j]+".max"]]
            dataframes2["Prueba " + str(i)] = test_prueba
Output:
dataframes2.tail(5)
fecha TEMP_C_Avg RH_Avg Prueba TEMP_C_Avg Prueba RH_Avg
21 2020-01-01 22:00:00 46.0 103 NaN NaN
22 2020-01-01 23:00:00 29.0 103 NaN NaN
23 2020-01-02 00:00:00 31.0 3 NaN NaN
24 2020-01-02 12:00:00 31.0 2 NaN NaN
25 2020-02-01 10:00:00 29.0 5 29.0 5.0
My desired Output is:
Output:
fecha TEMP_C_Avg RH_Avg Prueba TEMP_C_Avg Prueba RH_Avg
21 2020-01-01 22:00:00 46.0 103 NaN NaN
22 2020-01-01 23:00:00 29.0 103 29.0 NaN
23 2020-01-02 00:00:00 31.0 3 31.0 3.0
24 2020-01-02 12:00:00 31.0 2 31.0 2.0
25 2020-02-01 10:00:00 29.0 5 29.0 5.0
Appreciate if anyone can help me.
Update: The ruleset for 6 months is jan 45, feb 30, mar 45, abr 10, may 15, jun 30
An example of the data:
fecha TEMP_C_Avg RH_Avg
25 2020-02-01 10:00:00 29.0 5
26 2020-02-01 11:00:00 32.0 105
27 2020-03-01 10:00:00 55.0 3
28 2020-03-01 11:00:00 40.0 5
29 2020-04-01 10:00:00 10.0 20
30 2020-04-01 11:00:00 5.0 15
31 2020-05-01 10:00:00 20.0 15
32 2020-05-01 11:00:00 5.0 106
33 2020-06-01 10:00:00 33.0 107
34 2020-06-01 11:00:00 20.0 20
Now that the rules are clear:
The monthly limits are encoded in a dict, limits.
numpy select() is used: when a condition matches, it takes the value at the same position in the second parameter, and it falls back to the third parameter when no condition matches.
The conditions are built dynamically from the limits dict.
The second parameter needs to be the same length as the conditions list, so a list of np.nan is built with a list comprehension to get the correct length.
To cover all columns, a dict comprehension builds the **kwargs passed to assign().
import io
import numpy as np
import pandas as pd
df = pd.read_csv(io.StringIO(""" fecha TEMP_C_Avg RH_Avg
25 2020-02-01 10:00:00 29.0 5
26 2020-02-01 11:00:00 32.0 105
27 2020-03-01 10:00:00 55.0 3
28 2020-03-01 11:00:00 40.0 5
29 2020-04-01 10:00:00 10.0 20
30 2020-04-01 11:00:00 5.0 15
31 2020-05-01 10:00:00 20.0 15
32 2020-05-01 11:00:00 5.0 106
33 2020-06-01 10:00:00 33.0 107
34 2020-06-01 11:00:00 20.0 20"""), sep=r"\s\s+", engine="python")
df.fecha = pd.to_datetime(df.fecha)
# The ruleset for 6 months is jan 45, feb 30, mar 45, abr 10, may 15, jun 30
limits = {1:45, 2:30, 3:45, 4:10, 5:15, 6:30}
df = df.assign(**{f"Prueba {c}":np.select( # construct target column name
# build a condition for each of the month limits
[df.fecha.dt.month.eq(m) & df[c].gt(l) for m,l in limits.items()],
[np.nan for m in limits.keys()], # NaN if beyond limit
df[c]) # keep value if within limits
for c in df.columns if "Avg" in c}) # do calc for all columns that have "Avg" in name
fecha TEMP_C_Avg RH_Avg Prueba TEMP_C_Avg Prueba RH_Avg
25 2020-02-01 10:00:00 29 5 29 5
26 2020-02-01 11:00:00 32 105 nan nan
27 2020-03-01 10:00:00 55 3 nan 3
28 2020-03-01 11:00:00 40 5 40 5
29 2020-04-01 10:00:00 10 20 10 nan
30 2020-04-01 11:00:00 5 15 5 nan
31 2020-05-01 10:00:00 20 15 nan 15
32 2020-05-01 11:00:00 5 106 5 nan
33 2020-06-01 10:00:00 33 107 nan nan
34 2020-06-01 11:00:00 20 20 20 20
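As an alternative to np.select, and not the approach used above, the month can be mapped to its limit once and the values compared against that per-row limit. A minimal sketch, assuming a single TEMP_C_Avg column and only the January and February limits from the question (the sample rows and the name `row_limit` are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "fecha": pd.to_datetime([
        "2020-01-01 22:00", "2020-01-01 23:00",
        "2020-02-01 10:00", "2020-02-01 11:00",
    ]),
    "TEMP_C_Avg": [46.0, 29.0, 29.0, 32.0],
})

limits = {1: 45, 2: 30}  # month -> maximum allowed value

# Look up each row's limit from its month
row_limit = df["fecha"].dt.month.map(limits)

# Keep values at or below the limit; everything else becomes NaN
df["Prueba TEMP_C_Avg"] = df["TEMP_C_Avg"].where(df["TEMP_C_Avg"] <= row_limit)

print(df)
```

One behavioural difference to be aware of: a month that is missing from limits maps to NaN, the comparison is then False, and the value is replaced by NaN, whereas the np.select version keeps such values.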
If I had some random data created on a one hour sample..
import pandas as pd
import numpy as np
from numpy.random import randint
np.random.seed(10) # added for reproducibility
rng = pd.date_range('10/9/2018 00:00', periods=1000, freq='1H')
df = pd.DataFrame({'Random_Number':randint(1, 100, 1000)}, index=rng)
I can use the groupby to break out each day:
for idx, day in df.groupby(df.index.date):
    print(day)
Is there a way to calculate the time difference, in hours, between the daily min and max values based on their timestamps? That is, for each day, record the daily min, the daily max and the time difference.
After some discussion (thanks @Erfan):
(df.Random_Number
.groupby(df.index.date)
.agg(['idxmin','idxmax'])
.diff(axis=1).iloc[:,1]
.div(pd.to_timedelta('1H'))
)
Output:
2018-10-09 -4.0
2018-10-10 -1.0
2018-10-11 -4.0
2018-10-12 12.0
2018-10-13 21.0
2018-10-14 6.0
2018-10-15 -6.0
2018-10-16 -18.0
2018-10-17 -8.0
2018-10-18 9.0
2018-10-19 -10.0
2018-10-20 3.0
2018-10-21 10.0
2018-10-22 2.0
2018-10-23 9.0
2018-10-24 2.0
2018-10-25 3.0
2018-10-26 2.0
2018-10-27 -22.0
2018-10-28 6.0
2018-10-29 -8.0
2018-10-30 -1.0
2018-10-31 -11.0
2018-11-01 19.0
2018-11-02 7.0
2018-11-03 4.0
2018-11-04 18.0
2018-11-05 -1.0
2018-11-06 15.0
2018-11-07 -14.0
2018-11-08 -16.0
2018-11-09 -2.0
2018-11-10 -7.0
2018-11-11 -14.0
2018-11-12 12.0
2018-11-13 -14.0
2018-11-14 2.0
2018-11-15 2.0
2018-11-16 6.0
2018-11-17 -7.0
2018-11-18 5.0
2018-11-19 9.0
Name: idxmax, dtype: float64
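The chain above returns only the time difference. If the daily min and max values should be recorded alongside it, one possible sketch is to aggregate all four quantities at once. The eight-point series below is a small made-up sample, not the random data from the question, and the column name hours_min_to_max is only illustrative:

```python
import pandas as pd

# Hypothetical sample: two days with four readings each
s = pd.Series(
    [5, 1, 9, 3, 7, 8, 2, 4],
    index=pd.to_datetime([
        "2018-10-09 00:00", "2018-10-09 06:00", "2018-10-09 12:00", "2018-10-09 18:00",
        "2018-10-10 00:00", "2018-10-10 06:00", "2018-10-10 12:00", "2018-10-10 18:00",
    ]),
    name="Random_Number",
)

# Per day: the extreme values and the timestamps at which they occur
daily = s.groupby(s.index.date).agg(["min", "max", "idxmin", "idxmax"])

# Positive when the max occurs after the min, negative otherwise
daily["hours_min_to_max"] = (daily["idxmax"] - daily["idxmin"]) / pd.Timedelta(hours=1)

print(daily)
```

As with idxmin and idxmax in general, ties are resolved by taking the first occurrence within the day.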
Alternatively, should you want to retain all columns with data frame output, consider merging on an aggregated dataset:
# ADJUST FOR DATETIME AND DATE AS COLUMNS
df = (df.reset_index()
.assign(date = df.index.date)
)
# AGGREGATION + MERGE ON MIN/MAX + CALCULATION
agg_df = (df.groupby('date')['Random_Number']
.agg(["min", "max"])
.reset_index()
.merge(df, left_on=['date', 'max'], right_on=['date', 'Random_Number'])
.merge(df, left_on=['date', 'min'], right_on=['date', 'Random_Number'],
suffixes=['', '_min'])
.assign(diff = lambda x: (x['index'] - x['index_min']) / pd.to_timedelta('1H'))
)
print(agg_df.head(10))
# date min max index Random_Number index_min Random_Number_min diff
# 0 2018-10-09 1 94 2018-10-09 05:00:00 94 2018-10-09 09:00:00 1 -4.0
# 1 2018-10-10 12 95 2018-10-10 20:00:00 95 2018-10-10 21:00:00 12 -1.0
# 2 2018-10-11 5 97 2018-10-11 15:00:00 97 2018-10-11 19:00:00 5 -4.0
# 3 2018-10-12 7 98 2018-10-12 18:00:00 98 2018-10-12 06:00:00 7 12.0
# 4 2018-10-13 1 91 2018-10-13 22:00:00 91 2018-10-13 01:00:00 1 21.0
# 5 2018-10-14 1 97 2018-10-14 10:00:00 97 2018-10-14 04:00:00 1 6.0
# 6 2018-10-15 9 97 2018-10-15 06:00:00 97 2018-10-15 12:00:00 9 -6.0
# 7 2018-10-16 3 95 2018-10-16 04:00:00 95 2018-10-16 22:00:00 3 -18.0
# 8 2018-10-17 2 95 2018-10-17 13:00:00 95 2018-10-17 21:00:00 2 -8.0
# 9 2018-10-18 1 91 2018-10-18 21:00:00 91 2018-10-18 12:00:00 1 9.0
I have a dataframe which has 4 columns: day, time, tmin and tmax. tmin shows the temperature_min of the day and tmax shows the temperature_max.
What I want is to be able to fill all of the NaN values of one day with tmin and tmax of that day. For example I want to convert this dataframe:
day time tmin tmax
0 01 00:00:00 NaN NaN
1 01 03:00:00 -6.8 NaN
2 01 06:00:00 NaN NaN
3 01 09:00:00 NaN NaN
4 01 12:00:00 NaN NaN
5 01 15:00:00 NaN 1.2
6 01 18:00:00 NaN NaN
7 01 21:00:00 NaN NaN
8 02 00:00:00 NaN NaN
9 02 03:00:00 -7.2 NaN
10 02 06:00:00 NaN NaN
11 02 09:00:00 NaN NaN
12 02 12:00:00 NaN NaN
13 02 15:00:00 NaN 1.8
14 02 18:00:00 NaN NaN
15 02 21:00:00 NaN NaN
to this dataframe:
day time tmin tmax
0 01 00:00:00 -6.8 1.2
1 01 03:00:00 -6.8 1.2
2 01 06:00:00 -6.8 1.2
3 01 09:00:00 -6.8 1.2
4 01 12:00:00 -6.8 1.2
5 01 15:00:00 -6.8 1.2
6 01 18:00:00 -6.8 1.2
7 01 21:00:00 -6.8 1.2
8 02 00:00:00 -7.2 1.8
9 02 03:00:00 -7.2 1.8
10 02 06:00:00 -7.2 1.8
11 02 09:00:00 -7.2 1.8
12 02 12:00:00 -7.2 1.8
13 02 15:00:00 -7.2 1.8
14 02 18:00:00 -7.2 1.8
15 02 21:00:00 -7.2 1.8
Using groupby and transform:
df.assign(**df.groupby('day')[['tmin', 'tmax']].transform('first'))
day time tmin tmax
0 1 00:00:00 -6.8 1.2
1 1 03:00:00 -6.8 1.2
2 1 06:00:00 -6.8 1.2
3 1 09:00:00 -6.8 1.2
4 1 12:00:00 -6.8 1.2
5 1 15:00:00 -6.8 1.2
6 1 18:00:00 -6.8 1.2
7 1 21:00:00 -6.8 1.2
8 2 00:00:00 -7.2 1.8
9 2 03:00:00 -7.2 1.8
10 2 06:00:00 -7.2 1.8
11 2 09:00:00 -7.2 1.8
12 2 12:00:00 -7.2 1.8
13 2 15:00:00 -7.2 1.8
14 2 18:00:00 -7.2 1.8
15 2 21:00:00 -7.2 1.8
Or, if you'd like to modify the original DataFrame instead of returning a copy:
df[['tmin', 'tmax']] = df.groupby('day')[['tmin', 'tmax']].transform('first')
If you want to do it not as neatly as @user3483203 has done!
import pandas as pd
mydata = pd.read_csv('temperature.txt', sep=' ')  # read_csv already returns a DataFrame
for i in mydata['day'].unique():
    row_start = (i - 1) * 8 # assuming 8 data points per day
    row_end = i * 8
    # .loc slices include the end label, hence row_end - 1 (assumes the default 0..n-1 index)
    # min() and max() skip NaN by default
    mydata.loc[row_start:row_end - 1, 'tmin'] = mydata.loc[row_start:row_end - 1, 'tmin'].min()
    mydata.loc[row_start:row_end - 1, 'tmax'] = mydata.loc[row_start:row_end - 1, 'tmax'].max()
Since you did not post any code, here's a general solution:
Step 1: Create variables that will keep track of the min and max temps
Step 2: Loop through each row in the frame
Step 3: For each row, check if the min or max == "NaN"
Step 4: If it is, replace with the value of the min or max variable we created earlier
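A rough sketch of these steps follows. Because the known value can appear after the first NaN row of a day, it uses two passes and tracks the values per day, which is an adaptation of the steps rather than a literal transcription. The six-row frame and the name `known` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: one known tmin and one known tmax per day
df = pd.DataFrame({
    "day": ["01", "01", "01", "02", "02", "02"],
    "tmin": [np.nan, -6.8, np.nan, np.nan, -7.2, np.nan],
    "tmax": [np.nan, np.nan, 1.2, np.nan, np.nan, 1.8],
})

# Pass 1: remember the known tmin/tmax for each day
known = {}
for row in df.itertuples(index=False):
    entry = known.setdefault(row.day, {})
    if not pd.isna(row.tmin):
        entry["tmin"] = row.tmin
    if not pd.isna(row.tmax):
        entry["tmax"] = row.tmax

# Pass 2: write the remembered value into every row of that day
for col in ("tmin", "tmax"):
    df[col] = [known[day].get(col, np.nan) for day in df["day"]]

print(df)
```

A day with no recorded value at all simply stays NaN.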
Just forward fill and back fill within each day, so that values never leak from one day into the next:
df['tmin'] = df.groupby('day')['tmin'].transform(lambda s: s.ffill().bfill())
df['tmax'] = df.groupby('day')['tmax'].transform(lambda s: s.ffill().bfill())