Plotting events on a line graph - python

I am trying to visualise rain events using data contained in a dataframe.
The idea seems very simple, but the execution seems to be impossible!
Here is part of the dataframe:
start_time end_time duration br_open_total
0 2022-01-01 10:00:00 2022-01-01 19:00:00 9.0 0.2540000563879943
1 2022-01-02 00:00:00 2022-01-02 10:00:00 10.0 1.0160002255520624
2 2022-01-02 17:00:00 2022-01-03 02:00:00 9.0 0.7620001691640113
3 2022-01-03 02:00:00 2022-01-04 12:00:00 34.0 10.668002368296513
4 2022-01-07 21:00:00 2022-01-08 06:00:00 9.0 0.2540000563879943
5 2022-01-16 05:00:00 2022-01-16 20:00:00 15.0 0.5080001127760454
6 2022-01-19 04:00:00 2022-01-19 17:00:00 13.0 0.7620001691640255
7 2022-01-21 14:00:00 2022-01-22 00:00:00 10.0 1.5240003383280751
8 2022-01-27 02:00:00 2022-01-27 16:00:00 14.0 3.0480006766561503
9 2022-02-01 12:00:00 2022-02-01 21:00:00 9.0 0.2540000563880126
10 2022-02-03 05:00:00 2022-02-03 15:00:00 10.0 0.5080001127760251
What I want to do is have a plot with time on the x axis, and the value of the 'br_open_total' on the y axis.
I can draw what I want it to look like, see below:
I apologise for the simplicity of the drawing, but I think it explains what I want to do.
How do I do this, and then repeat for other dataframes on the same plot?
I have tried staircase, matplotlib.pyplot.stairs and others with no success.
It seems such a simple concept!
Edit 1:
Tried Joswin K J's answer with the actual data, and got this:
The event at 02-12 11:00 should be 112 hours in duration, but the bar is the same width as all the others.
Edit2:
Tried Mozway's answer and got this:
It still doesn't show the width of each event, and doesn't discretise the events either.
Edit 3:
Using Mozway's amended answer I get this plot for the actual data:
I have added the cursor position using Paint. At the top right of the plot you can see that the cursor is at 2022-02-09 and 20.34, which is actually the value for 2022-02-01, so it seems that the plot is shifted to the left by one data point. Also, the large block between 2022-03-01 and 2022-04-03 doesn't seem to be in the data.
Edit 4: as requested by Mozway
Reshaped Data
duration br_open_total variable date
0 10.0 1.0160002255520624 start_time 2022-01-02 00:00:00
19 10.0 0.0 end_time 2022-01-02 10:00:00
1 9.0 0.7620001691640113 start_time 2022-01-02 17:00:00
2 34.0 10.668002368296513 start_time 2022-01-03 02:00:00
21 34.0 0.0 end_time 2022-01-04 12:00:00
3 15.0 0.5080001127760454 start_time 2022-01-16 05:00:00
22 15.0 0.0 end_time 2022-01-16 20:00:00
4 13.0 0.7620001691640255 start_time 2022-01-19 04:00:00
23 13.0 0.0 end_time 2022-01-19 17:00:00
5 10.0 1.5240003383280751 start_time 2022-01-21 14:00:00
24 10.0 0.0 end_time 2022-01-22 00:00:00
6 14.0 3.0480006766561503 start_time 2022-01-27 02:00:00
25 14.0 0.0 end_time 2022-01-27 16:00:00
7 10.0 0.5080001127760251 start_time 2022-02-03 05:00:00
26 10.0 0.0 end_time 2022-02-03 15:00:00
8 18.0 7.366001635252363 start_time 2022-02-03 23:00:00
27 18.0 0.0 end_time 2022-02-04 17:00:00
9 13.0 2.28600050749211 start_time 2022-02-05 11:00:00
28 13.0 0.0 end_time 2022-02-06 00:00:00
10 19.0 2.2860005074921173 start_time 2022-02-06 04:00:00
29 19.0 0.0 end_time 2022-02-06 23:00:00
11 13.0 1.2700002819400584 start_time 2022-02-07 11:00:00
30 13.0 0.0 end_time 2022-02-08 00:00:00
12 12.0 2.79400062026814 start_time 2022-02-09 01:00:00
31 12.0 0.0 end_time 2022-02-09 13:00:00
13 112.0 20.320004511041 start_time 2022-02-12 11:00:00
32 112.0 0.0 end_time 2022-02-17 03:00:00
14 28.0 2.0320004511041034 start_time 2022-02-18 14:00:00
33 28.0 0.0 end_time 2022-02-19 18:00:00
15 17.0 17.272003834384847 start_time 2022-02-23 17:00:00
34 17.0 0.0 end_time 2022-02-24 10:00:00
16 9.0 0.7620001691640397 start_time 2022-02-27 13:00:00
35 9.0 0.0 end_time 2022-02-27 22:00:00
17 18.0 4.0640009022082 start_time 2022-04-04 00:00:00
36 18.0 0.0 end_time 2022-04-04 18:00:00
18 15.0 1.0160002255520482 start_time 2022-04-06 05:00:00
37 15.0 0.0 end_time 2022-04-06 20:00:00
When plotted using
plt.step(bdf2['date'], bdf2['br_open_total'])
plt.gcf().set_size_inches(10, 4)
plt.xticks(rotation=90)
this produces the plot shown above, in which the top-left corner of each block corresponds to the previous data point.
Edit 5: further info
When I plot all my dataframes (different sensors) I get the same shift in the event start and end times.

You can use a step plot:
import pandas as pd
import matplotlib.pyplot as plt

# ensure datetime
df['start_time'] = pd.to_datetime(df['start_time'])
df['end_time'] = pd.to_datetime(df['end_time'])
# reshape the data: one row per start/end timestamp, with the value
# zeroed on end rows so the step drops back to the baseline
df2 = (df
       .melt(id_vars=['duration', 'br_open_total'], value_name='date')
       .sort_values(by='date')
       .drop_duplicates(subset='date')
       .assign(br_open_total=lambda d: d['br_open_total'].mask(d['variable'].eq('end_time'), 0))
      )
# plot
plt.step(df2['date'], df2['br_open_total'])
plt.gcf().set_size_inches(10, 4)
output:
reshaped data:
duration br_open_total variable date
0 9.0 0.254000 start_time 2022-01-01 10:00:00
11 9.0 0.000000 end_time 2022-01-01 19:00:00
1 10.0 1.016000 start_time 2022-01-02 00:00:00
12 10.0 0.000000 end_time 2022-01-02 10:00:00
2 9.0 0.762000 start_time 2022-01-02 17:00:00
3 34.0 10.668002 start_time 2022-01-03 02:00:00
14 34.0 0.000000 end_time 2022-01-04 12:00:00
4 9.0 0.254000 start_time 2022-01-07 21:00:00
15 9.0 0.000000 end_time 2022-01-08 06:00:00
5 15.0 0.508000 start_time 2022-01-16 05:00:00
16 15.0 0.000000 end_time 2022-01-16 20:00:00
6 13.0 0.762000 start_time 2022-01-19 04:00:00
17 13.0 0.000000 end_time 2022-01-19 17:00:00
7 10.0 1.524000 start_time 2022-01-21 14:00:00
18 10.0 0.000000 end_time 2022-01-22 00:00:00
8 14.0 3.048001 start_time 2022-01-27 02:00:00
19 14.0 0.000000 end_time 2022-01-27 16:00:00
9 9.0 0.254000 start_time 2022-02-01 12:00:00
20 9.0 0.000000 end_time 2022-02-01 21:00:00
10 10.0 0.508000 start_time 2022-02-03 05:00:00
21 10.0 0.000000 end_time 2022-02-03 15:00:00
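A note on the one-point left shift reported in Edit 3: plt.step defaults to where='pre', which draws each level ending at its x value. A minimal sketch, assuming the reshaped df2 above, that instead holds each value from its own timestamp until the next:
import matplotlib.pyplot as plt

# where='post' holds each y value from its own x until the next x,
# so an event's level starts at start_time and drops at end_time
plt.step(df2['date'], df2['br_open_total'], where='post')
plt.gcf().set_size_inches(10, 4)
plt.xticks(rotation=90)
plt.show()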

Try this:
import pandas as pd
import matplotlib.pyplot as plt

for ind, row in df.iterrows():
    # top of the box: horizontal line from start_time to end_time
    plt.plot(pd.Series([row['start_time'], row['end_time']]),
             pd.Series([row['br_open_total'], row['br_open_total']]), color='b')
    # left edge: vertical line up at start_time
    plt.plot(pd.Series([row['start_time'], row['start_time']]),
             pd.Series([0, row['br_open_total']]), color='b')
    # right edge: vertical line down at end_time
    plt.plot(pd.Series([row['end_time'], row['end_time']]),
             pd.Series([0, row['br_open_total']]), color='b')
plt.xticks(rotation=90)
Result:
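Since each event is drawn from its own start_time to end_time, the box widths here do reflect the true event durations. The three plt.plot calls per event could also be collapsed into a single trace per event, for example:
import matplotlib.pyplot as plt

for ind, row in df.iterrows():
    # up at start_time, across the top, and back down at end_time in one call
    plt.plot([row['start_time'], row['start_time'], row['end_time'], row['end_time']],
             [0, row['br_open_total'], row['br_open_total'], 0],
             color='b')
plt.xticks(rotation=90)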

I believe I have now cracked it, with a great debt of thanks to @Mozway.
The code to restructure the dataframe for plotting:
import numpy as np
import pandas as pd

# create dataframes of the open gauge events, removing any event with an open total of less than 0.254 mm
# bresser/open
bdftdf = bdf.loc[bdf['br_open_total'] > 0.255]
bdftdf = bdftdf.copy()
bdftdf['start_time'] = pd.to_datetime(bdftdf['start_time'])
bdftdf['end_time'] = pd.to_datetime(bdftdf['end_time'])
bdf2 = (bdftdf
        .melt(id_vars=['duration', 'ic_total', 'mc_total', 'md_total', 'imd_total', 'oak_total',
                       'highpoint_total', 'school_total', 'br_open_total', 'fr_gauge_total',
                       'open_mean_total', 'br_open_ic_%_int', 'br_open_mc_%_int', 'br_open_md_%_int',
                       'br_open_imd_%_int', 'br_open_oak_%_int'], value_name='date')
        .sort_values(by='date')
        #.drop_duplicates(subset='date')
        .assign(br_open_total=lambda d: d['br_open_total'].mask(d['variable'].eq('end_time'), 0))
       )
# create the edges array for the stairs plot: the 'date' column plus one
# extra edge an hour after the last timestamp
bdfarr = np.array(bdf2['date'])
bl = len(bdf2)
bdfarr = np.append(bdfarr, [bdfarr[bl - 1] + np.timedelta64(1, 'h')])
Rather than use the plt.step plot as suggested by Mozway, I have used plt.stairs, after creating an array of the 'date' column in the dataframe and appending an extra element to that array equal to the last element plus one hour.
This means that the data now plots as I had intended it to:
code for plot:
fig1=plt.figure()
plt.stairs(bdf2['br_open_total'], bdfarr, label='Bresser/Open')
plt.stairs(frdf2['fr_gauge_total'], frdfarr, label='FR Gauge')
plt.stairs(hpdf2['highpoint_total'], hpdfarr, label='Highpoint')
plt.stairs(schdf2['school_total'], schdfarr, label='School')
plt.stairs(opmedf2['open_mean_total'], opmedfarr, label='Open mean')
plt.xticks(rotation=90)
plt.legend(title='Rain events', loc='best')
plt.show()
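For reference, plt.stairs(values, edges) expects len(edges) == len(values) + 1, which is why the extra edge one hour after the last timestamp is appended above. A minimal self-contained sketch of the same idea, using made-up event times:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# hypothetical alternating start/end timestamps, as in the reshaped frame
edges = np.array(pd.to_datetime(['2022-01-01 10:00', '2022-01-01 19:00',
                                 '2022-01-02 00:00', '2022-01-02 10:00']))
edges = np.append(edges, edges[-1] + np.timedelta64(1, 'h'))
# one value per interval between edges: event level, then zero, and so on
values = [0.254, 0.0, 1.016, 0.0]

plt.stairs(values, edges, label='example gauge')
plt.xticks(rotation=90)
plt.legend(title='Rain events', loc='best')
plt.show()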

Related

Date time conversion to pandas datetime64[updated]

I have a series of 40 years of data in the format stn;yyyymmddhh;rainfall. I want to convert the data into datetime64 format. When I convert it to datetime with the code below, I get the format pandas._libs.tslibs.timestamps.Timestamp, but I need it to be in pandas datetime format. Basically, I want to convert, for example, 1981010100 (which is numpy.int64) into datetime64.
import pandas as pd
df = pd.read_csv('data.txt', delimiter = ";")
df['yyyy'] = df['yyyymmddhh'].astype(str).str[:4]
df = pd.to_datetime(data.yyyy, format='%Y-%m-%d')
Stn;yyyymmddhh;rainfall
xyz;1981010100;0.0
xyz;1981010101;0.0
xyz;1981010102;0.0
xyz;1981010103;0.0
xyz;1981010104;0.0
xyz;1981010105;0.0
xyz;1981010106;0.0
xyz;1981010107;0.0
xyz;1981010108;0.0
xyz;1981010109;0.4
xyz;1981010110;0.6
xyz;1981010111;0.1
xyz;1981010112;0.1
xyz;1981010113;0.0
xyz;1981010114;0.1
xyz;1981010115;0.6
xyz;1981010116;0.0
xyz;1981010117;0.0
xyz;1981010118;0.2
xyz;1981010119;0.0
xyz;1981010120;0.0
xyz;1981010121;0.0
xyz;1981010122;0.0
xyz;1981010123;0.0
xyz;1981010200;0.0
You can use pd.to_datetime() together with the format= parameter, as follows:
df['yyyymmddhh'] = pd.to_datetime(df['yyyymmddhh'], format='%Y%m%d%H')
Output:
print(df)
Stn yyyymmddhh rainfall
0 xyz 1981-01-01 00:00:00 0.0
1 xyz 1981-01-01 01:00:00 0.0
2 xyz 1981-01-01 02:00:00 0.0
3 xyz 1981-01-01 03:00:00 0.0
4 xyz 1981-01-01 04:00:00 0.0
5 xyz 1981-01-01 05:00:00 0.0
6 xyz 1981-01-01 06:00:00 0.0
7 xyz 1981-01-01 07:00:00 0.0
8 xyz 1981-01-01 08:00:00 0.0
9 xyz 1981-01-01 09:00:00 0.4
10 xyz 1981-01-01 10:00:00 0.6
11 xyz 1981-01-01 11:00:00 0.1
12 xyz 1981-01-01 12:00:00 0.1
13 xyz 1981-01-01 13:00:00 0.0
14 xyz 1981-01-01 14:00:00 0.1
15 xyz 1981-01-01 15:00:00 0.6
16 xyz 1981-01-01 16:00:00 0.0
17 xyz 1981-01-01 17:00:00 0.0
18 xyz 1981-01-01 18:00:00 0.2
19 xyz 1981-01-01 19:00:00 0.0
20 xyz 1981-01-01 20:00:00 0.0
21 xyz 1981-01-01 21:00:00 0.0
22 xyz 1981-01-01 22:00:00 0.0
23 xyz 1981-01-01 23:00:00 0.0
24 xyz 1981-01-02 00:00:00 0.0
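A note on the dtype confusion in the question: after this conversion the column dtype is datetime64[ns], which is pandas' datetime format; accessing a single element always returns a Timestamp scalar, so seeing Timestamp on element access does not mean the column is the wrong type. A quick check:
print(df['yyyymmddhh'].dtype)          # datetime64[ns]
print(type(df['yyyymmddhh'].iloc[0]))  # pandas Timestamp scalar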
I believe this should fit the bill for you.
import pandas as pd
df = pd.read_csv('data.txt', delimiter = ";")
df['date'] = pd.to_datetime(df['yyyymmddhh'], format='%Y%m%d%H')
df['formatted'] = pd.to_datetime(df['date'].dt.strftime('%Y-%m-%d %H:%M:%S'))

pandas calculate time series resampling

If I had some random data created on a one-hour sample...
import pandas as pd
import numpy as np
from numpy.random import randint
np.random.seed(10) # added for reproducibility
rng = pd.date_range('10/9/2018 00:00', periods=1000, freq='1H')
df = pd.DataFrame({'Random_Number':randint(1, 100, 1000)}, index=rng)
I can use the groupby to break out each day:
for idx, day in df.groupby(df.index.date):
print(day)
Now, is there a way to calculate the time difference between the daily min & max values based on their timestamps, in hours, and for each day record the daily min, max, and time difference?
After some discussion (thanks @Erfan):
(df.Random_Number
   .groupby(df.index.date)
   .agg(['idxmin', 'idxmax'])
   .diff(axis=1).iloc[:, 1]
   .div(pd.to_timedelta('1H'))
)
Output:
2018-10-09 -4.0
2018-10-10 -1.0
2018-10-11 -4.0
2018-10-12 12.0
2018-10-13 21.0
2018-10-14 6.0
2018-10-15 -6.0
2018-10-16 -18.0
2018-10-17 -8.0
2018-10-18 9.0
2018-10-19 -10.0
2018-10-20 3.0
2018-10-21 10.0
2018-10-22 2.0
2018-10-23 9.0
2018-10-24 2.0
2018-10-25 3.0
2018-10-26 2.0
2018-10-27 -22.0
2018-10-28 6.0
2018-10-29 -8.0
2018-10-30 -1.0
2018-10-31 -11.0
2018-11-01 19.0
2018-11-02 7.0
2018-11-03 4.0
2018-11-04 18.0
2018-11-05 -1.0
2018-11-06 15.0
2018-11-07 -14.0
2018-11-08 -16.0
2018-11-09 -2.0
2018-11-10 -7.0
2018-11-11 -14.0
2018-11-12 12.0
2018-11-13 -14.0
2018-11-14 2.0
2018-11-15 2.0
2018-11-16 6.0
2018-11-17 -7.0
2018-11-18 5.0
2018-11-19 9.0
Name: idxmax, dtype: float64
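To unpack the chain: idxmin/idxmax on the datetime-indexed series return the timestamps at which each day's min and max occur, diff(axis=1) subtracts the two timestamp columns to give a Timedelta, and dividing by one hour converts that to a float. A sketch of the intermediate step, assuming the seeded frame above:
import pandas as pd

# timestamps at which each day's min and max occur
idx = df.Random_Number.groupby(df.index.date).agg(['idxmin', 'idxmax'])
# subtracting the columns yields a Timedelta; dividing by 1 hour gives a float
hours = idx.diff(axis=1).iloc[:, 1].div(pd.to_timedelta('1H'))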
Alternatively, should you want to retain all columns with data frame output, consider merging on an aggregated dataset:
# ADJUST FOR DATETIME AND DATE AS COLUMNS
df = (df.reset_index()
        .assign(date=df.index.date)
     )

# AGGREGATION + MERGE ON MIN/MAX + CALCULATION
agg_df = (df.groupby('date')['Random_Number']
            .agg(["min", "max"])
            .reset_index()
            .merge(df, left_on=['date', 'max'], right_on=['date', 'Random_Number'])
            .merge(df, left_on=['date', 'min'], right_on=['date', 'Random_Number'],
                   suffixes=['', '_min'])
            .assign(diff=lambda x: (x['index'] - x['index_min']) / pd.to_timedelta('1H'))
         )
print(agg_df.head(10))
# date min max index Random_Number index_min Random_Number_min diff
# 0 2018-10-09 1 94 2018-10-09 05:00:00 94 2018-10-09 09:00:00 1 -4.0
# 1 2018-10-10 12 95 2018-10-10 20:00:00 95 2018-10-10 21:00:00 12 -1.0
# 2 2018-10-11 5 97 2018-10-11 15:00:00 97 2018-10-11 19:00:00 5 -4.0
# 3 2018-10-12 7 98 2018-10-12 18:00:00 98 2018-10-12 06:00:00 7 12.0
# 4 2018-10-13 1 91 2018-10-13 22:00:00 91 2018-10-13 01:00:00 1 21.0
# 5 2018-10-14 1 97 2018-10-14 10:00:00 97 2018-10-14 04:00:00 1 6.0
# 6 2018-10-15 9 97 2018-10-15 06:00:00 97 2018-10-15 12:00:00 9 -6.0
# 7 2018-10-16 3 95 2018-10-16 04:00:00 95 2018-10-16 22:00:00 3 -18.0
# 8 2018-10-17 2 95 2018-10-17 13:00:00 95 2018-10-17 21:00:00 2 -8.0
# 9 2018-10-18 1 91 2018-10-18 21:00:00 91 2018-10-18 12:00:00 1 9.0

Calculate sum of Column grouped by hour

I am trying to calculate the total cost of staffing requirements over a day. My attempt is to group the People required throughout the day and multiply by the cost. I then try to group this cost per hour, but my output isn't correct.
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as dates
d = {
    'Time' : ['0/1/1900 8:00:00','0/1/1900 9:59:00','0/1/1900 10:00:00','0/1/1900 12:29:00','0/1/1900 12:30:00','0/1/1900 13:00:00','0/1/1900 13:02:00','0/1/1900 13:15:00','0/1/1900 13:20:00','0/1/1900 18:10:00','0/1/1900 18:15:00','0/1/1900 18:20:00','0/1/1900 18:25:00','0/1/1900 18:45:00','0/1/1900 18:50:00','0/1/1900 19:05:00','0/1/1900 19:07:00','0/1/1900 21:57:00','0/1/1900 22:00:00','0/1/1900 22:30:00','0/1/1900 22:35:00','1/1/1900 3:00:00','1/1/1900 3:05:00','1/1/1900 3:20:00','1/1/1900 3:25:00'],
    'People' : [1,1,2,2,3,3,2,2,3,3,4,4,3,3,2,2,3,3,4,4,3,3,2,2,1],
}
df = pd.DataFrame(data=d)
# bump the zero-based day component up by one so pandas can parse the dates
df['Time'] = ['/'.join([str(int(x.split('/')[0]) + 1)] + x.split('/')[1:]) for x in df['Time']]
df['Time'] = pd.to_datetime(df['Time'], format='%d/%m/%Y %H:%M:%S')
formatter = dates.DateFormatter('%Y-%m-%d %H:%M:%S')
df = df.groupby(pd.Grouper(freq='15T',key='Time'))['People'].max().ffill()
df = df.reset_index(level=['Time'])
df['Cost'] = df['People'] * 26
cost = df.groupby([df['Time'].dt.hour])['Cost'].sum()
# For reference: this plot displays the people required throughout the day
fig, ax = plt.subplots(figsize = (10,5))
plt.plot(df['Time'], df['People'], color = 'blue')
plt.locator_params(axis='y', nbins=6)
ax.xaxis.set_major_formatter(formatter)
ax.xaxis.set_major_formatter(dates.DateFormatter('%H:%M:%S'))
plt.ylabel('People Required', labelpad = 10)
plt.xlabel('Time', labelpad = 10)
print(cost)
Out:
0 416.0
1 416.0
2 416.0
3 130.0
8 104.0
9 104.0
10 208.0
11 208.0
12 260.0
13 312.0
14 312.0
15 312.0
16 312.0
17 312.0
18 364.0
19 312.0
20 312.0
21 312.0
22 416.0
23 416.0
I have done the calculations manually and the total cost output should be:
$1456
I think the wrong numbers in your question are most likely caused by the incorrect datetime values that you have. Once you fix those, you should get the correct numbers. Here's an attempt from my end, with a little tweak to the Time column.
import pandas as pd
df = pd.DataFrame({
    'Time' : ['1/1/1900 8:00:00','1/1/1900 9:59:00','1/1/1900 10:00:00','1/1/1900 12:29:00','1/1/1900 12:30:00','1/1/1900 13:00:00','1/1/1900 13:02:00','1/1/1900 13:15:00','1/1/1900 13:20:00','1/1/1900 18:10:00','1/1/1900 18:15:00','1/1/1900 18:20:00','1/1/1900 18:25:00','1/1/1900 18:45:00','1/1/1900 18:50:00','1/1/1900 19:05:00','1/1/1900 19:07:00','1/1/1900 21:57:00','1/1/1900 22:00:00','1/1/1900 22:30:00','1/1/1900 22:35:00','1/2/1900 3:00:00','1/2/1900 3:05:00','1/2/1900 3:20:00','1/2/1900 3:25:00'],
    'People' : [1,1,2,2,3,3,2,2,3,3,4,4,3,3,2,2,3,3,4,4,3,3,2,2,1],
})
>>>df
Time People
0 1/1/1900 8:00:00 1
1 1/1/1900 9:59:00 1
2 1/1/1900 10:00:00 2
3 1/1/1900 12:29:00 2
4 1/1/1900 12:30:00 3
5 1/1/1900 13:00:00 3
6 1/1/1900 13:02:00 2
7 1/1/1900 13:15:00 2
8 1/1/1900 13:20:00 3
9 1/1/1900 18:10:00 3
10 1/1/1900 18:15:00 4
11 1/1/1900 18:20:00 4
12 1/1/1900 18:25:00 3
13 1/1/1900 18:45:00 3
14 1/1/1900 18:50:00 2
15 1/1/1900 19:05:00 2
16 1/1/1900 19:07:00 3
17 1/1/1900 21:57:00 3
18 1/1/1900 22:00:00 4
19 1/1/1900 22:30:00 4
20 1/1/1900 22:35:00 3
21 1/2/1900 3:00:00 3
22 1/2/1900 3:05:00 2
23 1/2/1900 3:20:00 2
24 1/2/1900 3:25:00 1
df.Time = pd.to_datetime(df.Time)
df.set_index('Time', inplace=True)
df_group = df.resample('15T').max().ffill()
df_hour = df_group.resample('1h').max()
df_hour['Cost'] = df_hour['People'] * 26
>>>df_hour
People Cost
Time
1900-01-01 08:00:00 1.0 26.0
1900-01-01 09:00:00 1.0 26.0
1900-01-01 10:00:00 2.0 52.0
1900-01-01 11:00:00 2.0 52.0
1900-01-01 12:00:00 3.0 78.0
1900-01-01 13:00:00 3.0 78.0
1900-01-01 14:00:00 3.0 78.0
1900-01-01 15:00:00 3.0 78.0
1900-01-01 16:00:00 3.0 78.0
1900-01-01 17:00:00 3.0 78.0
1900-01-01 18:00:00 4.0 104.0
1900-01-01 19:00:00 3.0 78.0
1900-01-01 20:00:00 3.0 78.0
1900-01-01 21:00:00 3.0 78.0
1900-01-01 22:00:00 4.0 104.0
1900-01-01 23:00:00 4.0 104.0
1900-01-02 00:00:00 4.0 104.0
1900-01-02 01:00:00 4.0 104.0
1900-01-02 02:00:00 4.0 104.0
1900-01-02 03:00:00 3.0 78.0
>>>df_hour.sum()
People 60.0
Cost 1560.0
dtype: float64
Edit: It took me a second reading to realize the methodology you're using. The incorrect number you got is likely due to grouping by sum() after you performed a ffill() on your aggregated People column. Since ffill() fills the holes from the last valid value, you actually overestimated your cost for those periods. You should use max() again to find the maximum headcount required for each hour.
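Applying that correction to the original approach, a minimal sketch, assuming the question's df with the fixed Time column: take each hour's peak headcount before costing, rather than summing the forward-filled slots.
# peak headcount in each 15-minute slot, forward-filled (as in the question)
levels = df.groupby(pd.Grouper(freq='15T', key='Time'))['People'].max().ffill()
# cost each hour at that hour's peak headcount, not the sum of its four slots
hourly_cost = levels.resample('1h').max() * 26
print(hourly_cost.sum())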

Pandas DataFrame Calculate time difference between 2 columns on specific time range

I want to calculate the time difference between two columns within a specific time range.
I tried df.between_time, but it only works on an index.
Ex. time range: between 18:00 and 08:00.
Data :
start stop
0 2018-07-16 16:00:00 2018-07-16 20:00:00
1 2018-07-11 08:03:00 2018-07-11 12:03:00
2 2018-07-13 17:54:00 2018-07-13 21:54:00
3 2018-07-14 13:09:00 2018-07-14 17:09:00
4 2018-07-20 00:21:00 2018-07-20 04:21:00
5 2018-07-20 17:00:00 2018-07-21 09:00:00
Expect Result :
start stop time_diff
0 2018-07-16 16:00:00 2018-07-16 20:00:00 02:00:00
1 2018-07-11 08:03:00 2018-07-11 12:03:00 0
2 2018-07-13 17:54:00 2018-07-13 21:54:00 03:54:00
3 2018-07-14 13:09:00 2018-07-14 17:09:00 0
4 2018-07-20 00:21:00 2018-07-20 04:21:00 04:00:00
5 2018-07-20 17:00:00 2018-07-21 09:00:00 14:00:00
Note: If time_diff > 1 days, I already deal with that case.
Question: Should I build a function to do this or there are pandas build-in function to do this? Any help or guide would be appreciated.
I think this can be a solution:
import numpy as np
import pandas as pd

tmp = pd.DataFrame({'time1': pd.to_datetime(['2018-07-16 16:00:00', '2018-07-11 08:03:00',
                                             '2018-07-13 17:54:00', '2018-07-14 13:09:00',
                                             '2018-07-20 00:21:00', '2018-07-20 17:00:00']),
                    'time2': pd.to_datetime(['2018-07-16 20:00:00', '2018-07-11 12:03:00',
                                             '2018-07-13 21:54:00', '2018-07-14 17:09:00',
                                             '2018-07-20 04:21:00', '2018-07-21 09:00:00'])})
time1_date = tmp.time1.dt.date.astype(str)
tmp['rule18'], tmp['rule08'] = pd.to_datetime(time1_date + ' 18:00:00'), pd.to_datetime(time1_date + ' 08:00:00')
# if stop exceeds 18:00:00, compute the time difference from that hour
tmp['time_diff_rule1'] = np.where(tmp.time2 > tmp.rule18, (tmp.time2 - tmp.rule18), (tmp.time2 - tmp.time1))
# second rule: zero out intervals that fall entirely between 08:00 and 18:00
tmp['time_diff_rule2'] = np.where((tmp.time2 < tmp.rule18) & (tmp.time1 > tmp.rule08), 0, tmp['time_diff_rule1'])
time_diff_rule1 time_diff_rule2
0 02:00:00 02:00:00
1 04:00:00 00:00:00
2 03:54:00 03:54:00
3 04:00:00 00:00:00
4 04:00:00 04:00:00
5 15:00:00 15:00:00
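The same window logic can also be written as one helper that clips each interval to its night window. A sketch, assuming the tmp frame above; night_overlap is a hypothetical name, and it assumes each interval touches at most one 18:00-08:00 window:
import pandas as pd

# overlap of [start, stop] with the 18:00-08:00 window, anchored to the
# previous evening when the interval starts before 08:00
def night_overlap(start, stop):
    if start.hour < 8:
        win_start = start.normalize() - pd.Timedelta(hours=6)   # 18:00 the day before
    else:
        win_start = start.normalize() + pd.Timedelta(hours=18)  # 18:00 the same day
    win_end = win_start + pd.Timedelta(hours=14)                # 08:00 the next morning
    return max(min(stop, win_end) - max(start, win_start), pd.Timedelta(0))

tmp['time_diff'] = [night_overlap(s, e) for s, e in zip(tmp.time1, tmp.time2)]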

Resample python list with pandas

Fairly new to python and pandas here.
I make a query that's giving me back a timeseries. I'm never sure how many data points I receive from the query (run for a single day), but what I do know is that I need to resample them to contain 24 points (one for each hour in the day).
Printing m3hstream gives
[(1479218009000L, 109), (1479287368000L, 84)]
Then I try to make a dataframe df with
df = pd.DataFrame(data = list(m3hstream), columns=['Timestamp', 'Value'])
and this gives me an output of
Timestamp Value
0 1479218009000 109
1 1479287368000 84
Following I do this
daily_summary = pd.DataFrame()
daily_summary['value'] = df['Value'].resample('H').mean()
daily_summary = daily_summary.truncate(before=start, after=end)
print "Now daily summary"
print daily_summary
But this is giving me a TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'RangeIndex'
Could anyone please let me know how to resample it so I have 1 point for each hour in the 24 hour period that I'm querying for?
Thanks.
The first thing you need to do is convert that 'Timestamp' to an actual pd.Timestamp; it looks like those are milliseconds.
Then resample with the on parameter set to 'Timestamp'
df = df.assign(
    Timestamp=pd.to_datetime(df.Timestamp, unit='ms')
).resample('H', on='Timestamp').mean().reset_index()
Timestamp Value
0 2016-11-15 13:00:00 109.0
1 2016-11-15 14:00:00 NaN
2 2016-11-15 15:00:00 NaN
3 2016-11-15 16:00:00 NaN
4 2016-11-15 17:00:00 NaN
5 2016-11-15 18:00:00 NaN
6 2016-11-15 19:00:00 NaN
7 2016-11-15 20:00:00 NaN
8 2016-11-15 21:00:00 NaN
9 2016-11-15 22:00:00 NaN
10 2016-11-15 23:00:00 NaN
11 2016-11-16 00:00:00 NaN
12 2016-11-16 01:00:00 NaN
13 2016-11-16 02:00:00 NaN
14 2016-11-16 03:00:00 NaN
15 2016-11-16 04:00:00 NaN
16 2016-11-16 05:00:00 NaN
17 2016-11-16 06:00:00 NaN
18 2016-11-16 07:00:00 NaN
19 2016-11-16 08:00:00 NaN
20 2016-11-16 09:00:00 84.0
If you want to fill those NaN values, use ffill, bfill, or interpolate
df.assign(
    Timestamp=pd.to_datetime(df.Timestamp, unit='ms')
).resample('H', on='Timestamp').mean().reset_index().interpolate()
Timestamp Value
0 2016-11-15 13:00:00 109.00
1 2016-11-15 14:00:00 107.75
2 2016-11-15 15:00:00 106.50
3 2016-11-15 16:00:00 105.25
4 2016-11-15 17:00:00 104.00
5 2016-11-15 18:00:00 102.75
6 2016-11-15 19:00:00 101.50
7 2016-11-15 20:00:00 100.25
8 2016-11-15 21:00:00 99.00
9 2016-11-15 22:00:00 97.75
10 2016-11-15 23:00:00 96.50
11 2016-11-16 00:00:00 95.25
12 2016-11-16 01:00:00 94.00
13 2016-11-16 02:00:00 92.75
14 2016-11-16 03:00:00 91.50
15 2016-11-16 04:00:00 90.25
16 2016-11-16 05:00:00 89.00
17 2016-11-16 06:00:00 87.75
18 2016-11-16 07:00:00 86.50
19 2016-11-16 08:00:00 85.25
20 2016-11-16 09:00:00 84.00
Let's try:
daily_summary = daily_summary.set_index('Timestamp')
daily_summary.index = pd.to_datetime(daily_summary.index, unit='ms')
For once an hour:
daily_summary.resample('H').mean()
or for once a day:
daily_summary.resample('D').mean()
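Since the goal is exactly 24 points for the queried day, one option (a sketch, assuming df is the two-column frame above and that start is midnight of the queried day) is to reindex the hourly resample onto a full 24-hour range:
import pandas as pd

df['Timestamp'] = pd.to_datetime(df['Timestamp'], unit='ms')
hourly = df.set_index('Timestamp')['Value'].resample('H').mean()

# force exactly 24 rows, one per hour of the queried day;
# 'start' is assumed to be that day's midnight
full_day = pd.date_range(start, periods=24, freq='H')
hourly = hourly.reindex(full_day).interpolate()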
