Pandas how to outer merge on datetime column correctly - python

I have two dataframes:
resetted.head()
WeightedSentiment Popularity Datetime
0 0 2 2012-11-22 11:00:00
1 0 2 2012-11-22 11:30:00
2 0 4 2012-11-22 12:00:00
3 0 2 2012-11-22 15:00:00
4 0 2 2012-11-22 15:30:00
prices.head()
Open High Low Close Volume Datetime
46623 236.9392 238.6095 236.5392 238.2094 315177 2012-11-23 10:00:00
46624 238.1894 238.3095 236.7492 237.4993 122132 2012-11-23 10:30:00
46625 237.4793 238.2595 237.1393 238.2094 144457 2012-11-23 11:00:00
46626 238.2094 238.9196 238.1694 238.7695 131733 2012-11-23 11:30:00
46627 238.7695 239.1396 237.9394 238.9496 150386 2012-11-23 12:00:00
I tried to outer join these two dataframes using
pd.merge(prices,resetted,how='outer',on='Datetime')
but the result looks strange and seems wrong:
Open High Low Close Volume Datetime WeightedSentiment Popularity
0 236.9392 238.6095 236.5392 238.2094 315177.0 2012-11-23 10:00:00 0.0 20.0
1 238.1894 238.3095 236.7492 237.4993 122132.0 2012-11-23 10:30:00 0.0 12.0
2 237.4793 238.2595 237.1393 238.2094 144457.0 2012-11-23 11:00:00 0.0 12.0
3 238.2094 238.9196 238.1694 238.7695 131733.0 2012-11-23 11:30:00 0.0 2.0
4 238.7695 239.1396 237.9394 238.9496 150386.0 2012-11-23 12:00:00 0.0 12.0
5 238.7995 242.0301 238.0394 241.5900 1183601.0 2012-11-23 12:30:00 0.0 16.0
If I swap the positions of the two dataframes in the merge call, there are NaNs at the head as expected, but the other rows are wrong. I have set up a demo notebook on GitHub.
I'm on pandas 0.21.0
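Not an answer from the original thread, just a hedged sketch of checks worth running before the merge. The two frames below are made-up miniatures of `prices` and `resetted`. The assumption, which has not been confirmed for the notebook in question, is that odd-looking outer-merge results often come from keys that are not real datetime64 values or from duplicated keys. `indicator=True` adds a `_merge` column that shows which side each row came from.

```python
import pandas as pd

# Hypothetical miniature versions of the two frames from the question.
resetted = pd.DataFrame({
    "WeightedSentiment": [0, 0, 0],
    "Popularity": [2, 2, 4],
    "Datetime": ["2012-11-22 11:00:00", "2012-11-23 10:00:00", "2012-11-23 10:30:00"],
})
prices = pd.DataFrame({
    "Close": [238.2094, 237.4993, 238.2094],
    "Datetime": ["2012-11-23 10:00:00", "2012-11-23 10:30:00", "2012-11-23 11:00:00"],
})

# Make sure both keys are real datetime64 values, not strings or mixed types.
for frame in (resetted, prices):
    frame["Datetime"] = pd.to_datetime(frame["Datetime"])

# Duplicated keys multiply rows in a merge, so count them first.
dup_resetted = int(resetted["Datetime"].duplicated().sum())
dup_prices = int(prices["Datetime"].duplicated().sum())

merged = (
    pd.merge(prices, resetted, how="outer", on="Datetime", indicator=True)
    .sort_values("Datetime")
    .reset_index(drop=True)
)
print(dup_resetted, dup_prices)
print(merged)
```

Rows marked `left_only` or `right_only` exist in just one frame, so NaN in the other frame's columns is expected there.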

Related

Plotting events on a line graph

I am trying to visualise rain events using data contained in a dataframe.
The idea seems very simple, but the execution seems to be impossible!
Here is part of the dataframe:
start_time end_time duration br_open_total
0 2022-01-01 10:00:00 2022-01-01 19:00:00 9.0 0.2540000563879943
1 2022-01-02 00:00:00 2022-01-02 10:00:00 10.0 1.0160002255520624
2 2022-01-02 17:00:00 2022-01-03 02:00:00 9.0 0.7620001691640113
3 2022-01-03 02:00:00 2022-01-04 12:00:00 34.0 10.668002368296513
4 2022-01-07 21:00:00 2022-01-08 06:00:00 9.0 0.2540000563879943
5 2022-01-16 05:00:00 2022-01-16 20:00:00 15.0 0.5080001127760454
6 2022-01-19 04:00:00 2022-01-19 17:00:00 13.0 0.7620001691640255
7 2022-01-21 14:00:00 2022-01-22 00:00:00 10.0 1.5240003383280751
8 2022-01-27 02:00:00 2022-01-27 16:00:00 14.0 3.0480006766561503
9 2022-02-01 12:00:00 2022-02-01 21:00:00 9.0 0.2540000563880126
10 2022-02-03 05:00:00 2022-02-03 15:00:00 10.0 0.5080001127760251
What I want to do is have a plot with time on the x axis, and the value of the 'br_open_total' on the y axis.
I can draw what I want it to look like, see below:
I apologise for the simplicity of the drawing, but I think it explains what I want to do.
How do I do this, and then repeat it for other dataframes on the same plot?
I have tried staircase, matplotlib.pyplot.stair and others with no success.
It seems such a simple concept!
Edit 1:
Tried Joswin K J's answer with the actual data, and got this:
The event at 02-12 11:00 should be 112 hours duration, but the bar is the same width as all the others.
Edit2:
Tried Mozway's answer and got this:
Still doesn't show width of each event, and doesn't discretise the events either
Edit 3:
Using Mozway's amended answer I get this plot for the actual data:
I have added the cursor position using Paint. At the top right of the plot you can see that the cursor is at 2022-02-09 and 20.34, which is actually the value for 2022-02-01, so it seems that the plot is shifted to the left by one data point. Also, the large block between 2022-3-01 and 2022-04-03 doesn't seem to be in the data.
edit 4: as requested by Mozway
Reshaped Data
duration br_open_total variable date
0 10.0 1.0160002255520624 start_time 2022-01-02 00:00:00
19 10.0 0.0 end_time 2022-01-02 10:00:00
1 9.0 0.7620001691640113 start_time 2022-01-02 17:00:00
2 34.0 10.668002368296513 start_time 2022-01-03 02:00:00
21 34.0 0.0 end_time 2022-01-04 12:00:00
3 15.0 0.5080001127760454 start_time 2022-01-16 05:00:00
22 15.0 0.0 end_time 2022-01-16 20:00:00
4 13.0 0.7620001691640255 start_time 2022-01-19 04:00:00
23 13.0 0.0 end_time 2022-01-19 17:00:00
5 10.0 1.5240003383280751 start_time 2022-01-21 14:00:00
24 10.0 0.0 end_time 2022-01-22 00:00:00
6 14.0 3.0480006766561503 start_time 2022-01-27 02:00:00
25 14.0 0.0 end_time 2022-01-27 16:00:00
7 10.0 0.5080001127760251 start_time 2022-02-03 05:00:00
26 10.0 0.0 end_time 2022-02-03 15:00:00
8 18.0 7.366001635252363 start_time 2022-02-03 23:00:00
27 18.0 0.0 end_time 2022-02-04 17:00:00
9 13.0 2.28600050749211 start_time 2022-02-05 11:00:00
28 13.0 0.0 end_time 2022-02-06 00:00:00
10 19.0 2.2860005074921173 start_time 2022-02-06 04:00:00
29 19.0 0.0 end_time 2022-02-06 23:00:00
11 13.0 1.2700002819400584 start_time 2022-02-07 11:00:00
30 13.0 0.0 end_time 2022-02-08 00:00:00
12 12.0 2.79400062026814 start_time 2022-02-09 01:00:00
31 12.0 0.0 end_time 2022-02-09 13:00:00
13 112.0 20.320004511041 start_time 2022-02-12 11:00:00
32 112.0 0.0 end_time 2022-02-17 03:00:00
14 28.0 2.0320004511041034 start_time 2022-02-18 14:00:00
33 28.0 0.0 end_time 2022-02-19 18:00:00
15 17.0 17.272003834384847 start_time 2022-02-23 17:00:00
34 17.0 0.0 end_time 2022-02-24 10:00:00
16 9.0 0.7620001691640397 start_time 2022-02-27 13:00:00
35 9.0 0.0 end_time 2022-02-27 22:00:00
17 18.0 4.0640009022082 start_time 2022-04-04 00:00:00
36 18.0 0.0 end_time 2022-04-04 18:00:00
18 15.0 1.0160002255520482 start_time 2022-04-06 05:00:00
37 15.0 0.0 end_time 2022-04-06 20:00:00
When plotted using
plt.step(bdf2['date'], bdf2['br_open_total'])
plt.gcf().set_size_inches(10, 4)
plt.xticks(rotation=90)
produces the plot shown above, in which the top left corner of a block corresponds to the previous data point.
edit 5: further info
When I plot all my dataframes (different sensors), I get the same offset on the event start and end times.
You can use a step plot:
# ensure datetime
df['start_time'] = pd.to_datetime(df['start_time'])
df['end_time'] = pd.to_datetime(df['end_time'])
# reshape the data
df2 = (df
.melt(id_vars=['duration', 'br_open_total'], value_name='date')
.sort_values(by='date')
.drop_duplicates(subset='date')
.assign(br_open_total=lambda d: d['br_open_total'].mask(d['variable'].eq('end_time'), 0))
)
# plot
import matplotlib.pyplot as plt
plt.step(df2['date'], df2['br_open_total'])
plt.gcf().set_size_inches(10, 4)
output:
reshaped data:
duration br_open_total variable date
0 9.0 0.254000 start_time 2022-01-01 10:00:00
11 9.0 0.000000 end_time 2022-01-01 19:00:00
1 10.0 1.016000 start_time 2022-01-02 00:00:00
12 10.0 0.000000 end_time 2022-01-02 10:00:00
2 9.0 0.762000 start_time 2022-01-02 17:00:00
3 34.0 10.668002 start_time 2022-01-03 02:00:00
14 34.0 0.000000 end_time 2022-01-04 12:00:00
4 9.0 0.254000 start_time 2022-01-07 21:00:00
15 9.0 0.000000 end_time 2022-01-08 06:00:00
5 15.0 0.508000 start_time 2022-01-16 05:00:00
16 15.0 0.000000 end_time 2022-01-16 20:00:00
6 13.0 0.762000 start_time 2022-01-19 04:00:00
17 13.0 0.000000 end_time 2022-01-19 17:00:00
7 10.0 1.524000 start_time 2022-01-21 14:00:00
18 10.0 0.000000 end_time 2022-01-22 00:00:00
8 14.0 3.048001 start_time 2022-01-27 02:00:00
19 14.0 0.000000 end_time 2022-01-27 16:00:00
9 9.0 0.254000 start_time 2022-02-01 12:00:00
20 9.0 0.000000 end_time 2022-02-01 21:00:00
10 10.0 0.508000 start_time 2022-02-03 05:00:00
21 10.0 0.000000 end_time 2022-02-03 15:00:00
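A possible explanation for the one-point shift reported in the question's edits: `plt.step` defaults to `where='pre'`, which holds each y value over the interval that ends at its x position, so every block is drawn before its start time. The sketch below repeats the reshape on two made-up events and passes `where='post'` so each value is held from its own timestamp until the next one. The `Agg` backend is only an assumption to let the sketch run without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, assumed here so no display is needed
import matplotlib.pyplot as plt
import pandas as pd

# Two made-up events in the same layout as the question's dataframe.
df = pd.DataFrame({
    "start_time": ["2022-01-01 10:00:00", "2022-01-02 00:00:00"],
    "end_time": ["2022-01-01 19:00:00", "2022-01-02 10:00:00"],
    "duration": [9.0, 10.0],
    "br_open_total": [0.254, 1.016],
})
df["start_time"] = pd.to_datetime(df["start_time"])
df["end_time"] = pd.to_datetime(df["end_time"])

# One row per start and per end; the value drops to 0 at each end_time.
df2 = (
    df.melt(id_vars=["duration", "br_open_total"], value_name="date")
      .sort_values(by="date")
      .assign(br_open_total=lambda d: d["br_open_total"].mask(d["variable"].eq("end_time"), 0))
)

fig, ax = plt.subplots(figsize=(10, 4))
# where='post': y[i] is held over the interval from x[i] to x[i+1].
lines = ax.step(df2["date"], df2["br_open_total"], where="post")
ax.tick_params(axis="x", rotation=90)
```

With `where='post'` the top-left corner of each block sits at the event's own start time rather than at the previous data point.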
Try this:
import matplotlib.pyplot as plt
import pandas as pd
# draw the top and both sides of each event as separate line segments
for ind, row in df.iterrows():
    plt.plot(pd.Series([row['start_time'], row['end_time']]), pd.Series([row['br_open_total'], row['br_open_total']]), color='b')
    plt.plot(pd.Series([row['start_time'], row['start_time']]), pd.Series([0, row['br_open_total']]), color='b')
    plt.plot(pd.Series([row['end_time'], row['end_time']]), pd.Series([0, row['br_open_total']]), color='b')
plt.xticks(rotation=90)
Result:
I believe I have now cracked it, with a great debt of thanks to @Mozway.
The code to restructure the dataframe for plotting:
#create dataframes of each open gauge events removing any event with an open total of less than 0.254mm
#bresser/open
bdftdf=bdf.loc[bdf['br_open_total'] > 0.255]
bdftdf=bdftdf.copy()
bdftdf['start_time'] = pd.to_datetime(bdftdf['start_time'])
bdftdf['end_time'] = pd.to_datetime(bdftdf['end_time'])
bdf2 = (bdftdf
.melt(id_vars=['duration', 'ic_total','mc_total','md_total','imd_total','oak_total','highpoint_total','school_total','br_open_total',
'fr_gauge_total','open_mean_total','br_open_ic_%_int','br_open_mc_%_int','br_open_md_%_int','br_open_imd_%_int',
'br_open_oak_%_int'], value_name='date')
.sort_values(by='date')
#.drop_duplicates(subset='date')
.assign(br_open_total=lambda d: d['br_open_total'].mask(d['variable'].eq('end_time'), 0))
)
#create array for stairs plot
bdfarr=np.array(bdf2['date'])
bl=len(bdf2)
bdfarr=np.append(bdfarr,[bdfarr[bl-1]+np.timedelta64(1,'h')])
Rather than use the plt.step plot as suggested by Mozway, I have used plt.stairs, after creating an array of the 'date' column in the dataframe and appending an extra element to that array equal to the last element plus 1 hour.
This means that the data now plots as I had intended:
code for plot:
fig1=plt.figure()
plt.stairs(bdf2['br_open_total'], bdfarr, label=r'Bresser\Open')
plt.stairs(frdf2['fr_gauge_total'], frdfarr, label='FR Gauge')
plt.stairs(hpdf2['highpoint_total'], hpdfarr, label='Highpoint')
plt.stairs(schdf2['school_total'], schdfarr, label='School')
plt.stairs(opmedf2['open_mean_total'], opmedfarr, label='Open mean')
plt.xticks(rotation=90)
plt.legend(title='Rain events', loc='best')
plt.show()
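For anyone wondering why the extra array element is needed: `plt.stairs` takes N values and N + 1 edges, so each value gets its own left and right edge. Below is a minimal sketch of that bookkeeping on four made-up rows in the reshaped layout. The column names follow the code above, the data and the `Agg` backend are assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, assumed here so no display is needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up reshaped data: a start row carries the total, an end row carries 0.
bdf2 = pd.DataFrame({
    "date": pd.to_datetime([
        "2022-01-02 00:00", "2022-01-02 10:00",
        "2022-01-03 02:00", "2022-01-04 12:00",
    ]),
    "br_open_total": [1.016, 0.0, 10.668, 0.0],
})

values = bdf2["br_open_total"].to_numpy()
edges = bdf2["date"].to_numpy()
# Append one closing edge, an hour after the last timestamp, to get N + 1 edges.
edges = np.append(edges, edges[-1] + np.timedelta64(1, "h"))

fig, ax = plt.subplots(figsize=(10, 4))
ax.stairs(values, edges, label="Bresser/Open")
ax.tick_params(axis="x", rotation=90)
ax.legend(title="Rain events", loc="best")
```

Each value is drawn between its own edge and the next one, which is why the blocks line up with the event start and end times.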

Pandas Resample 5 mins data to Hourly average : Date issue [duplicate]

I am trying to resample a timeseries data from 5 min frequency to hourly average.
df = pd.read_csv("my_data.csv", index_col=False, usecols=['A','B','C'])
output:
A B C
0 16-01-21 0:00 95.75 0.0
1 16-01-21 0:05 90.10 0.0
2 16-01-21 0:10 86.26 0.0
3 16-01-21 0:15 92.72 0.0
4 16-01-21 0:20 81.54 0.0
df.A= pd.to_datetime(df.A)
Output:
A B C
0 2021-01-16 00:00:00 95.75 0.0
1 2021-01-16 00:05:00 90.10 0.0
2 2021-01-16 00:10:00 86.26 0.0
3 2021-01-16 00:15:00 92.72 0.0
4 2021-01-16 00:20:00 81.54 0.0
Now I set the Timestamp column as index,
df.set_index('A', inplace=True)
And when I try to resample with
df2 = df.resample('H').mean()
I am getting this,
B C
A
2021-01-02 00:00:00 79.970278 0.0
2021-01-02 01:00:00 77.951667 0.0
2021-01-02 02:00:00 77.610556 0.0
2021-01-02 03:00:00 80.800000 0.0
2021-01-02 04:00:00 84.305000 0.0
I was expecting timestamps like these, with the average values for each hour:
A B C
2021-01-16 00:00:00 79.970278 0.0
2021-01-16 01:00:00 77.951667 0.0
2021-01-16 02:00:00 77.610556 0.0
2021-01-16 03:00:00 80.800000 0.0
2021-01-16 04:00:00 84.305000 0.0
I am not sure where I am making a mistake. Help me out.
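I think the problem here is that some datetimes are converted wrongly. Ambiguous dates are read month first by default, while dates such as 16-01-21, which cannot be month first, still parse as day first:
# default is month first in df.A = pd.to_datetime(df.A)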
01-02-21 -> 2021-01-02
Possible solutions:
df.A= pd.to_datetime(df.A, dayfirst=True)
Or:
df = pd.read_csv("my_data.csv",
index_col=False,
usecols=['A','B','C'],
parse_dates=['A'],
dayfirst=True)
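Another option, sketched here as an assumption about the file layout rather than something from the question: when every value has the same shape (day-month-year with a two-digit year and an unpadded hour), an explicit `format` removes the guessing entirely. The sample values are made up and chosen to be ambiguous on purpose.

```python
import pandas as pd

# Made-up sample in the same layout as column A; "01-02-21" is ambiguous.
df = pd.DataFrame({
    "A": ["01-02-21 0:00", "01-02-21 0:05", "01-02-21 1:00"],
    "B": [95.75, 90.10, 86.26],
})

# Explicit format: day, month, two-digit year, then hour and minute.
df["A"] = pd.to_datetime(df["A"], format="%d-%m-%y %H:%M")

# Hourly average; "60min" is used as the bin width for the one-hour buckets.
hourly = df.set_index("A").resample("60min").mean()
print(hourly)
```

The index now starts on 1 February 2021, not 2 January, and the first bucket averages the two readings taken before 01:00.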

How to impute missing value in time series data with the value of the same day and time from the previous week(day) in python

I have a dataframe with timestamp and energy usage columns. A reading is taken every minute of the day, i.e. a total of 1440 readings per day. I have a few missing values in the dataframe.
I want to impute those missing values with the mean of the same day and same time from the last two or three weeks. This way, if the previous week is also missing, I can use the value from two weeks ago.
Here's an example of the data:
mains_1
timestamp
2013-01-03 00:00:00 155.00
2013-01-03 00:01:00 154.00
2013-01-03 00:02:00 NaN
2013-01-03 00:03:00 154.00
2013-01-03 00:04:00 153.00
... ...
2013-04-30 23:55:00 NaN
2013-04-30 23:56:00 182.00
2013-04-30 23:57:00 181.00
2013-04-30 23:58:00 182.00
2013-04-30 23:59:00 182.00
Right now I have this line of code:
df['mains_1'] = (df
.groupby((df.index.dayofweek * 24) + (df.index.hour) + (df.index.minute / 60))
.transform(lambda x: x.fillna(x.mean()))
)
This fills each gap with the average usage at the same weekday and time of day across the whole dataset. I want it to be more precise and use only the average of the last two or three weeks.
You can concat the Series with shifted copies of itself built in a loop; index alignment ensures that each row is matched with the same time in the previous weeks. Then take the mean and use .fillna to update the original.
Sample Data
import pandas as pd
import numpy as np
np.random.seed(5)
df = pd.DataFrame(index=pd.date_range('2010-01-01 10:00:00', freq='W', periods=10),
                  data=np.random.choice([1, 2, 3, 4, np.nan], 10),
                  columns=['mains_1'])
# mains_1
#2010-01-03 10:00:00 4.0
#2010-01-10 10:00:00 1.0
#2010-01-17 10:00:00 2.0
#2010-01-24 10:00:00 1.0
#2010-01-31 10:00:00 NaN
#2010-02-07 10:00:00 4.0
#2010-02-14 10:00:00 1.0
#2010-02-21 10:00:00 1.0
#2010-02-28 10:00:00 NaN
#2010-03-07 10:00:00 2.0
Code
# range(4) gives the current value (shift 0) plus the previous 3 weeks.
df1 = pd.concat([df.shift(periods=x, freq='W') for x in range(4)], axis=1)
# mains_1 mains_1 mains_1 mains_1
#2010-01-03 10:00:00 4.0 NaN NaN NaN
#2010-01-10 10:00:00 1.0 4.0 NaN NaN
#2010-01-17 10:00:00 2.0 1.0 4.0 NaN
#2010-01-24 10:00:00 1.0 2.0 1.0 4.0
#2010-01-31 10:00:00 NaN 1.0 2.0 1.0
#2010-02-07 10:00:00 4.0 NaN 1.0 2.0
#2010-02-14 10:00:00 1.0 4.0 NaN 1.0
#2010-02-21 10:00:00 1.0 1.0 4.0 NaN
#2010-02-28 10:00:00 NaN 1.0 1.0 4.0
#2010-03-07 10:00:00 2.0 NaN 1.0 1.0
#2010-03-14 10:00:00 NaN 2.0 NaN 1.0
#2010-03-21 10:00:00 NaN NaN 2.0 NaN
#2010-03-28 10:00:00 NaN NaN NaN 2.0
df['mains_1'] = df['mains_1'].fillna(df1.mean(axis=1))
print(df)
mains_1
2010-01-03 10:00:00 4.000000
2010-01-10 10:00:00 1.000000
2010-01-17 10:00:00 2.000000
2010-01-24 10:00:00 1.000000
2010-01-31 10:00:00 1.333333
2010-02-07 10:00:00 4.000000
2010-02-14 10:00:00 1.000000
2010-02-21 10:00:00 1.000000
2010-02-28 10:00:00 2.000000
2010-03-07 10:00:00 2.000000
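A hedged variation for data that is not sampled on Sundays, such as the minute-level series in the question: as far as I understand, `freq='W'` is a Sunday-anchored offset, so shifting by it may not move every timestamp by exactly seven days. The sketch below shifts by a fixed `pd.Timedelta(weeks=k)` and averages only the previous one to three weeks. The hourly index and the values are made up to keep the example small.

```python
import numpy as np
import pandas as pd

# Made-up hourly series covering four weeks; the value equals the row position.
idx = pd.date_range("2013-01-03 00:00", periods=4 * 7 * 24, freq="60min")
df = pd.DataFrame({"mains_1": np.arange(len(idx), dtype=float)}, index=idx)

# Knock out one reading that has three full weeks of history behind it.
target = pd.Timestamp("2013-01-24 00:00")
df.loc[target, "mains_1"] = np.nan

# Shift copies forward by exactly 1, 2 and 3 weeks so they align with "now".
previous = pd.concat(
    [df["mains_1"].shift(freq=pd.Timedelta(weeks=k)) for k in (1, 2, 3)],
    axis=1,
)

# Row-wise mean skips NaN, so a missing week does not block the fill.
df["mains_1"] = df["mains_1"].fillna(previous.mean(axis=1))
print(df.loc[target])
```

The gap is filled with the mean of the readings at the same weekday and time one, two and three weeks earlier; readings that were already present are left alone.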

Reverse position of entries in pandas dataframe based on condition

Here I have an extract from my pandas dataframe, which is survey data with two datetime fields. It appears that some of the start and end times were filled in the wrong position in the survey. Here is an example from my dataframe. I suspect the start and end times in the 8th row were entered the wrong way round.
Just to give context, I generated the third column like this:
df_time['trip_duration'] = df_time['tripEnd_time'] - df_time['tripStart_time']
The three columns are in timedelta64 format.
Here is the top of my dataframe:
tripStart_time tripEnd_time trip_duration
1 22:30:00 23:15:00 00:45:00
2 11:00:00 11:30:00 00:30:00
3 09:00:00 09:15:00 00:15:00
4 13:30:00 14:25:00 00:55:00
5 09:00:00 10:15:00 01:15:00
6 12:00:00 12:15:00 00:15:00
7 08:00:00 08:30:00 00:30:00
8 11:00:00 09:15:00 -1 days +22:15:00
9 14:00:00 14:30:00 00:30:00
10 14:55:00 15:20:00 00:25:00
What I am trying to do is loop through these two columns and, wherever 'tripEnd_time' is less than 'tripStart_time', swap the two entries. So in the case of row 8 above, I would set tripStart_time = tripEnd_time and tripEnd_time = tripStart_time.
I am not sure of the best way to approach this. Should I use a nested for loop where I compare each entry in the two columns?
Thanks
Use Series.abs:
df_time['trip_duration'] = (df_time['tripEnd_time'] - df_time['tripStart_time']).abs()
print (df_time)
1 22:30:00 23:15:00 00:45:00
2 11:00:00 11:30:00 00:30:00
3 09:00:00 09:15:00 00:15:00
4 13:30:00 14:25:00 00:55:00
5 09:00:00 10:15:00 01:15:00
6 12:00:00 12:15:00 00:15:00
7 08:00:00 08:30:00 00:30:00
8 11:00:00 09:15:00 01:45:00
9 14:00:00 14:30:00 00:30:00
10 14:55:00 15:20:00 00:25:00
Which is the same as:
a = df_time['tripEnd_time'] - df_time['tripStart_time']
b = df_time['tripStart_time'] - df_time['tripEnd_time']
mask = df_time['tripEnd_time'] > df_time['tripStart_time']
df_time['trip_duration'] = np.where(mask, a, b)
print (df_time)
tripStart_time tripEnd_time trip_duration
1 22:30:00 23:15:00 00:45:00
2 11:00:00 11:30:00 00:30:00
3 09:00:00 09:15:00 00:15:00
4 13:30:00 14:25:00 00:55:00
5 09:00:00 10:15:00 01:15:00
6 12:00:00 12:15:00 00:15:00
7 08:00:00 08:30:00 00:30:00
8 11:00:00 09:15:00 01:45:00
9 14:00:00 14:30:00 00:30:00
10 14:55:00 15:20:00 00:25:00
You can switch column values on selected rows:
# rows where the end time precedes the start time
swapped = df_time['tripEnd_time'] < df_time['tripStart_time']
# swap the two columns on those rows only
df_time.loc[swapped, ['tripStart_time', 'tripEnd_time']] = df_time.loc[
    swapped, ['tripEnd_time', 'tripStart_time']].values
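An equivalent sketch without a loop or positional assignment, using `Series.where` on three made-up rows: both corrected columns are computed before either is written back, so the swap cannot overwrite its own input. The duration is then recomputed from the corrected columns.

```python
import pandas as pd

# Made-up extract; the second row has start and end entered the wrong way round.
df_time = pd.DataFrame({
    "tripStart_time": pd.to_timedelta(["08:00:00", "11:00:00", "14:00:00"]),
    "tripEnd_time": pd.to_timedelta(["08:30:00", "09:15:00", "14:30:00"]),
})

# True where the end time precedes the start time.
swapped = df_time["tripEnd_time"] < df_time["tripStart_time"]

# Keep the original value where the row is fine, take the other column otherwise.
new_start = df_time["tripStart_time"].where(~swapped, df_time["tripEnd_time"])
new_end = df_time["tripEnd_time"].where(~swapped, df_time["tripStart_time"])

df_time["tripStart_time"] = new_start
df_time["tripEnd_time"] = new_end
df_time["trip_duration"] = df_time["tripEnd_time"] - df_time["tripStart_time"]
print(df_time)
```

Unlike taking the absolute value of the duration, this also corrects the start and end columns themselves.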

Pandas: correctly resampling data at the hourly frequency

In Python 3.6.3, I have the following dataframe df1:
dt Val
2017-04-10 08:00:00 8.0
2017-04-10 09:00:00 2.0
2017-04-10 10:00:00 7.0
2017-04-11 08:00:00 3.0
2017-04-11 09:00:00 0.0
2017-04-11 10:00:00 5.0
2017-11-26 08:00:00 8.0
2017-11-26 09:00:00 1.0
2017-11-26 10:00:00 2.0
I am trying to compute the hourly average of these values, so as to have:
Hour Val
08:00:00 7.00
09:00:00 1.00
10:00:00 4.66
My attempt:
df2 = df1.resample('H')['Val'].mean()
This returns the same dataset as df1. What am I doing wrong?
Inspired by the comments above, I tested that the following works for me:
df1.groupby(df1.index.hour).Val.mean()
Or you can make the index values 'timedelta' dtypes:
df1.Val.groupby(df1.index.hour.astype('timedelta64[h]')).mean()
dt
08:00:00 6.333333
09:00:00 1.000000
10:00:00 4.666667
Name: Val, dtype: float64
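To spell out the difference: `resample('H')` buckets by calendar hour, so 08:00 on 10 April and 08:00 on 11 April stay in separate bins, whereas grouping on `index.hour` pools every day together. Below is a self-contained sketch on the question's data. The conversion with `pd.to_timedelta(..., unit='h')` is swapped in for the `astype('timedelta64[h]')` call, on the assumption that some newer pandas versions may not accept that hour-resolution cast.

```python
import pandas as pd

# The data from the question, indexed by timestamp.
df1 = pd.DataFrame(
    {"Val": [8.0, 2.0, 7.0, 3.0, 0.0, 5.0, 8.0, 1.0, 2.0]},
    index=pd.to_datetime([
        "2017-04-10 08:00:00", "2017-04-10 09:00:00", "2017-04-10 10:00:00",
        "2017-04-11 08:00:00", "2017-04-11 09:00:00", "2017-04-11 10:00:00",
        "2017-11-26 08:00:00", "2017-11-26 09:00:00", "2017-11-26 10:00:00",
    ]),
)
df1.index.name = "dt"

# Pool all days together by hour of day, then average.
by_hour = df1.groupby(df1.index.hour)["Val"].mean()

# Optional: show the integer hours 8, 9, 10 as timedeltas such as 08:00:00.
by_hour.index = pd.to_timedelta(by_hour.index, unit="h")
print(by_hour)
```

The 08:00 average comes out as 6.33 (the mean of 8, 3 and 8), matching the output above.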
