I have a datatime dataframe. I want to compare it with a reference date and assign before it is less than and after if greater.
My code:
df = pd.DataFrame({'A':np.arange(1.0,9.0)},index=pd.date_range(start='2020-05-04 08:00:00', freq='1d', periods=8))
df=
A
2020-05-04 08:00:00 1.0
2020-05-05 08:00:00 2.0
2020-05-06 08:00:00 3.0
2020-05-07 08:00:00 4.0
2020-05-08 08:00:00 5.0
2020-05-09 08:00:00 6.0
2020-05-10 08:00:00 7.0
2020-05-11 08:00:00 8.0
ref_date = '2020-05-08'
Expected answer
df=
A Condi.
2020-05-04 08:00:00 1.0 Before
2020-05-05 08:00:00 2.0 Before
2020-05-06 08:00:00 3.0 Before
2020-05-07 08:00:00 4.0 Before
2020-05-08 08:00:00 5.0 After
2020-05-09 08:00:00 6.0 After
2020-05-10 08:00:00 7.0 After
2020-05-11 08:00:00 8.0 After
My solution:
df['Cond.'] = = ['After' if df.index>=(ref_date)=='True' else 'Before cleaning']
Present answer
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
ref_date = "2020-05-08"
df["Cond"] = np.where(df.index < ref_date, "Before", "After")
print(df)
Prints:
A Cond
2020-05-04 08:00:00 1.0 Before
2020-05-05 08:00:00 2.0 Before
2020-05-06 08:00:00 3.0 Before
2020-05-07 08:00:00 4.0 Before
2020-05-08 08:00:00 5.0 After
2020-05-09 08:00:00 6.0 After
2020-05-10 08:00:00 7.0 After
2020-05-11 08:00:00 8.0 After
Helping you fix the list comprehension approach you took, you can use:
df['Cond'] = ['After' if x >= pd.to_datetime(ref_date) else 'Before' for x in df.index]
Related
I have a dataframe which has two columns. date and value.
import pandas as pd
import numpy as np
df = pd.DataFrame()
df['date'] = ['2020-03-01 00:00:00','2020-03-01 00:00:15', '2020-03-01 00:00:30', '2020-03-02 00:00:00','2020-03-02 00:00:15', '2020-03-02 00:00:30' , '2020-03-03 00:00:15', '2020-03-03 00:00:30', '2020-03-05 00:00:00', '2020-03-05 00:00:30']
df['value'] = [1, 2, 3, 4, 5, 6, 1, 2, 3, 4]
df
date value
0 2020-03-01 00:00:00 1
1 2020-03-01 00:00:15 2
2 2020-03-01 00:00:30 3
3 2020-03-02 00:00:00 4
4 2020-03-02 00:00:15 5
5 2020-03-02 00:00:30 6
6 2020-03-03 00:00:15 1
7 2020-03-03 00:00:30 2
8 2020-03-05 00:00:00 3
9 2020-03-05 00:00:30 4
in the date column, I have some missing values (I want all the days, like 1-2-3-4-... but in this example I dont have day 2020-03-4, so I put nan for that), so I want to build this df at first which shows me the days which I dont have their data:
day 00:00:00 00:00:15 00:00:30
0 2020-03-01 1.0 2.0 3.0
1 2020-03-02 4.0 5.0 6.0
2 2020-03-03 NaN 1.0 2.0
3 2020-03-04 NaN NaN NaN
4 2020-03-05 3.0 NaN 4.0
Then replace the Nan values with mean of columns, like:
day 00:00:00 00:00:15 00:00:30
0 2020-03-01 1.000000 2.000000 3.000000
1 2020-03-02 4.000000 5.000000 6.000000
2 2020-03-03 2.666667 1.000000 2.000000
3 2020-03-04 2.666667 2.666667 2.666667
4 2020-03-05 3.000000 5.000000 4.000000
And then build one df with one row as(the name of columns is not important)
1 2 3 4 5 6 7 8 9 10 11 12 13 14
0 1 2 4 5 6 2.67 1 2 2.67 2.67 2.67 3 5 4
I am working with pivot and groupby, but I could not solve it. Especially the missing date. Could you please help me with that?
you can use resample():
df['date']=pd.to_datetime(df['date'])
dfx=df.set_index('date').resample('15S').first()
We got the distribution of all hours of the day. But we only need values between 00:00:00 and 00:00:30.
dfx = dfx.between_time("00:00:00", "00:00:30").reset_index()
print(dfx)
'''
date value
0 2020-03-01 00:00:00 1.0
1 2020-03-01 00:00:15 2.0
2 2020-03-01 00:00:30 3.0
3 2020-03-02 00:00:00 4.0
4 2020-03-02 00:00:15 5.0
5 2020-03-02 00:00:30 6.0
6 2020-03-03 00:00:00
7 2020-03-03 00:00:15 1.0
8 2020-03-03 00:00:30 2.0
9 2020-03-04 00:00:00
10 2020-03-04 00:00:15
11 2020-03-04 00:00:30
12 2020-03-05 00:00:00 3.0
13 2020-03-05 00:00:15
14 2020-03-05 00:00:30 4.0
'''
Then i convert times into columns using crosstab:
dfx=pd.crosstab(dfx['date'].dt.date, dfx['date'].dt.time,values=dfx['value'],aggfunc='sum',dropna=False)
print(dfx)
'''
date 00:00:00 00:00:15 00:00:30
date
2020-03-01 1.0 2.0 3.0
2020-03-02 4.0 5.0 6.0
2020-03-03 0.0 1.0 2.0
2020-03-04 0.0 0.0 0.0
2020-03-05 3.0 0.0 4.0
'''
Values with 0 are times that are not in the data set. I replace them with nan and populate them with the column averages:
dfx=dfx.replace(0,np.nan)
for i in dfx.columns:
dfx[i]=dfx[i].fillna(dfx[i].mean())
print(dfx)
'''
date 00:00:00 00:00:15 00:00:30
date
2020-03-01 1.000000 2.000000 3.00
2020-03-02 4.000000 5.000000 6.00
2020-03-03 2.666667 1.000000 2.00
2020-03-04 2.666667 2.666667 3.75
2020-03-05 3.000000 2.666667 4.00
'''
I did not fully understand what you want at the last stage, if you write it in detail, I will edit my answer.
I am trying to visualise rain events using a data contained in a dataframe.
the idea seems very simple, but the execution seems to be impossible!
here is a part of the dataframe:
start_time end_time duration br_open_total
0 2022-01-01 10:00:00 2022-01-01 19:00:00 9.0 0.2540000563879943
1 2022-01-02 00:00:00 2022-01-02 10:00:00 10.0 1.0160002255520624
2 2022-01-02 17:00:00 2022-01-03 02:00:00 9.0 0.7620001691640113
3 2022-01-03 02:00:00 2022-01-04 12:00:00 34.0 10.668002368296513
4 2022-01-07 21:00:00 2022-01-08 06:00:00 9.0 0.2540000563879943
5 2022-01-16 05:00:00 2022-01-16 20:00:00 15.0 0.5080001127760454
6 2022-01-19 04:00:00 2022-01-19 17:00:00 13.0 0.7620001691640255
7 2022-01-21 14:00:00 2022-01-22 00:00:00 10.0 1.5240003383280751
8 2022-01-27 02:00:00 2022-01-27 16:00:00 14.0 3.0480006766561503
9 2022-02-01 12:00:00 2022-02-01 21:00:00 9.0 0.2540000563880126
10 2022-02-03 05:00:00 2022-02-03 15:00:00 10.0 0.5080001127760251
What I want to do is have a plot with time on the x axis, and the value of the 'br_open_total' on the y axis.
I can draw what I want it to look like, see below:
I apologise for the simplicity of the drawing, but I think it explains what I want to do.
How do I do this, and then repeat for other dataframes on the same plot.
I have tried staircase, matplotlib.pyplot.stair and others with no success.
It seems such a simple concept!
Edit 1:
Tried Joswin K J's answer with the actual data, and got this:
The event at 02-12 11:00 should be 112 hours duration, but the bar is the same width as all the others.
Edit2:
Tried Mozway's answer and got this:
Still doesn't show width of each event, and doesn't discretise the events either
Edit 3:
Using Mozway's amended answer I get this plot for the actual data:
I have added the cursor position using paint, at the top right of the plot you can see that the cursor is at 2022-02-09 and 20.34, which is actually the value for 2022-02-01, so it seems that the plot is shifted to the left by one data point?, also the large block between 2022-3-01 and 2022-04-03 doesn't seem to be in the data
edit 4: as requested by Mozway
Reshaped Data
duration br_open_total variable date
0 10.0 1.0160002255520624 start_time 2022-01-02 00:00:00
19 10.0 0.0 end_time 2022-01-02 10:00:00
1 9.0 0.7620001691640113 start_time 2022-01-02 17:00:00
2 34.0 10.668002368296513 start_time 2022-01-03 02:00:00
21 34.0 0.0 end_time 2022-01-04 12:00:00
3 15.0 0.5080001127760454 start_time 2022-01-16 05:00:00
22 15.0 0.0 end_time 2022-01-16 20:00:00
4 13.0 0.7620001691640255 start_time 2022-01-19 04:00:00
23 13.0 0.0 end_time 2022-01-19 17:00:00
5 10.0 1.5240003383280751 start_time 2022-01-21 14:00:00
24 10.0 0.0 end_time 2022-01-22 00:00:00
6 14.0 3.0480006766561503 start_time 2022-01-27 02:00:00
25 14.0 0.0 end_time 2022-01-27 16:00:00
7 10.0 0.5080001127760251 start_time 2022-02-03 05:00:00
26 10.0 0.0 end_time 2022-02-03 15:00:00
8 18.0 7.366001635252363 start_time 2022-02-03 23:00:00
27 18.0 0.0 end_time 2022-02-04 17:00:00
9 13.0 2.28600050749211 start_time 2022-02-05 11:00:00
28 13.0 0.0 end_time 2022-02-06 00:00:00
10 19.0 2.2860005074921173 start_time 2022-02-06 04:00:00
29 19.0 0.0 end_time 2022-02-06 23:00:00
11 13.0 1.2700002819400584 start_time 2022-02-07 11:00:00
30 13.0 0.0 end_time 2022-02-08 00:00:00
12 12.0 2.79400062026814 start_time 2022-02-09 01:00:00
31 12.0 0.0 end_time 2022-02-09 13:00:00
13 112.0 20.320004511041 start_time 2022-02-12 11:00:00
32 112.0 0.0 end_time 2022-02-17 03:00:00
14 28.0 2.0320004511041034 start_time 2022-02-18 14:00:00
33 28.0 0.0 end_time 2022-02-19 18:00:00
15 17.0 17.272003834384847 start_time 2022-02-23 17:00:00
34 17.0 0.0 end_time 2022-02-24 10:00:00
16 9.0 0.7620001691640397 start_time 2022-02-27 13:00:00
35 9.0 0.0 end_time 2022-02-27 22:00:00
17 18.0 4.0640009022082 start_time 2022-04-04 00:00:00
36 18.0 0.0 end_time 2022-04-04 18:00:00
18 15.0 1.0160002255520482 start_time 2022-04-06 05:00:00
37 15.0 0.0 end_time 2022-04-06 20:00:00
when plotted using
plt.step(bdf2['date'], bdf2['br_open_total'])
plt.gcf().set_size_inches(10, 4)
plt.xticks(rotation=90)
produces the plot shown above, in which the top left corner of a block corresponds to the previous data point.
edit 5: further info
When I plot all my dataframes (different sensors) I get the same differential on the event start and end times?
You can use a step plot:
# ensure datetime
df['start_time'] = pd.to_datetime(df['start_time'])
df['end_time'] = pd.to_datetime(df['end_time'])
# reshape the data
df2 = (df
.melt(id_vars=['duration', 'br_open_total'], value_name='date')
.sort_values(by='date')
.drop_duplicates(subset='date')
.assign(br_open_total=lambda d: d['br_open_total'].mask(d['variable'].eq('end_time'), 0))
)
# plot
import matplotlib.pyplot as plt
plt.step(df2['date'], df2['br_open_total'])
plt.gcf().set_size_inches(10, 4)
output:
reshaped data:
duration br_open_total variable date
0 9.0 0.254000 start_time 2022-01-01 10:00:00
11 9.0 0.000000 end_time 2022-01-01 19:00:00
1 10.0 1.016000 start_time 2022-01-02 00:00:00
12 10.0 0.000000 end_time 2022-01-02 10:00:00
2 9.0 0.762000 start_time 2022-01-02 17:00:00
3 34.0 10.668002 start_time 2022-01-03 02:00:00
14 34.0 0.000000 end_time 2022-01-04 12:00:00
4 9.0 0.254000 start_time 2022-01-07 21:00:00
15 9.0 0.000000 end_time 2022-01-08 06:00:00
5 15.0 0.508000 start_time 2022-01-16 05:00:00
16 15.0 0.000000 end_time 2022-01-16 20:00:00
6 13.0 0.762000 start_time 2022-01-19 04:00:00
17 13.0 0.000000 end_time 2022-01-19 17:00:00
7 10.0 1.524000 start_time 2022-01-21 14:00:00
18 10.0 0.000000 end_time 2022-01-22 00:00:00
8 14.0 3.048001 start_time 2022-01-27 02:00:00
19 14.0 0.000000 end_time 2022-01-27 16:00:00
9 9.0 0.254000 start_time 2022-02-01 12:00:00
20 9.0 0.000000 end_time 2022-02-01 21:00:00
10 10.0 0.508000 start_time 2022-02-03 05:00:00
21 10.0 0.000000 end_time 2022-02-03 15:00:00
Try this:
import matplotlib.pyplot as plt
for ind,row in df.iterrows():
plt.plot(pd.Series([row['start_time'],row['end_time']]),pd.Series([row['br_open_total'],row['br_open_total']]),color='b')
plt.plot(pd.Series([row['start_time'],row['start_time']]),pd.Series([0,row['br_open_total']]),color='b')
plt.plot(pd.Series([row['end_time'],row['end_time']]),pd.Series([0,row['br_open_total']]),color='b')
plt.xticks(rotation=90)
Result:
I believe I have now cracked it, with a great debt of thanks to #Mozway.
The code to restructure the dataframe for plotting:
#create dataframes of each open gauge events removing any event with an open total of less than 0.254mm
#bresser/open
bdftdf=bdf.loc[bdf['br_open_total'] > 0.255]
bdftdf=bdftdf.copy()
bdftdf['start_time'] = pd.to_datetime(bdftdf['start_time'])
bdftdf['end_time'] = pd.to_datetime(bdftdf['end_time'])
bdf2 = (bdftdf
.melt(id_vars=['duration', 'ic_total','mc_total','md_total','imd_total','oak_total','highpoint_total','school_total','br_open_total',
'fr_gauge_total','open_mean_total','br_open_ic_%_int','br_open_mc_%_int','br_open_md_%_int','br_open_imd_%_int',
'br_open_oak_%_int'], value_name='date')
.sort_values(by='date')
#.drop_duplicates(subset='date')
.assign(br_open_total=lambda d: d['br_open_total'].mask(d['variable'].eq('end_time'), 0))
)
#create array for stairs plot
bdfarr=np.array(bdf2['date'])
bl=len(bdf2)
bdfarr=np.append(bdfarr,[bdfarr[bl-1]+np.timedelta64(1,'h')])
Rather than use the plt.step plot as suggested by Mozway, I have used plt.stairs, after creating an array of the 'date' column in the dataframe and appending an extra element to that array equal to the last element =1hour.
This means that the data now plots as I had intended it to.:
code for plot:
fig1=plt.figure()
plt.stairs(bdf2['br_open_total'], bdfarr, label='Bresser\Open')
plt.stairs(frdf2['fr_gauge_total'], frdfarr, label='FR Gauge')
plt.stairs(hpdf2['highpoint_total'], hpdfarr, label='Highpoint')
plt.stairs(schdf2['school_total'], schdfarr, label='School')
plt.stairs(opmedf2['open_mean_total'], opmedfarr, label='Open mean')
plt.xticks(rotation=90)
plt.legend(title='Rain events', loc='best')
plt.show()
I have a dataframe with columns of timestamp and energy usage. The timestamp is taken for every min of the day i.e., a total of 1440 readings for each day. I have few missing values in the data frame.
I want to impute those missing values with the mean of the same day, same time from the last two or three week. This way if the previous week is also missing, I can use the value for two weeks ago.
Here's a example of the data:
mains_1
timestamp
2013-01-03 00:00:00 155.00
2013-01-03 00:01:00 154.00
2013-01-03 00:02:00 NaN
2013-01-03 00:03:00 154.00
2013-01-03 00:04:00 153.00
... ...
2013-04-30 23:55:00 NaN
2013-04-30 23:56:00 182.00
2013-04-30 23:57:00 181.00
2013-04-30 23:58:00 182.00
2013-04-30 23:59:00 182.00
Right now I have this line of code:
df['mains_1'] = (df
.groupby((df.index.dayofweek * 24) + (df.index.hour) + (df.index.minute / 60))
.transform(lambda x: x.fillna(x.mean()))
)
So what this does is it uses the average of the usage from the same hour of the day on the whole dataset. I want it to be more precise and use the average of the last two or three weeks.
You can concat together the Series with shift in a loop, as the index alignment will ensure it's matching on the previous weeks with the same hour. Then take the mean and use .fillna to update the original
Sample Data
import pandas as pd
import numpy as np
np.random.seed(5)
df = pd.DataFrame(index=pd.date_range('2010-01-01 10:00:00', freq='W', periods=10),
data = np.random.choice([1,2,3,4, np.NaN], 10),
columns=['mains_1'])
# mains_1
#2010-01-03 10:00:00 4.0
#2010-01-10 10:00:00 1.0
#2010-01-17 10:00:00 2.0
#2010-01-24 10:00:00 1.0
#2010-01-31 10:00:00 NaN
#2010-02-07 10:00:00 4.0
#2010-02-14 10:00:00 1.0
#2010-02-21 10:00:00 1.0
#2010-02-28 10:00:00 NaN
#2010-03-07 10:00:00 2.0
Code
# range(4) for previous 3 weeks.
df1 = pd.concat([df.shift(periods=x, freq='W') for x in range(4)], axis=1)
# mains_1 mains_1 mains_1 mains_1
#2010-01-03 10:00:00 4.0 NaN NaN NaN
#2010-01-10 10:00:00 1.0 4.0 NaN NaN
#2010-01-17 10:00:00 2.0 1.0 4.0 NaN
#2010-01-24 10:00:00 1.0 2.0 1.0 4.0
#2010-01-31 10:00:00 NaN 1.0 2.0 1.0
#2010-02-07 10:00:00 4.0 NaN 1.0 2.0
#2010-02-14 10:00:00 1.0 4.0 NaN 1.0
#2010-02-21 10:00:00 1.0 1.0 4.0 NaN
#2010-02-28 10:00:00 NaN 1.0 1.0 4.0
#2010-03-07 10:00:00 2.0 NaN 1.0 1.0
#2010-03-14 10:00:00 NaN 2.0 NaN 1.0
#2010-03-21 10:00:00 NaN NaN 2.0 NaN
#2010-03-28 10:00:00 NaN NaN NaN 2.0
df['mains_1'] = df['mains_1'].fillna(df1.mean(axis=1))
print(df)
mains_1
2010-01-03 10:00:00 4.000000
2010-01-10 10:00:00 1.000000
2010-01-17 10:00:00 2.000000
2010-01-24 10:00:00 1.000000
2010-01-31 10:00:00 1.333333
2010-02-07 10:00:00 4.000000
2010-02-14 10:00:00 1.000000
2010-02-21 10:00:00 1.000000
2010-02-28 10:00:00 2.000000
2010-03-07 10:00:00 2.000000
If I had some random data created on a one hour sample..
import pandas as pd
import numpy as np
from numpy.random import randint
np.random.seed(10) # added for reproductibility
rng = pd.date_range('10/9/2018 00:00', periods=1000, freq='1H')
df = pd.DataFrame({'Random_Number':randint(1, 100, 1000)}, index=rng)
I can use the groupby to break out each day:
for idx, day in df.groupby(df.index.date):
print(day)
Now is there a way to calculate the time difference between daily min & max value based on the timestamp in hours? for each day record the daily min & max & time difference?
After some discussion (thanks #Erfan):
(df.Random_Number
.groupby(df.index.date)
.agg(['idxmin','idxmax'])
.diff(axis=1).iloc[:,1]
.div(pd.to_timedelta('1H'))
)
Output:
2018-10-09 -4.0
2018-10-10 -1.0
2018-10-11 -4.0
2018-10-12 12.0
2018-10-13 21.0
2018-10-14 6.0
2018-10-15 -6.0
2018-10-16 -18.0
2018-10-17 -8.0
2018-10-18 9.0
2018-10-19 -10.0
2018-10-20 3.0
2018-10-21 10.0
2018-10-22 2.0
2018-10-23 9.0
2018-10-24 2.0
2018-10-25 3.0
2018-10-26 2.0
2018-10-27 -22.0
2018-10-28 6.0
2018-10-29 -8.0
2018-10-30 -1.0
2018-10-31 -11.0
2018-11-01 19.0
2018-11-02 7.0
2018-11-03 4.0
2018-11-04 18.0
2018-11-05 -1.0
2018-11-06 15.0
2018-11-07 -14.0
2018-11-08 -16.0
2018-11-09 -2.0
2018-11-10 -7.0
2018-11-11 -14.0
2018-11-12 12.0
2018-11-13 -14.0
2018-11-14 2.0
2018-11-15 2.0
2018-11-16 6.0
2018-11-17 -7.0
2018-11-18 5.0
2018-11-19 9.0
Name: idxmax, dtype: float64
Alternatively, should you want to retain all columns with data frame output, consider merging on an aggregated dataset:
# ADJUST FOR DATETIME AND DATE AS COLUMNS
df = (df.reset_index()
.assign(date = df.index.date)
)
# AGGREGATION + MERGE ON MIN/MAX + CALCULATION
agg_df = (df.groupby('date')['Random_Number']
.agg(["min", "max"])
.reset_index()
.merge(df, left_on=['date', 'max'], right_on=['date', 'Random_Number'])
.merge(df, left_on=['date', 'min'], right_on=['date', 'Random_Number'],
suffixes=['', '_min'])
.assign(diff = lambda x: (x['index'] - x['index_min']) / pd.to_timedelta('1H'))
)
print(agg_df.head(10))
# date min max index Random_Number index_min Random_Number_min diff
# 0 2018-10-09 1 94 2018-10-09 05:00:00 94 2018-10-09 09:00:00 1 -4.0
# 1 2018-10-10 12 95 2018-10-10 20:00:00 95 2018-10-10 21:00:00 12 -1.0
# 2 2018-10-11 5 97 2018-10-11 15:00:00 97 2018-10-11 19:00:00 5 -4.0
# 3 2018-10-12 7 98 2018-10-12 18:00:00 98 2018-10-12 06:00:00 7 12.0
# 4 2018-10-13 1 91 2018-10-13 22:00:00 91 2018-10-13 01:00:00 1 21.0
# 5 2018-10-14 1 97 2018-10-14 10:00:00 97 2018-10-14 04:00:00 1 6.0
# 6 2018-10-15 9 97 2018-10-15 06:00:00 97 2018-10-15 12:00:00 9 -6.0
# 7 2018-10-16 3 95 2018-10-16 04:00:00 95 2018-10-16 22:00:00 3 -18.0
# 8 2018-10-17 2 95 2018-10-17 13:00:00 95 2018-10-17 21:00:00 2 -8.0
# 9 2018-10-18 1 91 2018-10-18 21:00:00 91 2018-10-18 12:00:00 1 9.0
In Python 3.6.3, I have the following dataframe df1:
dt Val
2017-04-10 08:00:00 8.0
2017-04-10 09:00:00 2.0
2017-04-10 10:00:00 7.0
2017-04-11 08:00:00 3.0
2017-04-11 09:00:00 0.0
2017-04-11 10:00:00 5.0
2017-11-26 08:00:00 8.0
2017-11-26 09:00:00 1.0
2017-11-26 10:00:00 2.0
I am trying to compute the hourly average of these values, so as to have:
Hour Val
08:00:00 7.00
09:00:00 1.00
10:00:00 4.66
My attempt:
df2 = df1.resample('H')['Val'].mean()
Returns the same dataset as df1. What am I doing wrong?
Inspired by the comments above, I tested that the following works for me:
df.groupby(df.index.hour).Val.mean()
Or you can make the index values 'timedelta' dtypes
df.Val.groupby(df.index.hour.astype('timedelta64[h]')).mean()
dt
08:00:00 6.333333
09:00:00 1.000000
10:00:00 4.666667
Name: Val, dtype: float64