So I have two data frames.
Energy: (100 columns)
Affluent Adversity Affluent Comfortable Adversity \
Time
2019-01-01 01:00:00 0.254 0.244 0.155 0.215 0.274
2019-01-01 02:00:00 0.346 0.154 0.083 0.246 0.046
2019-01-01 03:00:00 0.309 0.116 0.085 0.220 0.139
2019-01-01 04:00:00 0.302 0.158 0.083 0.226 0.186
2019-01-01 05:00:00 0.181 0.171 0.096 0.246 0.051
... ... ... ... ... ...
2019-12-31 20:00:00 1.102 0.263 2.157 0.209 2.856
2019-12-31 21:00:00 0.712 0.269 1.409 0.212 0.497
2019-12-31 22:00:00 0.398 0.274 0.073 0.277 0.199
2019-12-31 23:00:00 0.449 0.452 0.072 0.252 0.183
2020-01-01 00:00:00 0.466 0.291 0.110 0.203 0.117
loadshift: (1 column)
Time load_difference
2019-01-01 01:00:00 0.10
2019-01-01 02:00:00 0.10
2019-01-01 03:00:00 0.15
2019-01-01 04:00:00 0.10
2019-01-01 05:00:00 0.10
... ...
2019-12-31 20:00:00 -0.10
2019-12-31 21:00:00 0.10
2019-12-31 22:00:00 0.15
2019-12-31 23:00:00 0.10
2020-01-01 00:00:00 -0.10
All I want to do is add the load difference to Energy, so, for example, the first Affluent house at 1 am would change to 0.354. I have been able to use concat to multiply in my other models but am somehow really struggling with this.
Expected output (but for all 8760 hours):
Affluent Adversity Affluent Comfortable Adversity \
Time
2019-01-01 01:00:00 0.354 0.344 0.255 0.315 0.374
2019-01-01 02:00:00 0.446 0.254 0.183 0.346 0.146
2019-01-01 03:00:00 0.459 0.266 0.235 0.370 0.289
2019-01-01 04:00:00 0.402 0.258 0.183 0.326 0.286
2019-01-01 05:00:00 0.281 0.271 0.196 0.346 0.151
I have tried: Energy.add(loadshift, fill_value=0)
but I get
Concatenation operation is not implemented for NumPy arrays, use np.concatenate() instead. Please do not rely on this error; it may not be given on all Python implementations.
also tried:
df_merged = pd.concat([Energy,loadshift], ignore_index=True, sort=False)
df_merged =Energy.append(loadshift)
this prints:
InvalidIndexError: Reindexing only valid with uniquely valued Index objects
How do I go about fixing these errors? Thanks.
Try merge and add
# merge the two frames on the index, which is time in this case
df = loadshift.merge(Energy, left_index=True, right_index=True)
# add the load difference to all the columns
new = df[df.columns[1:]].add(df['load_difference'], axis=0)
Affluent Adversity Affluent.1 Comfortable Adversity.1
Time
2019-01-01 01:00:00 0.354 0.344 0.255 0.315 0.374
2019-01-01 02:00:00 0.446 0.254 0.183 0.346 0.146
2019-01-01 03:00:00 0.459 0.266 0.235 0.370 0.289
2019-01-01 04:00:00 0.402 0.258 0.183 0.326 0.286
2019-01-01 05:00:00 0.281 0.271 0.196 0.346 0.151
2019-12-31 20:00:00 1.002 0.163 2.057 0.109 2.756
2019-12-31 21:00:00 0.812 0.369 1.509 0.312 0.597
2019-12-31 22:00:00 0.548 0.424 0.223 0.427 0.349
2019-12-31 23:00:00 0.549 0.552 0.172 0.352 0.283
2020-01-01 00:00:00 0.366 0.191 0.010 0.103 0.017
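If the two frames already share the same DatetimeIndex, a shorter route (my sketch, not part of the answer above) is to skip the merge entirely: adding a whole DataFrame aligns on column names (and load_difference matches none of Energy's columns), whereas passing the column as a Series with axis=0 aligns on the index and broadcasts across every column:
# add the load_difference Series to every column of Energy, aligned on the Time index
new = Energy.add(loadshift['load_difference'], axis=0)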
I have a pandas data frame containing a large-ish set of hourly data points. For a few days, there are missing data (NaN). I want to interpolate values for the missing hourly data points by calculating the mean of the same time period on the prior and following day (I've done some analysis and believe this will be reasonable).
An example of the data is below:
datetime               value
2018-11-17 00:00:00     9.12
2018-11-17 01:00:00     8.94
2018-11-17 02:00:00     8.68
2018-11-17 03:00:00     8.19
2018-11-17 04:00:00     7.75
2018-11-17 05:00:00     7.35
2018-11-17 06:00:00     7.05
2018-11-17 07:00:00     6.55
2018-11-17 08:00:00     6.30
2018-11-17 09:00:00     6.28
2018-11-17 10:00:00     6.68
2018-11-17 11:00:00     7.64
2018-11-17 12:00:00     8.61
2018-11-17 13:00:00     9.44
2018-11-17 14:00:00     9.84
2018-11-17 15:00:00     9.62
2018-11-17 16:00:00     8.17
2018-11-17 17:00:00     6.16
2018-11-17 18:00:00     5.93
2018-11-17 19:00:00     5.36
2018-11-17 20:00:00     4.69
2018-11-17 21:00:00     4.36
2018-11-17 22:00:00     4.68
2018-11-17 23:00:00     4.86
2018-11-18 00:00:00      NaN
2018-11-18 01:00:00      NaN
2018-11-18 02:00:00      NaN
2018-11-18 03:00:00      NaN
2018-11-18 04:00:00      NaN
2018-11-18 05:00:00      NaN
2018-11-18 06:00:00      NaN
2018-11-18 07:00:00      NaN
2018-11-18 08:00:00      NaN
2018-11-18 09:00:00      NaN
2018-11-18 10:00:00      NaN
2018-11-18 11:00:00      NaN
2018-11-18 12:00:00      NaN
2018-11-18 13:00:00      NaN
2018-11-18 14:00:00      NaN
2018-11-18 15:00:00      NaN
2018-11-18 16:00:00      NaN
2018-11-18 17:00:00      NaN
2018-11-18 18:00:00      NaN
2018-11-18 19:00:00      NaN
2018-11-18 20:00:00      NaN
2018-11-18 21:00:00      NaN
2018-11-18 22:00:00      NaN
2018-11-18 23:00:00      NaN
2018-11-19 00:00:00     3.19
2018-11-19 01:00:00     2.60
2018-11-19 02:00:00     2.29
2018-11-19 03:00:00     1.97
2018-11-19 04:00:00     2.19
2018-11-19 05:00:00     3.09
2018-11-19 06:00:00     4.32
2018-11-19 07:00:00     4.87
2018-11-19 08:00:00     5.14
2018-11-19 09:00:00     5.55
2018-11-19 10:00:00     6.34
2018-11-19 11:00:00     7.43
2018-11-19 12:00:00     8.18
2018-11-19 13:00:00     8.53
2018-11-19 14:00:00     8.45
2018-11-19 15:00:00     7.94
2018-11-19 16:00:00     6.87
2018-11-19 17:00:00     5.56
2018-11-19 18:00:00     4.65
2018-11-19 19:00:00     4.18
2018-11-19 20:00:00     3.97
2018-11-19 21:00:00     3.98
2018-11-19 22:00:00     4.01
2018-11-19 23:00:00     4.00
So, for example, the desired output for 2018-11-18 00:00:00 would be the mean of 9.12 and 3.19 = 6.155, and so on for the other hours of the day on 2018-11-18.
Is there a simple way to do this in pandas? Ideally with a method that could be applied to a whole column (feature) within a data frame, rather than having to slice out some of the data, transform it, and then replace it (because honestly, it would be a lot quicker for me to do that in Excel!).
Thanks in advance for your help.
Try:
# make sure every hour is present in the datetime index
df = df.set_index("datetime").resample("1h").last()
# create a series of means averaging the values 24 hours before and after
means = df["value"].shift(24).add(df["value"].shift(-24)).mul(0.5)
# fill the NaN in df with the means
df["value"] = df["value"].combine_first(means)
>>> df.iloc[24:48]
value
datetime
2018-11-18 00:00:00 6.155
2018-11-18 01:00:00 5.770
2018-11-18 02:00:00 5.485
2018-11-18 03:00:00 5.080
2018-11-18 04:00:00 4.970
2018-11-18 05:00:00 5.220
2018-11-18 06:00:00 5.685
2018-11-18 07:00:00 5.710
2018-11-18 08:00:00 5.720
2018-11-18 09:00:00 5.915
2018-11-18 10:00:00 6.510
2018-11-18 11:00:00 7.535
2018-11-18 12:00:00 8.395
2018-11-18 13:00:00 8.985
2018-11-18 14:00:00 9.145
2018-11-18 15:00:00 8.780
2018-11-18 16:00:00 7.520
2018-11-18 17:00:00 5.860
2018-11-18 18:00:00 5.290
2018-11-18 19:00:00 4.770
2018-11-18 20:00:00 4.330
2018-11-18 21:00:00 4.170
2018-11-18 22:00:00 4.345
2018-11-18 23:00:00 4.430
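For what it's worth, here is a minimal self-contained check of the shift-and-average recipe, on hypothetical data (three days of hourly values with the middle day missing):
import numpy as np
import pandas as pd

# hypothetical series: three days of hourly data, the middle day all NaN
idx = pd.date_range("2018-11-17", periods=72, freq="h")
vals = np.concatenate([np.linspace(9.0, 4.0, 24),
                       np.full(24, np.nan),
                       np.linspace(3.0, 4.0, 24)])
df = pd.DataFrame({"value": vals}, index=idx)

# same recipe as above: average the values 24 hours before and after
means = df["value"].shift(24).add(df["value"].shift(-24)).mul(0.5)
df["value"] = df["value"].combine_first(means)
print(df["value"].isna().sum())  # 0 -- the gap is fully filled
Note this only fills hours where both the previous and the following day have data; a gap longer than one day would still leave NaN behind.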
I have this data in a CSV file:
Date/Time kWh kVArh kVA PF
0 2021-01-01 00:30:00 471.84 0.00 943.6800 1.0000
1 2021-01-01 01:00:00 491.04 1.44 982.0842 1.0000
2 2021-01-01 01:30:00 475.20 0.00 950.4000 1.0000
3 2021-01-01 02:00:00 470.88 0.00 941.7600 1.0000
4 2021-01-01 02:30:00 466.56 0.00 933.1200 1.0000
... ... ... ... ... ...
9223 2021-07-14 04:00:00 1104.00 53.28 2210.5698 0.9988
9224 2021-07-14 04:30:00 1156.30 49.92 2314.7542 0.9991
9225 2021-07-14 05:00:00 1176.00 37.92 2353.2224 0.9995
9226 2021-07-14 05:30:00 1177.00 27.36 2354.6359 0.9997
9227 2021-07-14 06:00:00 1196.60 22.56 2393.6253 0.9998
I use this code to read it, calculate the average for every hour, and export the result to a CSV file.
import pandas as pd
file = pd.read_csv('Electricity_data.csv',
                   sep=',',
                   skiprows=0,
                   dayfirst=True,
                   parse_dates=['Date/Time'])
pd_mean = file.groupby(pd.Grouper(key = 'Date/Time', freq = 'H')).mean().reset_index()
pd_mean.to_csv("data_1h_year_.csv")
However, when I run it, my final file has a gap.
Data before running the code (date: 03/01/2021):
Date/Time kWh kVArh kVA PF
90 2021-02-01 21:30:00 496.83 0.00 993.6600 1.0
91 2021-02-01 22:00:00 486.72 0.00 973.4400 1.0
92 2021-02-01 22:30:00 490.08 0.00 980.1600 1.0
93 2021-02-01 23:00:00 503.00 1.92 1006.0073 1.0
94 2021-02-01 23:30:00 484.84 0.00 969.6800 1.0
95 2021-03-01 00:00:00 484.80 0.00 969.6000 1.0
96 2021-03-01 00:30:00 487.68 0.00 975.3600 1.0
97 2021-03-01 01:00:00 508.30 1.44 1016.6041 1.0
98 2021-03-01 01:30:00 488.66 0.00 977.3200 1.0
99 2021-03-01 02:00:00 486.24 0.00 972.4800 1.0
100 2021-03-01 02:30:00 495.36 1.44 990.7242 1.0
101 2021-03-01 03:00:00 484.32 0.00 968.6400 1.0
102 2021-03-01 03:30:00 485.76 0.00 971.5200 1.0
103 2021-03-01 04:00:00 492.48 1.44 984.9642 1.0
104 2021-03-01 04:30:00 476.16 0.00 952.3200 1.0
105 2021-03-01 05:00:00 477.12 0.00 954.2400 1.0
Data after running the code (date: 03/01/2021):
Date/Time kWh kVArh kVA PF
45 2021-01-02 21:00:00 1658.650 292.32 3368.45000 0.98485
46 2021-01-02 22:00:00 1622.150 291.60 3296.34415 0.98420
47 2021-01-02 23:00:00 1619.300 261.36 3280.52380 0.98720
48 2021-01-03 00:00:00 NaN NaN NaN NaN
49 2021-01-03 01:00:00 NaN NaN NaN NaN
50 2021-01-03 02:00:00 NaN NaN NaN NaN
51 2021-01-03 03:00:00 NaN NaN NaN NaN
52 2021-01-03 04:00:00 NaN NaN NaN NaN
53 2021-01-03 05:00:00 NaN NaN NaN NaN
54 2021-01-03 06:00:00 1202.400 158.40 2425.57730 0.99140
55 2021-01-03 07:00:00 1209.375 168.00 2441.98105 0.99050
56 2021-01-03 08:00:00 1260.950 162.72 2542.89820 0.99175
57 2021-01-03 09:00:00 1308.975 195.60 2647.07935 0.98900
58 2021-01-03 10:00:00 1334.150 193.20 2696.17005 0.98965
I do not know why this is happening: the mean values were not calculated, and NaN gaps appear throughout the final CSV file.
Pandas is not interpreting your dates correctly. Specify the format yourself.
Use the code below to solve your problem:
parser = lambda x: pd.to_datetime(x, format='%m/%d/%Y %H:%M')
df = pd.read_csv('data.csv', sep=',', skiprows=0,
                 parse_dates=['Date/Time'], date_parser=parser)
pd_mean = df.groupby(pd.Grouper(key='Date/Time', freq='H')).mean()
Check your dates before the operation:
93 2021-02-01 23:00:00 # February, 1st
94 2021-02-01 23:30:00 # February, 1st
95 2021-03-01 00:00:00 # March, 1st
96 2021-03-01 00:30:00 # March, 1st
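For what it's worth, date_parser is deprecated since pandas 2.0; assuming a recent pandas, the same fix can be expressed with the date_format argument instead:
# equivalent parse on pandas >= 2.0, where date_parser is deprecated
df = pd.read_csv('data.csv', sep=',', skiprows=0,
                 parse_dates=['Date/Time'],
                 date_format='%m/%d/%Y %H:%M')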
I am trying to use 2 timestamps from a data frame to extract (slice) a subset from a different dataframe where those timestamps are included as indexes. Here is what I've tried.
In [152]: wt_filtered.head()
Out[152]:
start_time end_time tot_PR
81 2013-07-13 05:00:00 2013-07-13 07:00:00 0.015
164 2013-10-31 19:00:00 2013-10-31 21:00:00 0.030
234 2013-12-09 16:00:00 2013-12-09 18:00:00 0.015
295 2014-01-11 02:00:00 2014-01-11 07:00:00 0.060
325 2014-02-05 17:00:00 2014-02-05 19:00:00 0.015
And this is my second dataframe:
In [160]: my_df.head()
Out[160]:
ValidLevel ValidVelocity ... GW_Level PR
DateTime ...
2013-06-07 00:00:00 2.07 0.91 ... 444.74 0.000
2013-06-07 01:00:00 2.01 0.46 ... 444.74 0.010
2013-06-07 02:00:00 1.82 0.54 ... 444.74 0.005
2013-06-07 03:00:00 1.98 0.68 ... 444.74 0.005
2013-06-07 04:00:00 1.92 0.59 ... 444.74 0.015
I want to extract data from the second data frame using the start_time and end_time as parameters in the slicing process.
I tried this:
my_df[wt_filtered.start_time[0]:wt_filtered.end_time[0]]
This did not work. Maybe I am overlooking something, but I cannot find the answer.
I want to get something like this:
DateTime ValidLevel ValidVelocity ... GW_Level PR
2013-07-13 05:00:00 1.24 0.99 ... 445.06 0.005
2013-07-13 06:00:00 1.29 1.51 ... 445.08 0.005
2013-07-13 07:00:00 1.57 1.44 ... 445.11 0.005
A good practice for this situation is to access the data with the .loc and .iloc methods. Your attempt failed because wt_filtered's index starts at 81, so wt_filtered.start_time[0] looks up the label 0 and raises a KeyError; .iloc[0] selects the first row by position instead:
my_df.loc[wt_filtered.start_time.iloc[0]:wt_filtered.end_time.iloc[0], :]
Hope it is useful.
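If you need one slice per event in wt_filtered rather than only the first row, here is a small extension of the same idea (my sketch, assuming you want the slices stacked into a single frame):
import pandas as pd

# collect one slice of my_df per (start_time, end_time) pair and stack them
slices = pd.concat(
    my_df.loc[row.start_time:row.end_time]
    for row in wt_filtered.itertuples()
)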
I am trying to find the gaps in a time series (30-minute interval) and fill them with NaN.
I am following the instructions in this post, but the code is not performing well in my case:
Find gaps in pandas time series dataframe sampled at 1 minute intervals and fill the gaps with new rows
import pandas as pd
from datetime import datetime
df = pd.read_csv('example1.csv')
df.head(5)
Date & Time KW
0 8/27/2019 23:30 0.016
1 8/27/2019 23:00 0
2 8/27/2019 22:30 0.016
3 8/27/2019 22:00 0.016
4 8/27/2019 21:30 0
df['Date & Time'] = pd.to_datetime(df['Date & Time'], format='%m/%d/%Y %H:%M')
df = df.set_index('Date & Time').asfreq('30M')
df.head()
KW
Date & Time
As you can see, the output is blank; I expected to see values.
I am not sure what I am doing wrong.
'M' is the month-end alias, and your 'Date & Time' column is decreasing. You need to use 'T' (minutes) with a negative value.
Try it on this sample:
df:
Date & Time KW
0 8/27/2019 23:30 0.016
1 8/27/2019 23:00 0.000
2 8/27/2019 22:30 0.016
3 8/27/2019 22:00 0.016
4 8/27/2019 21:30 0.000
5 8/27/2019 19:30 0.000
df.set_index('Date & Time').asfreq('-30T')
Out[412]:
KW
Date & Time
2019-08-27 23:30:00 0.016
2019-08-27 23:00:00 0.000
2019-08-27 22:30:00 0.016
2019-08-27 22:00:00 0.016
2019-08-27 21:30:00 0.000
2019-08-27 21:00:00 NaN
2019-08-27 20:30:00 NaN
2019-08-27 20:00:00 NaN
2019-08-27 19:30:00 0.000
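Alternatively, if you would rather keep the index in the conventional ascending order, a small variation on the above is to sort first and use a positive frequency:
# sort the timestamps ascending, then insert the missing 30-minute slots as NaN
df.set_index('Date & Time').sort_index().asfreq('30T')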
One year's data is shown below:
datetime data
2008-01-01 00:00:00 0.044
2008-01-01 00:30:00 0.031
2008-01-01 01:00:00 -0.25
.....
2008-01-31 23:00:00 0.036
2008-01-31 23:30:00 0.42
2008-01-02 00:00:00 0.078
2008-01-02 00:30:00 0.008
2008-01-02 01:00:00 0.09
2008-01-02 01:30:00 0.054
.....
2008-12-31 22:00:00 0.55
2008-12-31 22:30:00 0.05
2008-12-31 23:00:00 0.08
2008-12-31 23:30:00 0.033
There is one value per half-hour. I want the sum of all values for each day, i.e. convert the data to 365 rows (366 for a leap year like 2008), one per day:
year day sum values
2008 1 *
2008 2 *
...
2008 364 *
2008 365 *
You can use dt.year + dt.dayofyear with groupby and aggregate sum:
df = df.groupby([df['datetime'].dt.year, df['datetime'].dt.dayofyear]).sum()
print (df)
data
datetime datetime
2008 1 -0.175
2 0.230
31 0.456
366 0.713
And if you need a DataFrame, you can convert the index to columns and set the column names with rename_axis + reset_index:
df = (df.groupby([df['datetime'].dt.year, df['datetime'].dt.dayofyear])['data']
        .sum()
        .rename_axis(('year', 'dayofyear'))
        .reset_index())
print (df)
year dayofyear data
0 2008 1 -0.175
1 2008 2 0.230
2 2008 31 0.456
3 2008 366 0.713
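A possible alternative (my sketch, not from the answer above): if datetime is a proper datetime column, resampling by calendar day gives the same daily sums, keyed by date rather than by (year, dayofyear). Note that days with no rows at all would appear as 0 rather than being absent:
# daily totals indexed by calendar date
daily = df.set_index('datetime')['data'].resample('D').sum()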