I am trying to find the gaps in a time series (30-minute interval) and fill them with NaN.
I am following the instructions from this post, but the code is not working in my case:
Find gaps in pandas time series dataframe sampled at 1 minute intervals and fill the gaps with new rows
import pandas as pd
from datetime import datetime
df = pd.read_csv('example1.csv')
df.head(5)
Date & Time KW
0 8/27/2019 23:30 0.016
1 8/27/2019 23:00 0
2 8/27/2019 22:30 0.016
3 8/27/2019 22:00 0.016
4 8/27/2019 21:30 0
df['Date & Time'] = pd.to_datetime(df['Date & Time'], format='%m/%d/%Y %H:%M')
df = df.set_index('Date & Time').asfreq('30M')
df.head()
KW
Date & Time
As you can see, the output is blank, but I expected to see values.
I am not sure what I am doing wrong.
M means month-end frequency, and your 'Date & Time' column is decreasing. You need to use T (minutes) with a negative value.
Try it on this sample:
df:
Date & Time KW
0 8/27/2019 23:30 0.016
1 8/27/2019 23:00 0.000
2 8/27/2019 22:30 0.016
3 8/27/2019 22:00 0.016
4 8/27/2019 21:30 0.000
5 8/27/2019 19:30 0.000
df.set_index('Date & Time').asfreq('-30T')
Out[412]:
KW
Date & Time
2019-08-27 23:30:00 0.016
2019-08-27 23:00:00 0.000
2019-08-27 22:30:00 0.016
2019-08-27 22:00:00 0.016
2019-08-27 21:30:00 0.000
2019-08-27 21:00:00 NaN
2019-08-27 20:30:00 NaN
2019-08-27 20:00:00 NaN
2019-08-27 19:30:00 0.000
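If you prefer the filled series in ascending order, you can also sort the index first and use a positive frequency; note that recent pandas spells minutes 'min' rather than 'T'. A minimal sketch of that variant on the same sample data:
import pandas as pd
df = pd.DataFrame({
    'Date & Time': ['8/27/2019 23:30', '8/27/2019 23:00', '8/27/2019 22:30',
                    '8/27/2019 22:00', '8/27/2019 21:30', '8/27/2019 19:30'],
    'KW': [0.016, 0.0, 0.016, 0.016, 0.0, 0.0],
})
df['Date & Time'] = pd.to_datetime(df['Date & Time'], format='%m/%d/%Y %H:%M')
# sort ascending, then insert a NaN row for every missing 30-minute slot
out = df.set_index('Date & Time').sort_index().asfreq('30T')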
So I have two data frames.
Energy: (100 columns)
Affluent Adversity Affluent Comfortable Adversity \
Time
2019-01-01 01:00:00 0.254 0.244 0.155 0.215 0.274
2019-01-01 02:00:00 0.346 0.154 0.083 0.246 0.046
2019-01-01 03:00:00 0.309 0.116 0.085 0.220 0.139
2019-01-01 04:00:00 0.302 0.158 0.083 0.226 0.186
2019-01-01 05:00:00 0.181 0.171 0.096 0.246 0.051
... ... ... ... ... ...
2019-12-31 20:00:00 1.102 0.263 2.157 0.209 2.856
2019-12-31 21:00:00 0.712 0.269 1.409 0.212 0.497
2019-12-31 22:00:00 0.398 0.274 0.073 0.277 0.199
2019-12-31 23:00:00 0.449 0.452 0.072 0.252 0.183
2020-01-01 00:00:00 0.466 0.291 0.110 0.203 0.117
loadshift: (1 column)
Time load_difference
2019-01-01 01:00:00 0.10
2019-01-01 02:00:00 0.10
2019-01-01 03:00:00 0.15
2019-01-01 04:00:00 0.10
2019-01-01 05:00:00 0.10
... ...
2019-12-31 20:00:00 -0.10
2019-12-31 21:00:00 0.10
2019-12-31 22:00:00 0.15
2019-12-31 23:00:00 0.10
2020-01-01 00:00:00 -0.10
All I want to do is add the load difference to the Energy frame, so for example the first Affluent house at 1 AM would change to 0.354. I have been able to use concat to multiply in my other models, but I am really struggling with this one.
Expected output (but for all 8760 hours):
Affluent Adversity Affluent Comfortable Adversity \
Time
2019-01-01 01:00:00 0.354 0.344 0.255 0.315 0.374
2019-01-01 02:00:00 0.446 0.254 0.183 0.346 0.146
2019-01-01 03:00:00 0.459 0.266 0.235 0.370 0.289
2019-01-01 04:00:00 0.402 0.258 0.183 0.326 0.286
2019-01-01 05:00:00 0.281 0.271 0.196 0.346 0.151
I have tried: Energy.add(loadshift, fill_value=0)
but I get
Concatenation operation is not implemented for NumPy arrays, use np.concatenate() instead. Please do not rely on this error; it may not be given on all Python implementations.
I also tried:
df_merged = pd.concat([Energy, loadshift], ignore_index=True, sort=False)
df_merged = Energy.append(loadshift)
which raises:
InvalidIndexError: Reindexing only valid with uniquely valued Index objects
How do I go about fixing these errors? Thanks.
Try merge and add:
# merge the two frames on the index, which is time in this case
df = loadshift.merge(Energy, left_index=True, right_index=True)
# add the load difference to all the energy columns
new = df[df.columns[1:]].add(df['load_difference'], axis=0)
Affluent Adversity Affluent.1 Comfortable Adversity.1
Time
2019-01-01 01:00:00 0.354 0.344 0.255 0.315 0.374
2019-01-01 02:00:00 0.446 0.254 0.183 0.346 0.146
2019-01-01 03:00:00 0.459 0.266 0.235 0.370 0.289
2019-01-01 04:00:00 0.402 0.258 0.183 0.326 0.286
2019-01-01 05:00:00 0.281 0.271 0.196 0.346 0.151
2019-12-31 20:00:00 1.002 0.163 2.057 0.109 2.756
2019-12-31 21:00:00 0.812 0.369 1.509 0.312 0.597
2019-12-31 22:00:00 0.548 0.424 0.223 0.427 0.349
2019-12-31 23:00:00 0.549 0.552 0.172 0.352 0.283
2020-01-01 00:00:00 0.366 0.191 0.010 0.103 0.017
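For what it's worth, the original Energy.add(loadshift, fill_value=0) fails because loadshift is a one-column DataFrame, so pandas aligns on column labels instead of broadcasting the values down the rows. A merge-free sketch that selects the Series and adds it with axis=0 (toy column names, and it assumes the two indexes already line up):
import numpy as np
import pandas as pd
idx = pd.date_range('2019-01-01 01:00', periods=3, freq='H')
energy = pd.DataFrame(np.array([[0.254, 0.244],
                                [0.346, 0.154],
                                [0.309, 0.116]]),
                      index=idx, columns=['Affluent', 'Adversity'])
loadshift = pd.DataFrame({'load_difference': [0.10, 0.10, 0.15]}, index=idx)
# pick the Series out of the one-column frame and broadcast it row-wise
shifted = energy.add(loadshift['load_difference'], axis=0)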
I have this data in a CSV file:
Date/Time kWh kVArh kVA PF
0 2021-01-01 00:30:00 471.84 0.00 943.6800 1.0000
1 2021-01-01 01:00:00 491.04 1.44 982.0842 1.0000
2 2021-01-01 01:30:00 475.20 0.00 950.4000 1.0000
3 2021-01-01 02:00:00 470.88 0.00 941.7600 1.0000
4 2021-01-01 02:30:00 466.56 0.00 933.1200 1.0000
... ... ... ... ... ...
9223 2021-07-14 04:00:00 1104.00 53.28 2210.5698 0.9988
9224 2021-07-14 04:30:00 1156.30 49.92 2314.7542 0.9991
9225 2021-07-14 05:00:00 1176.00 37.92 2353.2224 0.9995
9226 2021-07-14 05:30:00 1177.00 27.36 2354.6359 0.9997
9227 2021-07-14 06:00:00 1196.60 22.56 2393.6253 0.9998
And I use this code to read it, calculate the average for every hour, and export the result to a CSV file.
import pandas as pd
file = pd.read_csv('Electricity_data.csv',
                   sep=',',
                   skiprows=0,
                   dayfirst=True,
                   parse_dates=['Date/Time'])
pd_mean = file.groupby(pd.Grouper(key='Date/Time', freq='H')).mean().reset_index()
pd_mean.to_csv("data_1h_year_.csv")
However, when I run it, the final file has gaps.
Data before the code runs (date: 03/01/2021):
Date/Time kWh kVArh kVA PF
90 2021-02-01 21:30:00 496.83 0.00 993.6600 1.0
91 2021-02-01 22:00:00 486.72 0.00 973.4400 1.0
92 2021-02-01 22:30:00 490.08 0.00 980.1600 1.0
93 2021-02-01 23:00:00 503.00 1.92 1006.0073 1.0
94 2021-02-01 23:30:00 484.84 0.00 969.6800 1.0
95 2021-03-01 00:00:00 484.80 0.00 969.6000 1.0
96 2021-03-01 00:30:00 487.68 0.00 975.3600 1.0
97 2021-03-01 01:00:00 508.30 1.44 1016.6041 1.0
98 2021-03-01 01:30:00 488.66 0.00 977.3200 1.0
99 2021-03-01 02:00:00 486.24 0.00 972.4800 1.0
100 2021-03-01 02:30:00 495.36 1.44 990.7242 1.0
101 2021-03-01 03:00:00 484.32 0.00 968.6400 1.0
102 2021-03-01 03:30:00 485.76 0.00 971.5200 1.0
103 2021-03-01 04:00:00 492.48 1.44 984.9642 1.0
104 2021-03-01 04:30:00 476.16 0.00 952.3200 1.0
105 2021-03-01 05:00:00 477.12 0.00 954.2400 1.0
Data after the code runs (date: 03/01/2021):
Date/Time kWh kVArh kVA PF
45 2021-01-02 21:00:00 1658.650 292.32 3368.45000 0.98485
46 2021-01-02 22:00:00 1622.150 291.60 3296.34415 0.98420
47 2021-01-02 23:00:00 1619.300 261.36 3280.52380 0.98720
48 2021-01-03 00:00:00 NaN NaN NaN NaN
49 2021-01-03 01:00:00 NaN NaN NaN NaN
50 2021-01-03 02:00:00 NaN NaN NaN NaN
51 2021-01-03 03:00:00 NaN NaN NaN NaN
52 2021-01-03 04:00:00 NaN NaN NaN NaN
53 2021-01-03 05:00:00 NaN NaN NaN NaN
54 2021-01-03 06:00:00 1202.400 158.40 2425.57730 0.99140
55 2021-01-03 07:00:00 1209.375 168.00 2441.98105 0.99050
56 2021-01-03 08:00:00 1260.950 162.72 2542.89820 0.99175
57 2021-01-03 09:00:00 1308.975 195.60 2647.07935 0.98900
58 2021-01-03 10:00:00 1334.150 193.20 2696.17005 0.98965
I do not know why this is happening, but the hourly means were not calculated and the final CSV file is full of NaN gaps.
Pandas is not interpreting your dates correctly. Specify the format yourself.
Use the code below to solve your problem:
parser = lambda x: pd.to_datetime(x, format='%m/%d/%Y %H:%M')
df = pd.read_csv('data.csv', sep=',', skiprows=0,
                 parse_dates=['Date/Time'], date_parser=parser)
pd_mean = df.groupby(pd.Grouper(key='Date/Time', freq='H')).mean()
Check your dates before the operation:
93 2021-02-01 23:00:00 # February, 1st
94 2021-02-01 23:30:00 # February, 1st
95 2021-03-01 00:00:00 # March, 1st
96 2021-03-01 00:30:00 # March, 1st
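Note that date_parser is deprecated in pandas 2.0+. On recent versions, a sketch of the same fix using the date_format argument of read_csv (same assumed file and column names):
import pandas as pd
df = pd.read_csv('data.csv', sep=',', skiprows=0,
                 parse_dates=['Date/Time'], date_format='%m/%d/%Y %H:%M')
pd_mean = df.groupby(pd.Grouper(key='Date/Time', freq='H')).mean()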
I am trying to use two timestamps from a DataFrame to extract (slice) a subset of a different DataFrame where those timestamps are included in the index. Here is what I've tried.
In [152]: wt_filtered.head()
Out[152]:
start_time end_time tot_PR
81 2013-07-13 05:00:00 2013-07-13 07:00:00 0.015
164 2013-10-31 19:00:00 2013-10-31 21:00:00 0.030
234 2013-12-09 16:00:00 2013-12-09 18:00:00 0.015
295 2014-01-11 02:00:00 2014-01-11 07:00:00 0.060
325 2014-02-05 17:00:00 2014-02-05 19:00:00 0.015
And this is my second dataframe:
In [160]: my_df.head()
Out[160]:
ValidLevel ValidVelocity ... GW_Level PR
DateTime ...
2013-06-07 00:00:00 2.07 0.91 ... 444.74 0.000
2013-06-07 01:00:00 2.01 0.46 ... 444.74 0.010
2013-06-07 02:00:00 1.82 0.54 ... 444.74 0.005
2013-06-07 03:00:00 1.98 0.68 ... 444.74 0.005
2013-06-07 04:00:00 1.92 0.59 ... 444.74 0.015
I want to extract data from the second DataFrame using start_time and end_time as the slicing bounds.
I tried this:
my_df[wt_filtered.start_time[0]:wt_filtered.end_time[0]]
and it did not work. Maybe I am overlooking something, but I cannot find the answer.
I want to get something like this:
DateTime ValidLevel ValidVelocity ... GW_Level PR
2013-07-13 05:00:00 1.24 0.99 ... 445.06 0.005
2013-07-13 06:00:00 1.29 1.51 ... 445.08 0.005
2013-07-13 07:00:00 1.57 1.44 ... 445.11 0.005
A good practice in this situation is to access the data with the .loc and .iloc methods. Your attempt failed because wt_filtered.start_time[0] looks up the label 0, and the first label in wt_filtered is 81; .iloc[0] selects by position instead:
my_df.loc[wt_filtered.start_time.iloc[0]:wt_filtered.end_time.iloc[0], :]
Hope it is useful.
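If you need this for every row of wt_filtered rather than just the first, a small sketch that loops over the (start_time, end_time) pairs and collects one slice per event (toy data standing in for the real frames):
import pandas as pd
my_df = pd.DataFrame(
    {'PR': [0.000, 0.005, 0.005, 0.005]},
    index=pd.date_range('2013-07-13 04:00', periods=4, freq='H'))
wt_filtered = pd.DataFrame(
    {'start_time': [pd.Timestamp('2013-07-13 05:00')],
     'end_time': [pd.Timestamp('2013-07-13 07:00')]},
    index=[81])
# .loc slicing on a sorted DatetimeIndex includes both endpoints
events = [my_df.loc[row.start_time:row.end_time]
          for row in wt_filtered.itertuples()]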
One year's data is shown below:
datetime data
2008-01-01 00:00:00 0.044
2008-01-01 00:30:00 0.031
2008-01-01 01:00:00 -0.25
.....
2008-01-31 23:00:00 0.036
2008-01-31 23:30:00 0.42
2008-01-02 00:00:00 0.078
2008-01-02 00:30:00 0.008
2008-01-02 01:00:00 0.09
2008-01-02 01:30:00 0.054
.....
2008-12-31 22:00:00 0.55
2008-12-31 22:30:00 0.05
2008-12-31 23:00:00 0.08
2008-12-31 23:30:00 0.033
There is a value per half-hour. I want the sum of all values in each day, so the year should collapse to 365 daily rows:
year  day  sum
2008    1    *
2008    2    *
...
2008  364    *
2008  365    *
You can use dt.year + dt.dayofyear with groupby and aggregate sum:
df = df.groupby([df['datetime'].dt.year, df['datetime'].dt.dayofyear]).sum()
print (df)
data
datetime datetime
2008 1 -0.175
2 0.230
31 0.456
366 0.713
And if you need a DataFrame, you can convert the MultiIndex to columns and set the column names with rename_axis + reset_index:
df = (df.groupby([df['datetime'].dt.year, df['datetime'].dt.dayofyear])['data']
        .sum()
        .rename_axis(('year','dayofyear'))
        .reset_index())
print (df)
year dayofyear data
0 2008 1 -0.175
1 2008 2 0.230
2 2008 31 0.456
3 2008 366 0.713
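An equivalent route, if you prefer calendar-based tools, is to set the datetime column as the index and resample by day; a short sketch (unlike groupby, resample also emits rows for calendar days with no data):
import pandas as pd
df = pd.DataFrame({
    'datetime': pd.to_datetime(['2008-01-01 00:00:00', '2008-01-01 00:30:00',
                                '2008-01-01 01:00:00', '2008-01-02 00:00:00']),
    'data': [0.044, 0.031, -0.25, 0.078],
})
daily = df.set_index('datetime')['data'].resample('D').sum()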
I have a csv which looks like this:
Date,Sentiment
2014-01-03,0.4
2014-01-04,-0.03
2014-01-09,0.0
2014-01-10,0.07
2014-01-12,0.0
2014-02-24,0.0
2014-02-25,0.0
2014-02-25,0.0
2014-02-26,0.0
2014-02-28,0.0
2014-03-01,0.1
2014-03-02,-0.5
2014-03-03,0.0
2014-03-08,-0.06
2014-03-11,-0.13
2014-03-22,0.0
2014-03-23,0.33
2014-03-23,0.3
2014-03-25,-0.14
2014-03-28,-0.25
etc
And my goal is to aggregate the dates by month and calculate each month's average. The dates might not start on the 1st or in January. The problem is that I have a lot of data spanning several years, so I would like to find the earliest date (month) and, starting from there, count the months and compute their averages. For example:
Month count, average
1, 0.4 (<= the earliest month)
2, -0.3
3, 0.0
...
12, 0.1
13, -0.4 (<= new year but counting of month is continuing)
14, 0.3
I'm using Pandas to open the CSV:
data = pd.read_csv("pks.csv", sep=",")
so in data['Date'] I have the dates and in data['Sentiment'] the values. Any idea how to do it?
Probably the simplest approach is to use the resample method. First, when you read in your data, make sure you parse the dates and set the date column as your index (ignore the StringIO part; I am just reading your sample data from a multi-line string):
>>> from io import StringIO
>>> df = pd.read_csv(StringIO(data), parse_dates=['Date'],
...                  index_col='Date')
>>> df
Sentiment
Date
2014-01-03 0.40
2014-01-04 -0.03
2014-01-09 0.00
2014-01-10 0.07
2014-01-12 0.00
2014-02-24 0.00
2014-02-25 0.00
2014-02-25 0.00
2014-02-26 0.00
2014-02-28 0.00
2014-03-01 0.10
2014-03-02 -0.50
2014-03-03 0.00
2014-03-08 -0.06
2014-03-11 -0.13
2014-03-22 0.00
2014-03-23 0.33
2014-03-23 0.30
2014-03-25 -0.14
2014-03-28 -0.25
>>> df.resample('M').mean()
Sentiment
2014-01-31 0.088
2014-02-28 0.000
2014-03-31 -0.035
And if you want a month counter, you can add it after your resample:
>>> agg = df.resample('M').mean()
>>> agg['cnt'] = range(len(agg))
>>> agg
Sentiment cnt
2014-01-31 0.088 0
2014-02-28 0.000 1
2014-03-31 -0.035 2
You can also do this with the groupby method and pd.Grouper (group by month and then call the mean convenience method that is available with groupby).
>>> df.groupby(pd.Grouper(freq='M')).mean()
Sentiment
2014-01-31 0.088
2014-02-28 0.000
2014-03-31 -0.035
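The range(len(agg)) counter numbers whatever rows the resample produces, which covers every calendar month in the span. If you later drop empty months (say with .dropna()) and still want the counter to reflect calendar distance from the earliest month, a sketch using period arithmetic instead:
>>> agg = df.resample('M').mean().dropna()
>>> p = agg.index.to_period('M')
>>> # months elapsed since the earliest month, starting at 1
>>> agg['cnt'] = (p.year - p[0].year) * 12 + (p.month - p[0].month) + 1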
To get the monthly average values of a DataFrame that has daily 'Sentiment' rows, I would:
Convert the column with the dates, df['date'], into the index of the DataFrame: df.set_index('date', inplace=True)
Then convert the index dates into month numbers: df.index.month
Finally, calculate the mean of the DataFrame grouped by month: df.groupby(df.index.month).Sentiment.mean()
I'll go slowly through each step here:
Generating a DataFrame with dates and values
First import pandas and NumPy, as well as the datetime module:
import numpy as np
import pandas as pd
from datetime import datetime
Generate a column 'date' between 1/1/2018 and 3/05/2018, at weekly ('W') intervals, and a column 'Sentiment' with random values between 0 and 100:
date_rng = pd.date_range(start='1/1/2018', end='3/05/2018', freq='W')
df = pd.DataFrame(date_rng, columns=['date'])
df['Sentiment'] = np.random.randint(0, 100, size=(len(date_rng)))
The df now has two columns, 'date' and 'Sentiment':
date Sentiment
0 2018-01-07 34
1 2018-01-14 32
2 2018-01-21 15
3 2018-01-28 0
4 2018-02-04 95
5 2018-02-11 53
6 2018-02-18 7
7 2018-02-25 35
8 2018-03-04 17
Set the 'date' column as the index of the DataFrame:
df.set_index('date', inplace=True)
df has one column 'Sentiment' and the index is 'date':
Sentiment
date
2018-01-07 34
2018-01-14 32
2018-01-21 15
2018-01-28 0
2018-02-04 95
2018-02-11 53
2018-02-18 7
2018-02-25 35
2018-03-04 17
Capture the month number from the index:
months = df.index.month
Obtain the mean value of each month by grouping by month:
monthly_avg = df.groupby(months).Sentiment.mean()
The mean of the dataset by month, monthly_avg, is:
date
1 20.25
2 47.50
3 17.00
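Putting the three steps together, a compact end-to-end sketch (same assumed column names, with a fixed seed so the toy numbers are reproducible):
import numpy as np
import pandas as pd
np.random.seed(0)  # reproducible toy data
date_rng = pd.date_range(start='1/1/2018', end='3/05/2018', freq='W')
df = pd.DataFrame({'date': date_rng,
                   'Sentiment': np.random.randint(0, 100, size=len(date_rng))})
df = df.set_index('date')
monthly_avg = df.groupby(df.index.month).Sentiment.mean()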