I am trying to group my dataset into 15-minute intervals instead of hours. I currently have my yearly data grouped by hour of the day (0 to 23), but in order to have more data points I would like to do this for every 15 minutes, so that I end up with 00:00, 00:15, ..., 23:45.
This is the first part of my initial dataframe merged:
Price Afnemen Invoeden ... Temperature Precipation NWSE
StartTime ...
2018-06-13 00:00:00 42.30 34.02 34.02 ... 13.60 0.0 N
2018-06-13 00:15:00 42.30 42.57 42.57 ... 13.60 0.0 N
2018-06-13 00:30:00 42.30 42.02 42.02 ... 13.60 0.0 N
2018-06-13 00:45:00 42.30 46.09 46.09 ... 13.60 0.0 N
With this line:
merged = merged.groupby(merged.index.hour).mean()
I get the hourly means:
StartTime Price Afnemen ... Windspeed Temperature Precipation
0 47.163836 47.910985 ... 3.508562 9.591096 0.045890
1 44.473082 46.274221 ... 3.500000 9.265582 0.041438
2 42.862123 43.309392 ... 3.445205 8.974658 0.060959
However, I would like to get something like:
StartTime Price Afnemen ... Windspeed Temperature Precipation
00:00 (Some value here)
00:15
...
23:45
I thought about using merged.groupby([merged.index.hour, merged.index.minute]).mean()
But in this way I would get two index columns. This is not desirable, as the final goal is to plot the data points.
I hope this question is clear, and thanks in advance!
Assuming your index is a DatetimeIndex, you can use pd.Grouper:
agg_15m = df.groupby(pd.Grouper(freq='15Min')).mean()
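Note that pd.Grouper(freq='15Min') aggregates every calendar 15-minute interval. If the goal is instead one mean per 15-minute slot of the day across the whole year (00:00, 00:15, ..., 23:45, as described in the question), a minimal sketch, assuming the DatetimeIndex shown above, is to group by the index's time of day:
# one row per time of day; the index holds datetime.time values 00:00, 00:15, ..., 23:45
per_slot = merged.groupby(merged.index.time).mean()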
I have a .csv that I grabbed online here: https://marketplace.spp.org/file-browser-api/download/generation-mix-historical?path=%2FGenMix_2017.csv
The first column is date/time and is broken into 5-minute intervals (military time). I need to ensure that the dates are only for '2017', as there is some data at the end of the .csv from 2018. I want to be able to capture all the data, but only in one-hour increments.
For example in that .csv, this would be:
2017-01-01T06:00:00Z to 2017-01-01T06:55:00Z, which is 12 rows.
2017-01-01 is the only day that starts at 6:00:00; all the others start at 0:00:00.
I was thinking that I might be able to iterate over ONLY the '2017' data in 12-row increments to get the hour blocks of time, and then reset once it has run 12*24 times, but I am not sure how to do this.
I am also not sure if that would be a good idea in terms of future use cases; the times may change, or there may be missing data. I am trying to ensure that this won't break if it is used in a few years. It's probably safe to say the company producing this data will continue to produce it the same way, but I guess you never know.
Here is what I have so far:
# puts api call into a pandas dataframe
energy_data = pd.read_csv('https://marketplace.spp.org/file-browser-api/download/generation-mix-historical?path=%2FGenMix_2017.csv')
# converts the date/time field to proper datetime64 format
energy_data['GMT MKT Interval'] = pd.to_datetime(energy_data['GMT MKT Interval'])
# ensures that the entire dataframe can be treated as time series data
energy_data.set_index('GMT MKT Interval', inplace = True)
Use resample with sum:
df = pd.read_csv('GenMix_2017.csv', parse_dates=['GMT MKT Interval'],
index_col='GMT MKT Interval')
out = df.resample('H').sum()
Output:
>>> out
Coal Market Coal Self Diesel Fuel Oil Hydro Natural Gas ... Waste Disposal Services Wind Waste Heat Other Average Actual Load
GMT MKT Interval ...
2017-01-01 06:00:00+00:00 34104.7 159041.4 0.0 3220.5 35138.3 ... 113.8 57517.0 0 431.3 303688.602
2017-01-01 07:00:00+00:00 32215.4 156570.6 0.0 3326.3 33545.2 ... 132.9 63397.0 0 422.9 304163.427
2017-01-01 08:00:00+00:00 29604.7 152379.6 0.0 3246.0 33851.4 ... 133.2 64230.5 0 358.1 300871.117
2017-01-01 09:00:00+00:00 28495.9 149474.0 0.0 2973.1 35171.5 ... 131.9 65860.7 0 344.5 298908.514
2017-01-01 10:00:00+00:00 29304.8 146561.1 0.0 3161.2 34315.4 ... 133.8 67882.8 0 340.9 299825.531
... ... ... ... ... ... ... ... ... ... ... ...
2018-01-01 01:00:00+00:00 36071.3 216336.8 55.2 16093.1 93466.6 ... 140.4 75547.5 0 327.6 463542.027
2018-01-01 02:00:00+00:00 35339.9 213596.9 55.2 14378.4 97397.7 ... 114.6 75277.5 0 325.4 459252.079
2018-01-01 03:00:00+00:00 35051.4 217333.2 55.2 12334.3 96351.1 ... 107.3 69376.7 0 328.1 453214.866
2018-01-01 04:00:00+00:00 35220.7 220868.9 53.2 8520.8 98404.2 ... 116.9 60699.7 0 328.5 446139.366
2018-01-01 05:00:00+00:00 35392.1 223590.8 52.2 8980.9 103893.6 ... 131.1 48453.0 0 329.8 439107.888
[8760 rows x 12 columns]
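The resampled frame above still contains the rows that spill over into 2018. If only calendar-year 2017 is wanted, a small sketch (assuming the DatetimeIndex set up above) is to filter by year either before or after resampling:
# keep only 2017 rows, then aggregate each hour's twelve 5-minute readings
out_2017 = df.loc[df.index.year == 2017].resample('H').sum()
# or slice the already-resampled frame by label
out_2017 = out.loc['2017']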
To give a brief overview of what's going on, I am observing temperature fluctuations and have filtered the indoor and outdoor temperature data from an office to only the periods where the temperature fluctuates. These fluctuations only occur in the mornings and at night, as the temperature is controlled during the day. I will be using an ANN to learn from these fluctuations and model how long it would take for the temperature to change depending on other variables like OutdoorTemp, SolarDiffuseRate, etc.
Question 1:
How do I iterate by row, first looking at the times and adding a binary column where 0 would be morning and 1 would be night-time?
Question 2:
For each day, there will be a different number of rows for mornings and evenings, depending on how long it takes the temperature to change between 22 degrees and 17 degrees. How do I add a column, for each day and each morning and evening, which states the time it took for the temp to get from X to Y?
Basically adding or subtracting time to get the difference, then appending per morning or night.
OfficeTemp OutdoorTemp SolarDiffuseRate
DateTime
2006-01-01 07:15:00 19.915275 0.8125 0.0
2006-01-01 07:30:00 20.463506 0.8125 0.0
2006-01-01 07:45:00 20.885112 0.8125 0.0
2006-01-01 20:15:00 19.985398 8.3000 0.0
2006-01-01 20:30:00 19.157857 8.3000 0.0
... ... ... ...
2006-06-30 22:00:00 18.056205 22.0125 0.0
2006-06-30 22:15:00 17.993072 19.9875 0.0
2006-06-30 22:30:00 17.929643 19.9875 0.0
2006-06-30 22:45:00 17.867148 19.9875 0.0
2006-06-30 23:00:00 17.804429 19.9875 0.0
df = pd.DataFrame(index=pd.date_range('2006-01-01', '2006-06-30', freq='15min'))
df['OfficeTemp'] = np.random.normal(loc=20, scale=5, size=df.shape[0])
df['OutdoorTemp'] = np.random.normal(loc=12, scale=5, size=df.shape[0])
df['SolarDiffuseRate'] = 0.0
Question 1:
df['PartofDay'] = df.index.hour.map(lambda x: 0 if x < 12 else 1)
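An equivalent vectorized form (just a sketch, not from the original answer) produces the same 0/1 column without a Python-level lambda:
# 0 for hours before noon (morning), 1 from noon onwards (evening/night)
df['PartofDay'] = (df.index.hour >= 12).astype(int)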
For question 2, a tolerance would need to be defined (the temperature is never going to be exactly 17 or 22 degrees).
import numpy as np

def temp_change_duration(group):
    tol = 0.3  # how close a reading must be to the target temperature to count as a match
    # first timestamp where the office temperature is within tol of 17 degrees
    first_time = group.index[np.isclose(group['OfficeTemp'], 17, atol=tol)][0]
    # first timestamp where the office temperature is within tol of 22 degrees
    second_time = group.index[np.isclose(group['OfficeTemp'], 22, atol=tol)][0]
    return abs(second_time - first_time)
Then apply this function to our df:
# group by calendar date (rather than day-of-month, which would mix different months) and part of day
df.groupby([df.index.date, 'PartofDay']).apply(temp_change_duration)
This will get you most of the way there, but it will give funny answers with the normally distributed synthetic data I've generated. See if you can adapt temp_change_duration to work with your data.
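A slightly more defensive variant (a sketch only, not part of the original answer) could return a missing value when a group never comes within the tolerance of one of the thresholds, instead of raising an IndexError:
import pandas as pd
import numpy as np

def temp_change_duration_safe(group, low=17, high=22, tol=0.3):
    # timestamps where the office temperature is within tol of each threshold
    near_low = group.index[np.isclose(group['OfficeTemp'], low, atol=tol)]
    near_high = group.index[np.isclose(group['OfficeTemp'], high, atol=tol)]
    if len(near_low) == 0 or len(near_high) == 0:
        return pd.NaT  # no crossing observed in this morning/evening
    return abs(near_high[0] - near_low[0])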
I have a dataframe that looks like this:
Date DFW
242 2000-05-01 00:00:00 75.92
243 2000-05-01 12:00:00 75.02
244 2000-05-02 00:00:00 71.96
245 2000-05-02 12:00:00 75.92
246 2000-05-03 00:00:00 71.96
... ... ...
14991 2020-07-09 12:00:00 93.90
14992 2020-07-10 00:00:00 91.00
14993 2020-07-10 12:00:00 93.00
14994 2020-07-11 00:00:00 89.10
14995 2020-07-11 12:00:00 97.00
The df contains the maximum temperature for a specific location every 12 hours from May through July 11 for the years 2000-2020. I want to count the number of times the value is >90 and then store that count in a column where the row is the year. Should I use groupby to accomplish this?
Expected output:
Year count
2000 x
2001 y
... ...
2019 z
2020 a
You can do it with groupby:
# extract the years from dates
years = df['Date'].dt.year
# compare `DFW` with `90`
# gt90 will be just True or False
gt90 = df['DFW'].gt(90)
# sum the `True` by years
output = gt90.groupby(years).sum()
# set the years as normal column:
output = output.reset_index()
All that in one line:
df['DFW'].gt(90).groupby(years).sum().reset_index()
One possible approach is to first extract the year into a new column (let's say "year") and then filter, group, and count:
df['year'] = df['Date'].dt.year
df[df['DFW'] > 90].groupby('year').count().reset_index()
I have imported a CSV file into Python. It has readings at 5-minute intervals over a period of a month, with about 250 readings per 5-minute timestamp. Below is a sample of one row per timestamp. Is there a way to split the CSV into different dataframes grouped by date, or even by 5-minute interval, for plotting purposes? Like I mentioned, this dataset has 250 readings per 5-minute interval for a month, so I would like to do this without having to hard-code each dataframe for each day or each interval in the set.
df.head()
tmc_code measurement_tstamp ... miles road_order
0 112-05650 2018-05-01 00:00:00 ... 0.427814 768.0
1 112-05650 2018-05-01 00:05:00 ... 0.427814 768.0
2 112-05650 2018-05-01 00:10:00 ... 0.427814 768.0
3 112-05650 2018-05-01 00:15:00 ... 0.427814 768.0
4 112-05650 2018-05-01 00:20:00 ... 0.427814 768.0
What it sounds like to me is that you want a new DataFrame for each date. If that is what you desire, the following code will take your dataframe and make a list of dataframes, each of which will only contain data for one date.
# keep only the date portion (YYYY-MM-DD) of the timestamp string
df.measurement_tstamp = df.measurement_tstamp.str[:10]
# unique dates present in the data
l = df.measurement_tstamp.unique()
# build one DataFrame per date
data = [df.loc[df['measurement_tstamp']==i] for i in l]
Edit
If you want to do it by 5-minute interval instead, it's even simpler (just skip the date-truncation step above so measurement_tstamp still holds the full timestamps):
data = [df.loc[df['measurement_tstamp']==i] for i in df.measurement_tstamp.unique()]
That should do it
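An alternative sketch (not from the original answer) that avoids modifying the timestamp column is to parse it with pd.to_datetime and let groupby build one frame per date in a single pass:
# parse the timestamps, then collect one DataFrame per calendar date in a dict
df['measurement_tstamp'] = pd.to_datetime(df['measurement_tstamp'])
frames_by_date = {d: g for d, g in df.groupby(df['measurement_tstamp'].dt.date)}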
How do I resample a dataframe with a daily time-series index to yearly, but not from 1 Jan to 31 Dec? Instead, I want the yearly sum from 1 June to 31 May.
First I did this, which gives me the yearly sum from 1 Jan to 31 Dec:
df.resample(rule='A').sum()
I have tried using the base parameter, but it does not change the resampled sums.
df.resample(rule='A', base=100).sum()
Here is a part of my dataframe:
In []: df
Out[]:
Index ET P R
2010-01-01 00:00:00 -0.013 0.0 0.773
2010-01-02 00:00:00 0.0737 0.21 0.797
2010-01-03 00:00:00 -0.048 0.0 0.926
...
In []: df.resample(rule='A', base = 0, label='left').sum()
Out []:
Index
2009-12-31 00:00:00 424.131138 871.48 541.677405
2010-12-31 00:00:00 405.625780 939.06 575.163096
2011-12-31 00:00:00 461.586365 1064.82 710.507947
...
I would really appreciate it if anyone could help me figure out how to do this.
Thank you
Use 'AS-JUN' as the rule with resample:
# Example data
idx = pd.date_range('2017-01-01', '2018-12-31')
s = pd.Series(1, idx)
# Resample
s = s.resample('AS-JUN').sum()
The resulting output:
2016-06-01 151
2017-06-01 365
2018-06-01 214
Freq: AS-JUN, dtype: int64
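Applied to the daily dataframe from the question, the same rule should give the June-to-May yearly sums; a one-line sketch, assuming the index shown above:
df_yearly = df.resample('AS-JUN').sum()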