Avoid DataFrame.resample to change the hour - python

I am trying to extract the minimum value for each day in a dataset containing hourly prices. I want to do this for every hour separately, since I later want to add other information to each hour before combining the dataset again (which is why I want to keep the hour in the datetime).
This is my data:
Price_REG1 Price_REG2 ... Price_24_3 Price_24_4
date ...
2020-01-01 00:00:00 30.83 30.83 ... NaN NaN
2020-01-01 01:00:00 28.78 28.78 ... NaN NaN
2020-01-01 02:00:00 28.45 28.45 ... 30.83 30.83
2020-01-01 03:00:00 27.90 27.90 ... 28.78 28.78
2020-01-01 04:00:00 27.52 27.52 ... 28.45 28.45
To extract the minimum I use this command:
df_min_1 = df_hour_1[['Price_REG1', 'Price_REG2', 'Price_REG3',
                      'Price_REG4']].between_time('00:00', '23:00').resample('d').min()
Which leaves me with this:
Price_REG1 Price_REG2 Price_REG3 Price_REG4
date
2020-01-01 25.07 25.07 25.07 25.07
2020-01-02 12.07 12.07 12.07 12.07
2020-01-03 0.14 0.14 0.14 0.14
2020-01-04 3.83 3.83 3.83 3.83
2020-01-05 25.77 25.77 25.77 25.77
I understand that the resample does this, but I want to know if there is any way to avoid this, or if there is any other way to achieve the results I am after.
To clarify, this is what I would like to have:
Price_REG1 Price_REG2 Price_REG3 Price_REG4
date
2020-01-01 01:00:00 25.07 25.07 25.07 25.07
2020-01-02 01:00:00 12.07 12.07 12.07 12.07
2020-01-03 01:00:00 0.14 0.14 0.14 0.14
2020-01-04 01:00:00 3.83 3.83 3.83 3.83
2020-01-05 01:00:00 25.77 25.77 25.77 25.77

I did not find a nicer solution to this problem, but I managed to get where I wanted with this method:
import datetime
import pandas as pd

t = datetime.timedelta(hours=1)
df_min = df_min.reset_index()
df_min['date'] = df_min['date'] + t
df_min.set_index('date', inplace=True)
df_hour_1 = pd.concat([df_hour_1, df_min], axis=1)
That is, I first create a timedelta of 01:00:00, then reset the index to be able to add the timedelta to the date column. In this way I am able to concat df_hour and df_min while still keeping the time, so I can concat all 24 datasets in a later step.
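A slightly shorter variant (a sketch, assuming df_min still has the DatetimeIndex produced by resample) is to shift the index directly instead of resetting it:
import pandas as pd

# shift the daily labels from 00:00 to 01:00 directly on the index
df_min.index = df_min.index + pd.Timedelta(hours=1)
df_hour_1 = pd.concat([df_hour_1, df_min], axis=1)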

Related

How to convert hourly data to half hourly

I have the following dataframe:
datetime temp
0 2015-01-01 00:00:00 11.22
1 2015-01-01 01:00:00 11.32
2 2015-01-01 02:00:00 11.30
3 2015-01-01 03:00:00 11.25
4 2015-01-01 04:00:00 11.32
... ... ...
31339 2018-07-29 19:00:00 17.60
31340 2018-07-29 20:00:00 17.49
31341 2018-07-29 21:00:00 17.44
31342 2018-07-29 22:00:00 17.39
31343 2018-07-29 23:00:00 17.37
I want to convert this dataframe to have data at each half hour, and impute each new position with the mean of the previous and following values (or any similar interpolation), for example:
datetime temp
0 2015-01-01 00:00:00 11.00
1 2015-01-01 00:30:00 11.50
2 2015-01-01 01:00:00 12.00
Is there any pandas/datetime function to assist in this operation?
Thank you
You can use the resample() function in pandas. With it you set the frequency to down- or upsample to and then choose what to do with the result (mean, sum, etc.). In your case you can also interpolate between the values.
For this to work, your datetime column has to be of datetime dtype; then set it as the index.
df['datetime'] = pd.to_datetime(df['datetime'])
df.set_index('datetime', inplace=True)
Then you can resample to 30 minutes ('30T') and then interpolate.
df.resample('30T').interpolate()
Resulting in...
temp
datetime
2015-01-01 00:00:00 11.220
2015-01-01 00:30:00 11.270
2015-01-01 01:00:00 11.320
2015-01-01 01:30:00 11.310
2015-01-01 02:00:00 11.300
2015-01-01 02:30:00 11.275
2015-01-01 03:00:00 11.250
2015-01-01 03:30:00 11.285
2015-01-01 04:00:00 11.320
Read more about the frequency strings and resampling in the Pandas docs.
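Note that in recent pandas versions the 'T' alias is deprecated in favor of 'min', so the same call would be written as:
# equivalent call with the newer frequency alias
df.resample('30min').interpolate()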

Decimal rounding error: numpy floor and trunc not working correctly in python when used on pandas series

I am getting some strange behavior with np.floor and np.trunc when I am using pandas dataframes.
This is the behavior I expect, which is working:
np.floor(30.000)
Out[133]: 30.0
But you can see that when I apply it to a series, it treats "floating integers" wrongly and rounds them down by one. As a temporary fix I added 0.00001 to my entire dataframe, which is okay in my case because of the level I am rounding to, but I would like to know what is happening and how to do it correctly.
Edit: I specifically need the round down to zero feature.
sample_series
2020-01-01 00:00:00 26.750
2020-01-01 01:00:00 27.500
2020-01-01 02:00:00 28.250
2020-01-01 03:00:00 30.000
2020-01-01 04:00:00 30.625
2020-01-01 05:00:00 31.000
2020-01-01 06:00:00 33.375
2020-01-01 07:00:00 33.750
2020-01-01 08:00:00 34.000
In: sample_series.apply(np.floor)
Out:
2020-01-01 00:00:00 26.0
2020-01-01 01:00:00 27.0
2020-01-01 02:00:00 28.0
2020-01-01 03:00:00 29.0
2020-01-01 04:00:00 30.0
2020-01-01 05:00:00 30.0
2020-01-01 06:00:00 33.0
2020-01-01 07:00:00 33.0
2020-01-01 08:00:00 33.0
2020-01-01 09:00:00 33.0
2020-01-01 10:00:00 34.0
You may be able to use the np.around() function:
np.around([0.37, 1.64], decimals=1)
array([0.4, 1.6])
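The likely cause is binary floating point: a value displayed as 30.000 may actually be stored as 29.999999999999996, which floor correctly rounds down to 29. One workaround (a sketch, not from the original thread) is to round away the representation error before flooring, which preserves the round-down-to-zero behavior for genuinely fractional values:
import numpy as np
import pandas as pd

s = pd.Series([26.750, 27.500, 29.999999999999996, 30.625])

# round to a tolerance first, then floor;
# 29.999999999999996 becomes 30.0 instead of 29.0
print(np.floor(s.round(9)))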

How to resample df based on one column and add the values from another column?

I have a df as follows:
id dates values tz
1 2020-01-01 00:15:00 87.8 +01
2 2020-01-01 00:30:00 88.3 +01
3 2020-01-01 00:45:00 89.0 +01
4 2020-01-01 01:00:00 90.1 +01
5 2020-01-01 01:15:00 91.3 +01
6 2020-01-01 01:30:00 92.4 +01
7 2020-01-01 01:45:00 92.9 +01
8 2020-01-01 02:00:00 92.5 +01
9 2020-01-01 02:15:00 91.0 +01
10 2020-01-01 02:30:00 88.7 +01
11 2020-01-01 02:45:00 86.4 +01
12 2020-01-01 03:00:00 84.7 +01
What I would like to do is to group every 4 rows based on the id column, add up the values in the values column, and assign the sum to the row whose timestamp has minutes equal to 00.
Example:
id dates values tz
1 2020-01-01 00:15:00 87.8 +01
2 2020-01-01 00:30:00 88.3 +01
3 2020-01-01 00:45:00 89.0 +01
4 2020-01-01 01:00:00 90.1 +01
When I club the first 4 values, the output should be as follows:
id dates values tz
1 2020-01-01 01:00:00 355.2 +01 <--- (87.8+88.3+89.0+90.1 = 355.2)
and similarly for the other rows as well..
The desired output:
id dates values tz
1 2020-01-01 01:00:00 355.2 +01 <--- (87.8+88.3+89.0+90.1 = 355.2)
2 2020-01-01 02:00:00 369.1 +01 <--- (91.3+92.4+92.9+92.5 = 369.1)
3 2020-01-01 03:00:00 350.8 +01 <--- (91.0+88.7+86.4+84.7 = 350.8)
How can this be done?
You can aggregate each group of 4 rows by grouping on np.arange(len(df)) // 4, taking the sum of values and the last value of the other columns with GroupBy.agg:
import numpy as np

df = df.groupby(np.arange(len(df)) // 4).agg({'dates': 'last', 'values': 'sum', 'tz': 'last'})
print (df)
dates values tz
0 2020-01-01 01:00:00 355.2 1
1 2020-01-01 02:00:00 369.1 1
2 2020-01-01 03:00:00 350.8 1
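An alternative (a sketch, assuming dates holds parseable timestamps) is to let resample do the binning; closed='right' and label='right' assign each hourly sum to the end of its bin, i.e. the timestamp whose minutes are 00:
import pandas as pd

df['dates'] = pd.to_datetime(df['dates'])
out = (df.set_index('dates')
         .resample('H', closed='right', label='right')
         .agg({'values': 'sum', 'tz': 'last'})
         .reset_index())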

Concatenate all dataframe columns into a single column

I have a dataframe that looks roughly like:
01/01/19 02/01/19 03/01/19 04/01/19
hour
1.0 27.08 47.73 54.24 10.0
2.0 26.06 49.53 46.09 22.0
...
24.0 12.0 34.0 22.0 40.0
I'd like to reduce it to a single column with a proper datetime index, concatenating all the columns. Is there a smart pandas way to do it?
Expected result... something like:
01/01/19 00:00:00 27.08
01/01/19 01:00:00 26.06
...
01/01/19 23:00:00 12.00
02/01/19 00:00:00 47.73
02/01/19 01:00:00 49.53
...
02/01/19 23:00:00 34.00
...
You can stack and then fix the index using pd.to_datetime and pd.to_timedelta:
u = df.stack()
u.index = (pd.to_datetime(u.index.get_level_values(1), dayfirst=True)
           + pd.to_timedelta(u.index.get_level_values(0) - 1, unit='h'))
u.sort_index()
2019-01-01 00:00:00 27.08
2019-01-01 01:00:00 26.06
2019-01-01 23:00:00 12.00
2019-01-02 00:00:00 47.73
2019-01-02 01:00:00 49.53
2019-01-02 23:00:00 34.00
2019-01-03 00:00:00 54.24
2019-01-03 01:00:00 46.09
2019-01-03 23:00:00 22.00
2019-01-04 00:00:00 10.00
2019-01-04 01:00:00 22.00
2019-01-04 23:00:00 40.00
dtype: float64
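An equivalent route (a sketch, assuming the index is named hour as in the question) is to melt to long format and build the timestamps explicitly:
long = df.reset_index().melt(id_vars='hour', var_name='date', value_name='value')
long['datetime'] = (pd.to_datetime(long['date'], dayfirst=True)
                    + pd.to_timedelta(long['hour'] - 1, unit='h'))
result = long.set_index('datetime')['value'].sort_index()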

How to resample with pandas in a conservative way?

I'm trying to use a pandas dataframe/series to store measurements. These are used as inputs for a thermodynamic simulation, so when resampling, the sum of heat (energy) and the temperature values should stay constant.
import pandas as pd
import numpy as np
start_date = '2015-01-01 '
my_timing = '60T'
my_ts_index = pd.date_range(start_date, periods=6, freq=my_timing, name='Time')
# read from a csv or sql data base
Heat = [120,200,210,0 ,50, 180]
Temperature = [23.2, 19.1, 15.3, 25.2, 20.1, 0.0]
data = np.array([Heat, Temperature])
columnNames = ['Heat', 'Temperature']
my_ts = pd.DataFrame(np.transpose(data), columns=columnNames, index=my_ts_index)
print(my_ts)
# GUI checks if len(my_ts)%new_timestep==0
# increase frequency
tmp = my_ts.resample('25T').mean()
tmp['Heat'] = tmp['Heat'].fillna(0)
tmp['Temperature'] = tmp['Temperature'].ffill()
print(tmp)

# reduce frequency
tmp = tmp.resample('60T').agg({'Heat': np.sum, 'Temperature': np.average})
print(tmp)

tmp = my_ts.resample('75T').agg({'Heat': np.sum, 'Temperature': np.average})
print(tmp)
Output:
Heat Temperature
Time
2015-01-01 00:00:00 120.0 23.2
2015-01-01 01:00:00 200.0 19.1
2015-01-01 02:00:00 210.0 15.3
2015-01-01 03:00:00 0.0 25.2
2015-01-01 04:00:00 50.0 20.1
2015-01-01 05:00:00 180.0 0.0
Heat Temperature
Time
2015-01-01 00:00:00 120.0 23.2
2015-01-01 00:25:00 0.0 23.2
2015-01-01 00:50:00 200.0 19.1
2015-01-01 01:15:00 0.0 19.1
2015-01-01 01:40:00 210.0 15.3
2015-01-01 02:05:00 0.0 15.3
2015-01-01 02:30:00 0.0 15.3
2015-01-01 02:55:00 0.0 25.2
2015-01-01 03:20:00 0.0 25.2
2015-01-01 03:45:00 50.0 20.1
2015-01-01 04:10:00 0.0 20.1
2015-01-01 04:35:00 0.0 20.1
2015-01-01 05:00:00 180.0 0.0
Heat Temperature
Time
2015-01-01 00:00:00 320.0 21.833333
2015-01-01 01:00:00 210.0 17.200000
2015-01-01 02:00:00 0.0 18.600000
2015-01-01 03:00:00 50.0 22.650000
2015-01-01 04:00:00 0.0 20.100000
2015-01-01 05:00:00 180.0 0.000000
Heat Temperature
Time
2015-01-01 00:00:00 320.0 21.15
2015-01-01 01:15:00 210.0 15.30
2015-01-01 02:30:00 0.0 25.20
2015-01-01 03:45:00 50.0 20.10
2015-01-01 05:00:00 180.0 0.00
Which operations should be used to resample data to a higher frequency and back to the original without losing anything?
(See: Outputs 1 & 3; the conservation of energy and temperature shown in Output 2 still has to hold.)
How can I make the mean/average/sum start at the 2nd value, so the first stays untouched for initial-value reasons? In other words: how do I assign the mean to the next hour rather than the previous one?
(See: Output 4, first value pair.)
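For the second question, a possible direction (a sketch, not a verified answer): resample accepts closed and label arguments, so labelling each bin by its right edge assigns the aggregate to the following timestamp instead of the preceding one, and the first sample ends up alone in its own bin:
# label each 75-minute bin by its right edge instead of its left
tmp = (my_ts.resample('75T', closed='right', label='right')
            .agg({'Heat': np.sum, 'Temperature': np.average}))
print(tmp)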
