How to resample with pandas in a conservative way? - python

I'm trying to use a pandas DataFrame/Series to store measurements. These are used as inputs for a thermodynamic simulation, so when resampling, the total heat (energy) and the temperature values should stay constant.
import pandas as pd
import numpy as np

start_date = '2015-01-01'
my_timing = '60T'
my_ts_index = pd.date_range(start_date, periods=6, freq=my_timing, name='Time')

# read from a csv or sql database
Heat = [120, 200, 210, 0, 50, 180]
Temperature = [23.2, 19.1, 15.3, 25.2, 20.1, 0.0]
data = np.array([Heat, Temperature])
columnNames = ['Heat', 'Temperature']
my_ts = pd.DataFrame(np.transpose(data), columns=columnNames, index=my_ts_index)
print(my_ts)

# GUI checks if len(my_ts) % new_timestep == 0
# increase frequency
tmp = my_ts.resample('25T').mean()
tmp['Heat'] = tmp['Heat'].fillna(0)
tmp['Temperature'] = tmp['Temperature'].ffill()
print(tmp)

# reduce frequency
tmp = tmp.resample('60T').agg({'Heat': np.sum, 'Temperature': np.average})
print(tmp)

tmp = my_ts.resample('75T').agg({'Heat': np.sum, 'Temperature': np.average})
print(tmp)
Output:
Heat Temperature
Time
2015-01-01 00:00:00 120.0 23.2
2015-01-01 01:00:00 200.0 19.1
2015-01-01 02:00:00 210.0 15.3
2015-01-01 03:00:00 0.0 25.2
2015-01-01 04:00:00 50.0 20.1
2015-01-01 05:00:00 180.0 0.0
Heat Temperature
Time
2015-01-01 00:00:00 120.0 23.2
2015-01-01 00:25:00 0.0 23.2
2015-01-01 00:50:00 200.0 19.1
2015-01-01 01:15:00 0.0 19.1
2015-01-01 01:40:00 210.0 15.3
2015-01-01 02:05:00 0.0 15.3
2015-01-01 02:30:00 0.0 15.3
2015-01-01 02:55:00 0.0 25.2
2015-01-01 03:20:00 0.0 25.2
2015-01-01 03:45:00 50.0 20.1
2015-01-01 04:10:00 0.0 20.1
2015-01-01 04:35:00 0.0 20.1
2015-01-01 05:00:00 180.0 0.0
Heat Temperature
Time
2015-01-01 00:00:00 320.0 21.833333
2015-01-01 01:00:00 210.0 17.200000
2015-01-01 02:00:00 0.0 18.600000
2015-01-01 03:00:00 50.0 22.650000
2015-01-01 04:00:00 0.0 20.100000
2015-01-01 05:00:00 180.0 0.000000
Heat Temperature
Time
2015-01-01 00:00:00 320.0 21.15
2015-01-01 01:15:00 210.0 15.30
2015-01-01 02:30:00 0.0 25.20
2015-01-01 03:45:00 50.0 20.10
2015-01-01 05:00:00 180.0 0.00
Which operations should be used to resample data to a higher frequency and back to the original without losing anything?
(See outputs 1 and 3; the conservation of energy and temperature visible in output 2 still has to hold.)
How can I make the mean/average/sum start at the 2nd value so the first stays untouched, for initial-value reasons? In other words: how do I assign the mean to the next hour rather than the previous one?
(See the first value pair of output 4.)
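A minimal sketch of one possible approach (not from the original post): pick an upsampling step that divides the original one evenly, e.g. '30T' for hourly data, so every original bin maps onto a whole number of new bins; then summing the extensive quantity (Heat) and taking the first value of the level quantity (Temperature) round-trips exactly. For the second question, resample() accepts closed='right' and label='right', which labels each bin by its right edge and leaves the very first observation in a bin of its own. The variable my_ts is reused from the code above.
# Sketch, assuming my_ts from the question (hourly Heat/Temperature data).
# Upsample to 30T, which divides 60T evenly:
up = my_ts.resample('30T').mean()
up['Heat'] = up['Heat'].fillna(0)              # new bins carry no extra energy
up['Temperature'] = up['Temperature'].ffill()  # temperature is a level: hold it

# Downsample back: sum the extensive quantity, take the first level value.
down = up.resample('60T').agg({'Heat': 'sum', 'Temperature': 'first'})
assert down.equals(my_ts)  # should round-trip exactly on this evenly dividing grid

# Label each 75T bin by its right edge so the first value stays untouched:
shifted = my_ts.resample('75T', closed='right', label='right').agg(
    {'Heat': 'sum', 'Temperature': 'mean'})
print(shifted)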

Related

How to convert hourly data to half hourly

I have the following dataframe:
datetime temp
0 2015-01-01 00:00:00 11.22
1 2015-01-01 01:00:00 11.32
2 2015-01-01 02:00:00 11.30
3 2015-01-01 03:00:00 11.25
4 2015-01-01 04:00:00 11.32
... ... ...
31339 2018-07-29 19:00:00 17.60
31340 2018-07-29 20:00:00 17.49
31341 2018-07-29 21:00:00 17.44
31342 2018-07-29 22:00:00 17.39
31343 2018-07-29 23:00:00 17.37
I want to convert this dataframe to have data each half hour, and impute each new position with the mean of the previous and the following value (or any similar interpolation), that is, for example:
datetime temp
0 2015-01-01 00:00:00 11.00
1 2015-01-01 00:30:00 11.50
2 2015-01-01 01:00:00 12.00
Is there any pandas/datetime function to assist in this operation?
Thank you
You can use the resample() function in pandas. With it you set the frequency to down-/upsample to, and then what to do with the result (mean, sum, etc.). In your case you can also interpolate between the values.
For this to work, your datetime column has to be of datetime dtype; then set it as the index.
df['datetime'] = pd.to_datetime(df['datetime'])
df.set_index('datetime', inplace=True)
Then you can resample to 30 minutes ('30T') and then interpolate.
df.resample('30T').interpolate()
Resulting in...
temp
datetime
2015-01-01 00:00:00 11.220
2015-01-01 00:30:00 11.270
2015-01-01 01:00:00 11.320
2015-01-01 01:30:00 11.310
2015-01-01 02:00:00 11.300
2015-01-01 02:30:00 11.275
2015-01-01 03:00:00 11.250
2015-01-01 03:30:00 11.285
2015-01-01 04:00:00 11.320
Read more about the frequency strings and resampling in the Pandas docs.

Concatenate all dataframe columns into a single column

I have a dataframe that looks roughly like:
01/01/19 02/01/19 03/01/19 04/01/19
hour
1.0 27.08 47.73 54.24 10.0
2.0 26.06 49.53 46.09 22.0
...
24.0 12.0 34.0 22.0 40.0
I'd like to reduce it to a single column with a proper datetime index by concatenating all the columns. Is there a smart pandas way to do it?
Expected result... something like:
01/01/19 00:00:00 27.08
01/01/19 01:00:00 26.08
...
01/01/19 23:00:00 12.00
02/01/19 00:00:00 47.73
02/01/19 01:00:00 49.53
...
02/01/19 23:00:00 34.00
...
You can stack and then fix the index using pd.to_datetime and pd.to_timedelta:
u = df.stack()
u.index = (pd.to_datetime(u.index.get_level_values(1), dayfirst=True)
           + pd.to_timedelta(u.index.get_level_values(0) - 1, unit='h'))
u.sort_index()
2019-01-01 00:00:00 27.08
2019-01-01 01:00:00 26.06
2019-01-01 23:00:00 12.00
2019-01-02 00:00:00 47.73
2019-01-02 01:00:00 49.53
2019-01-02 23:00:00 34.00
2019-01-03 00:00:00 54.24
2019-01-03 01:00:00 46.09
2019-01-03 23:00:00 22.00
2019-01-04 00:00:00 10.00
2019-01-04 01:00:00 22.00
2019-01-04 23:00:00 40.00
dtype: float64
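An alternative sketch using melt instead of stack (same idea, not part of the original answer; it assumes the question's df with its hour index and date columns):
long = df.reset_index().melt(id_vars='hour', var_name='date', value_name='value')
long.index = (pd.to_datetime(long['date'], dayfirst=True)
              + pd.to_timedelta(long['hour'] - 1, unit='h'))
result = long['value'].sort_index()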

Sampling a dataframe considering NaN values + Pandas

I have a data frame like the one below. I want to sample it with '3S'.
There are situations where NaN is present. What I expect is that the data frame is sampled with '3S', and whenever a 'NaN' is found in between, the sampling stops there and restarts from that index. I tried to achieve this with dataframe.apply, but it gets very complex. Is there a shorter way?
df.sample(n=3)
Code to generate Input:
import pandas as pd
import numpy as np

index = pd.date_range('1/1/2000', periods=13, freq='T')
series = pd.DataFrame(range(13), index=index)
series.iloc[4] = np.nan
series.iloc[10] = np.nan
print(series)
I tried to do the sampling, but after that I have no clue how to proceed.
2015-01-01 00:00:00 0.0
2015-01-01 01:00:00 1.0
2015-01-01 02:00:00 2.0
2015-01-01 03:00:00 2.0
2015-01-01 04:00:00 NaN
2015-01-01 05:00:00 3.0
2015-01-01 06:00:00 4.0
2015-01-01 07:00:00 4.0
2015-01-01 08:00:00 4.0
2015-01-01 09:00:00 NaN
2015-01-01 10:00:00 3.0
2015-01-01 11:00:00 4.0
2015-01-01 12:00:00 4.0
The new data frame should be sampled based on '3S', but should also take any 'NaN' into account and restart the sampling where the 'NaN' records are found.
Expected Output:
2015-01-01 02:00:00 2.0 -- Sampling after 3S
2015-01-01 03:00:00 2.0 -- Print because NaN has found in Next
2015-01-01 04:00:00 NaN -- print NaN record
2015-01-01 07:00:00 4.0 -- Sampling after 3S
2015-01-01 08:00:00 4.0 -- Print because NaN has found in Next
2015-01-01 09:00:00 NaN -- print NaN record
2015-01-01 12:00:00 4.0 -- Sampling after 3S
Use:
import pandas as pd
import numpy as np

index = pd.date_range('1/1/2000', periods=13, freq='H')
df = pd.DataFrame({'col': range(13)}, index=index)
df.iloc[4, 0] = np.nan
df.iloc[9, 0] = np.nan
print(df)
col
2000-01-01 00:00:00 0.0
2000-01-01 01:00:00 1.0
2000-01-01 02:00:00 2.0
2000-01-01 03:00:00 3.0
2000-01-01 04:00:00 NaN
2000-01-01 05:00:00 5.0
2000-01-01 06:00:00 6.0
2000-01-01 07:00:00 7.0
2000-01-01 08:00:00 8.0
2000-01-01 09:00:00 NaN
2000-01-01 10:00:00 10.0
2000-01-01 11:00:00 11.0
2000-01-01 12:00:00 12.0
m = df['col'].isna()           # mask of missing values
s1 = m.ne(m.shift()).cumsum()  # ids of consecutive runs
t = pd.Timedelta(2, unit='H')
# keep rows that are at least 2 hours past the start of their run
mask = df.index >= df.groupby(s1)['col'].transform(lambda x: x.index[0]) + t
df1 = df[mask | m]
print(df1)
col
2000-01-01 02:00:00 2.0
2000-01-01 03:00:00 3.0
2000-01-01 04:00:00 NaN
2000-01-01 07:00:00 7.0
2000-01-01 08:00:00 8.0
2000-01-01 09:00:00 NaN
2000-01-01 12:00:00 12.0
Explanation:
Create a mask of the missing values with Series.isna.
Create groups of consecutive values by comparing the mask with its shifted self via Series.ne (!=):
print (s1)
2000-01-01 00:00:00 1
2000-01-01 01:00:00 1
2000-01-01 02:00:00 1
2000-01-01 03:00:00 1
2000-01-01 04:00:00 2
2000-01-01 05:00:00 3
2000-01-01 06:00:00 3
2000-01-01 07:00:00 3
2000-01-01 08:00:00 3
2000-01-01 09:00:00 4
2000-01-01 10:00:00 5
2000-01-01 11:00:00 5
2000-01-01 12:00:00 5
Freq: H, Name: col, dtype: int32
Get the first index value per group, add a timedelta (2 hours here, matching the expected output), and compare against the DatetimeIndex.
Finally, filter by boolean indexing, chaining the two masks with | (bitwise OR). The intermediate cutoff values are shown in the snippet below.
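To make the cutoff step concrete, here is a small hedged snippet (reusing df, s1 and t from above) that prints what each row is compared against:
starts = df.groupby(s1)['col'].transform(lambda x: x.index[0])  # start time of each run
cutoff = starts + t                                             # start + 2 hours
print(pd.DataFrame({'group': s1, 'cutoff': cutoff, 'keep': df.index >= cutoff}))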
One way would be to fill the NAs with 0:
df['Col_of_Interest'] = df['Col_of_Interest'].fillna(0)
And then do the resampling on the series (if datetime is your index):
series.resample('30S').asfreq()

Using pandas dataframe and matplotlib to manipulate data from a csv file into a plot

Here is what I'm trying to do: build a dataframe that has a datetime index created from column 0. Use the resample function over a quarterly period and create a plot that shows the quarterly precipitation totals over the 14-year period.
Second plot:
make a plot of the average monthly precipitation and the monthly standard deviation. Plot both values on the same axes.
Here's my code so far:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
plt.style.use('seaborn-whitegrid')
matplotlib.rcParams['figure.figsize'] = (10.0, 4.0)
df = pd.read_csv("ColumbusPrecipData.csv")
df.set_index("date", inplace = True)
#df['date'] = pd.to_datetime(df[['']])
print(df)
#build plots
#axes = plt.subplot()
#start = pd.to_datetime
#end = pd.to_datetime
#axes.set_xlim(start, end)
#axes.set_title("")
#axes.set_ylabel("")
#axes.tick_params(axis='x', rotation=45)
#axes.legend(loc='best')
Here's what the dataframe looks like:
Unnamed: 0 Precip
0 2000-01-01 01:00:00 0.0
1 2000-01-01 02:00:00 0.0
2 2000-01-01 03:00:00 0.0
3 2000-01-01 04:00:00 0.0
4 2000-01-01 05:00:00 0.0
5 2000-01-01 06:00:00 0.0
6 2000-01-01 07:00:00 0.0
7 2000-01-01 08:00:00 0.0
8 2000-01-01 09:00:00 0.0
9 2000-01-01 10:00:00 0.0
10 2000-01-01 11:00:00 0.0
11 2000-01-01 12:00:00 0.0
12 2000-01-01 13:00:00 0.0
13 2000-01-01 14:00:00 0.0
14 2000-01-01 15:00:00 0.0
15 2000-01-01 16:00:00 0.0
16 2000-01-01 17:00:00 0.0
17 2000-01-01 18:00:00 0.0
18 2000-01-01 19:00:00 0.0
19 2000-01-01 20:00:00 0.0
20 2000-01-01 21:00:00 0.0
21 2000-01-01 22:00:00 0.0
22 2000-01-01 23:00:00 0.0
23 2000-01-02 00:00:00 0.0
24 2000-01-02 01:00:00 0.0
25 2000-01-02 02:00:00 0.0
26 2000-01-02 03:00:00 0.0
27 2000-01-02 04:00:00 0.0
28 2000-01-02 05:00:00 0.0
29 2000-01-02 06:00:00 0.0
... ... ...
122696 2013-12-30 09:00:00 0.0
122697 2013-12-30 10:00:00 0.0
122698 2013-12-30 11:00:00 0.0
122699 2013-12-30 12:00:00 0.0
122700 2013-12-30 13:00:00 0.0
122701 2013-12-30 14:00:00 0.0
122702 2013-12-30 15:00:00 0.0
122703 2013-12-30 16:00:00 0.0
122704 2013-12-30 17:00:00 0.0
122705 2013-12-30 18:00:00 0.0
122706 2013-12-30 19:00:00 0.0
122707 2013-12-30 20:00:00 0.0
122708 2013-12-30 21:00:00 0.0
122709 2013-12-30 22:00:00 0.0
122710 2013-12-30 23:00:00 0.0
122711 2013-12-31 00:00:00 0.0
122712 2013-12-31 01:00:00 0.0
122713 2013-12-31 02:00:00 0.0
122714 2013-12-31 03:00:00 0.0
122715 2013-12-31 04:00:00 0.0
122716 2013-12-31 05:00:00 0.0
122717 2013-12-31 06:00:00 0.0
122718 2013-12-31 07:00:00 0.0
122719 2013-12-31 08:00:00 0.0
122720 2013-12-31 09:00:00 0.0
122721 2013-12-31 10:00:00 0.0
122722 2013-12-31 11:00:00 0.0
122723 2013-12-31 12:00:00 0.0
122724 2013-12-31 13:00:00 0.0
122725 2013-12-31 14:00:00 0.0
[122726 rows x 2 columns]
df = df.rename(columns={"Unnamed: 0": "date"})
df = df.set_index(pd.DatetimeIndex(df['date']))
Then
df1 = df.groupby(pd.Grouper(freq='M')).mean()
plt.plot(df1)
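The question also asks for a quarterly-totals plot and a combined monthly mean/std plot. Here is a hedged sketch building on the df prepared above (it assumes the precipitation column is named 'Precip', as in the printed frame):
import matplotlib.pyplot as plt

# quarterly precipitation totals over the full period
quarterly = df['Precip'].resample('Q').sum()
ax = quarterly.plot(title='Quarterly precipitation totals')
ax.set_ylabel('Precip')
plt.show()

# monthly mean and standard deviation on the same axes
monthly = df['Precip'].resample('M').agg(['mean', 'std'])
ax = monthly.plot(title='Monthly precipitation: mean and std')
ax.set_ylabel('Precip')
plt.show()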

How to apply a expanding window formula that restarts with change in date in Pandas dataframe?

My data looks like this
Date and Time Close dif
2015/01/01 17:00:00.211 2030.25 0.3
2015/01/01 17:00:02.456 2030.75 0.595137615
2015/01/01 23:55:01.491 2037.25 2.432613592
2015/01/02 00:02:01.955 2036.75 -0.4
2015/01/02 00:04:04.887 2036.5 -0.391144414
2015/01/02 15:14:56.207 2021.5 -4.732676608
2015/01/05 15:14:59.020 2021.5 -4.731171953
2015/01/05 17:00:00.105 2020.5 0
2015/01/05 17:00:01.077 2021 0.423093923
I want to do a cumsum of the dif column that resets every day, so the output would look like:
Date and Time Close dif Cum_
2015/01/01 17:00:00.211 2030.25 0.3 0.3
2015/01/01 17:00:02.456 2030.75 0.5 0.8
2015/01/01 23:55:01.491 2037.25 2.4 3.2
2015/01/02 00:02:01.955 2036.75 0.4 0.4
2015/01/02 00:04:04.887 2036.5 0.3 0.7
2015/01/02 15:14:56.207 2021.5 4.7 5.0
2015/01/05 17:00:00.020 2021.5 4.7 4.7
2015/01/05 17:00:00.105 2020.5 0 4.7
2015/01/05 17:00:01.077 2021 0.4 5.1
Thanks
Using a similar example:
import numpy as np
import pandas as pd

df = pd.DataFrame({'time': pd.date_range(start='2015-01-01', periods=100, freq='H'),
                   'value': np.random.random(100)}).set_index('time')
print(df.groupby(pd.Grouper(freq='D')).apply(lambda x: x.cumsum()))
value
time
2015-01-01 00:00:00 0.112809
2015-01-01 01:00:00 0.175091
2015-01-01 02:00:00 0.257127
2015-01-01 03:00:00 0.711317
2015-01-01 04:00:00 1.372902
2015-01-01 05:00:00 1.544617
2015-01-01 06:00:00 1.748132
2015-01-01 07:00:00 2.547540
2015-01-01 08:00:00 2.799640
2015-01-01 09:00:00 2.913003
2015-01-01 10:00:00 3.883643
2015-01-01 11:00:00 3.926428
2015-01-01 12:00:00 4.045293
2015-01-01 13:00:00 4.214375
2015-01-01 14:00:00 4.456385
2015-01-01 15:00:00 5.374335
2015-01-01 16:00:00 5.828024
2015-01-01 17:00:00 6.295117
2015-01-01 18:00:00 7.171010
2015-01-01 19:00:00 7.907834
2015-01-01 20:00:00 8.132203
2015-01-01 21:00:00 9.007994
2015-01-01 22:00:00 9.755925
2015-01-01 23:00:00 10.373546
2015-01-02 00:00:00 0.797521
2015-01-02 01:00:00 1.582709
2015-01-02 02:00:00 1.811771
2015-01-02 03:00:00 2.493248
2015-01-02 04:00:00 3.278923
2015-01-02 05:00:00 3.626356
... ...
2015-01-03 22:00:00 11.625891
2015-01-03 23:00:00 12.597532
2015-01-04 00:00:00 0.075442
2015-01-04 01:00:00 0.155059
2015-01-04 02:00:00 0.754960
2015-01-04 03:00:00 0.926798
2015-01-04 04:00:00 1.890215
2015-01-04 05:00:00 2.734722
2015-01-04 06:00:00 2.803935
2015-01-04 07:00:00 3.103064
2015-01-04 08:00:00 3.727508
2015-01-04 09:00:00 4.117465
2015-01-04 10:00:00 4.250926
2015-01-04 11:00:00 4.996832
2015-01-04 12:00:00 5.081889
2015-01-04 13:00:00 5.493243
2015-01-04 14:00:00 5.987519
2015-01-04 15:00:00 6.719041
2015-01-04 16:00:00 7.325912
2015-01-04 17:00:00 8.163208
2015-01-04 18:00:00 9.015092
2015-01-04 19:00:00 9.062396
2015-01-04 20:00:00 9.350298
2015-01-04 21:00:00 9.947669
2015-01-04 22:00:00 10.820609
2015-01-04 23:00:00 11.165523
2015-01-05 00:00:00 0.385323
2015-01-05 01:00:00 0.999182
2015-01-05 02:00:00 1.240272
2015-01-05 03:00:00 1.398086
So in your example, set the index with df.set_index('Date and Time'), then group by day and apply. You can of course assign the result back to the original DataFrame; a minimal sketch follows.
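Applied to the question's frame, a hedged sketch (it assumes the 'Date and Time' column still needs parsing; pd.Grouper is the current replacement for the removed pd.TimeGrouper):
import pandas as pd

# assumes df has columns 'Date and Time', 'Close' and 'dif' as in the question
df['Date and Time'] = pd.to_datetime(df['Date and Time'])
df = df.set_index('Date and Time')
# cumulative sum of 'dif' that resets at the start of each day
df['Cum_'] = df.groupby(pd.Grouper(freq='D'))['dif'].cumsum()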
