faster way to calculate a rolling sum in a dataframe - python

To calculate a volume weighted moving average (VWMA), I am collecting sum(price*volume) and dividing it by sum(volume).
I need a faster way to get a value from the previous row and add it to a value in the current row.
I have the following dataframe:
import pandas as pd
from itertools import repeat

df = pd.DataFrame({
    'dtime': ['16:00', '15:00', '14:00', '13:00', '12:00', '11:00', '10:00', '09:00',
              '08:00', '07:00', '06:00', '05:00', '04:00', '03:00', '02:00', '01:00'],
    'time': [1800, 1740, 1680, 1620, 1560, 1500, 1440, 1380,
             1320, 1260, 1200, 1140, 1080, 1020, 960, 900],
    'price': [100.1, 102.7, 108.5, 105.3, 107.1, 103.4, 101.8, 102.7,
              101.6, 99.8, 100.2, 97.7, 99.3, 100.1, 102.5, 103.9],
    'volume': [6.0, 6.5, 5.4, 6.3, 6.4, 7.1, 6.7, 6.2,
               5.7, 1.2, 2.4, 3.9, 5.2, 8.9, 7.2, 6.5]
}, columns=['dtime', 'time', 'price', 'volume']).set_index('dtime')

df.insert(df.shape[1], "PV", df['price'] * df['volume'])
df.insert(df.shape[1], "flag", list(repeat(0.0, len(df))))
df.insert(df.shape[1], "PVsum_2", list(repeat(0.0, len(df))))
df.insert(df.shape[1], "Vsum_2", list(repeat(0.0, len(df))))
df.insert(df.shape[1], "VWMA_2", list(repeat(0.0, len(df))))
Which is
df =
time price volume PV flag PVsum_2 Vsum_2 VWMA_2
dtime
16:00 1800 100.1 6.0 600.60 0.0 0.0 0.0 0.0
15:00 1740 102.7 6.5 667.55 0.0 0.0 0.0 0.0
14:00 1680 108.5 5.4 585.90 0.0 0.0 0.0 0.0
13:00 1620 105.3 6.3 663.39 0.0 0.0 0.0 0.0
12:00 1560 107.1 6.4 685.44 0.0 0.0 0.0 0.0
11:00 1500 103.4 7.1 734.14 0.0 0.0 0.0 0.0
10:00 1440 101.8 6.7 682.06 0.0 0.0 0.0 0.0
09:00 1380 102.7 6.2 636.74 0.0 0.0 0.0 0.0
08:00 1320 101.6 5.7 579.12 0.0 0.0 0.0 0.0
07:00 1260 99.8 1.2 119.76 0.0 0.0 0.0 0.0
06:00 1200 100.2 2.4 240.48 0.0 0.0 0.0 0.0
05:00 1140 97.7 3.9 381.03 0.0 0.0 0.0 0.0
04:00 1080 99.3 5.2 516.36 0.0 0.0 0.0 0.0
03:00 1020 100.1 8.9 890.89 0.0 0.0 0.0 0.0
02:00 960 102.5 7.2 738.00 0.0 0.0 0.0 0.0
01:00 900 103.9 6.5 675.35 0.0 0.0 0.0 0.0
Right now I am using a for loop that checks each row to see whether 'flag' is set.
#----pseudo code----
# for each row in df (from bottom to top, excluding the very bottom row):
#     if flag[row] is not set:
#         PVsum_2[row] = PV[row] + PV[row + 1]
#         Vsum_2[row] = volume[row] + volume[row + 1]
#         VWMA_2[row] = PVsum_2[row] / Vsum_2[row]
#         flag[row] = 1.0
#----pseudo code----
# map column names to their positions in the original frame
# ('dtime' is position 0; since it becomes the index, subtract 1
# to get the positional column index for iloc)
my_dict = {'dtime': 0,
           'time': 1,
           'price': 2,
           'volume': 3,
           'PV': 4,
           'flag': 5,
           'PVsum_2': 6,
           'Vsum_2': 7,
           'VWMA_2': 8}

for row in reversed(range(len(df) - 1)):
    # if flag value is not set (i.e. flag == 0)
    if not df['flag'][row]:
        # sum of current and previous PV (price*volume) values
        a = df['PV'][row] + df['PV'][row + 1]
        df.iloc[row, my_dict['PVsum_2'] - 1] = a
        # sum of current and previous volumes
        b = df['volume'][row] + df['volume'][row + 1]
        df.iloc[row, my_dict['Vsum_2'] - 1] = b
        # PVsum_2 / Vsum_2
        c = (a / b) if b != 0.0 else 0.0
        df.iloc[row, my_dict['VWMA_2'] - 1] = c
        # set flag value to 1.0
        df.iloc[row, my_dict['flag'] - 1] = 1.0
But this takes too long on large data sets (500+ rows).
I'm looking for something faster and more elegant.
The dataframe should look like this when it is done (notice the bottom row has not been altered):
df =
time price volume PV flag PVsum_2 Vsum_2 VWMA_2
dtime
16:00 1800 100.1 6.0 600.60 1.0 1268.15 12.5 101.452000
15:00 1740 102.7 6.5 667.55 1.0 1253.45 11.9 105.331933
14:00 1680 108.5 5.4 585.90 1.0 1249.29 11.7 106.776923
13:00 1620 105.3 6.3 663.39 1.0 1348.83 12.7 106.207087
12:00 1560 107.1 6.4 685.44 1.0 1419.58 13.5 105.154074
11:00 1500 103.4 7.1 734.14 1.0 1416.20 13.8 102.623188
10:00 1440 101.8 6.7 682.06 1.0 1318.80 12.9 102.232558
09:00 1380 102.7 6.2 636.74 1.0 1215.86 11.9 102.173109
08:00 1320 101.6 5.7 579.12 1.0 698.88 6.9 101.286957
07:00 1260 99.8 1.2 119.76 1.0 360.24 3.6 100.066667
06:00 1200 100.2 2.4 240.48 1.0 621.51 6.3 98.652381
05:00 1140 97.7 3.9 381.03 1.0 897.39 9.1 98.614286
04:00 1080 99.3 5.2 516.36 1.0 1407.25 14.1 99.804965
03:00 1020 100.1 8.9 890.89 1.0 1628.89 16.1 101.173292
02:00 960 102.5 7.2 738.00 1.0 1413.35 13.7 103.164234
01:00 900 103.9 6.5 675.35 0.0 0.00 0.0 0.000000
Eventually new data will be added to the top of the data frame as seen below, and will need to be updated again.
df =
time price volume PV flag PVsum_2 Vsum_2 VWMA_2
dtime
19:00 1980 100.1 6.0 600.60 0.0 0.0 0.0 0.0
18:00 1920 102.7 6.5 667.55 0.0 0.0 0.0 0.0
17:00 1860 108.5 5.4 585.90 0.0 0.0 0.0 0.0
16:00 1800 100.1 6.0 600.60 1.0 1268.15 12.5 101.452000
15:00 1740 102.7 6.5 667.55 1.0 1253.45 11.9 105.331933
14:00 1680 108.5 5.4 585.90 1.0 1249.29 11.7 106.776923
13:00 1620 105.3 6.3 663.39 1.0 1348.83 12.7 106.207087
12:00 1560 107.1 6.4 685.44 1.0 1419.58 13.5 105.154074
11:00 1500 103.4 7.1 734.14 1.0 1416.20 13.8 102.623188
10:00 1440 101.8 6.7 682.06 1.0 1318.80 12.9 102.232558
09:00 1380 102.7 6.2 636.74 1.0 1215.86 11.9 102.173109
08:00 1320 101.6 5.7 579.12 1.0 698.88 6.9 101.286957
07:00 1260 99.8 1.2 119.76 1.0 360.24 3.6 100.066667
06:00 1200 100.2 2.4 240.48 1.0 621.51 6.3 98.652381
05:00 1140 97.7 3.9 381.03 1.0 897.39 9.1 98.614286
04:00 1080 99.3 5.2 516.36 1.0 1407.25 14.1 99.804965
03:00 1020 100.1 8.9 890.89 1.0 1628.89 16.1 101.173292
02:00 960 102.5 7.2 738.00 1.0 1413.35 13.7 103.164234
01:00 900 103.9 6.5 675.35 0.0 0.00 0.0 0.000000

It looks like you're not using pandas in the right way. I'd recommend taking a quick look at a tutorial.
For starters, the following lines
df.insert(df.shape[1], "flag", list(repeat(0.0,len(df))))
df.insert(df.shape[1], "PVsum_2", list(repeat(0.0,len(df))))
df.insert(df.shape[1], "Vsum_2", list(repeat(0.0,len(df))))
df.insert(df.shape[1], "VWMA_2", list(repeat(0.0,len(df))))
can be written much more simply as:
df['flag'] = 0
df['PVsum_2'] = 0
df['Vsum_2'] = 0
df['VWMA_2'] = 0
But it seems you don't really need to initialise those columns at all.
You also don't need the for loop, because you can align two dataframes: your original and a copy in which all rows have been shifted. For example:
df_shift = df.shift(-1)
You can then use normal vectorised calculations to achieve what you want, e.g.:
df['PVsum_2'] = df['PV'] + df_shift['PV']
df['Vsum_2'] = df['volume'] + df_shift['volume']
idx = df['Vsum_2'] != 0  # this is your check whether that value is different from 0
df.loc[idx, 'VWMA_2'] = df.loc[idx, 'PVsum_2'] / df.loc[idx, 'Vsum_2']  # and now use that index to only calculate VWMA_2 where Vsum_2 is non-zero
Hopefully you get the idea and can make small adjustments to make it work exactly as you want.
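For completeness, here is a minimal end-to-end sketch of the same idea using pandas' built-in rolling sums. This is an assumption on my part rather than code from the thread: it relies on the rows being ordered newest-first (as in the question), so the frame is read in reverse when rolling:
rev = df[::-1]  # oldest bar first, so rolling(2) sums the current and previous bar
df['PVsum_2'] = rev['PV'].rolling(2).sum()      # assignment aligns back to df by the dtime index
df['Vsum_2'] = rev['volume'].rolling(2).sum()
# guard against a zero denominator
df['VWMA_2'] = (df['PVsum_2'] / df['Vsum_2']).where(df['Vsum_2'] != 0, 0.0)
df['flag'] = df['PVsum_2'].notna().astype(float)  # 1.0 for every row that got a rolling value
df.fillna(0.0, inplace=True)                      # the oldest row has no previous bar, leave it at 0.0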

Related

Sum values in df1 based on .eq() in df2

I want to sum up some market volumes based on equal prices in, let's say, 6 hours of 2017.
I have a DataFrame, df1 (market_volumes), that contains the market volumes in some areas. Then I have another DataFrame, df2 (mFRR_price), which contains some market prices.
df1
Date NO1 Up NO1 Down NO2 Up ... DK1 Up DK1 Down DK2 Up DK2 Down
35062 31-12-2020 54.0 0.0 214.0 ... 33.0 0.0 31.0 0.0
35063 31-12-2020 3.0 0.0 121.0 ... 125.0 0.0 21.0 0.0
35064 31-12-2020 0.0 -28.0 0.0 ... 0.0 -9.0 0.0 0.0
35065 31-12-2020 0.0 -83.0 0.0 ... 0.0 0.0 0.0 0.0
35066 31-12-2020 0.0 -80.0 0.0 ... 0.0 -55.0 0.0 0.0
35067 31-12-2020 0.0 -42.0 0.0 ... 79.0 0.0 23.0 0.0
df2
Date NO1 Up NO2 Up NO3 Up ... SE4 Up FI Up DK1 Up DK2 Up
35062 31-12-2020 47.4 47.4 27.2 ... 61.1 61.1 94.1 94.1
35063 31-12-2020 31.0 31.0 25.7 ... 58.0 35.3 89.4 89.4
35064 31-12-2020 24.8 24.8 24.8 ... 54.5 24.8 56.7 56.7
35065 31-12-2020 24.8 24.8 24.8 ... 51.2 28.0 52.4 52.4
35066 31-12-2020 24.6 24.6 24.6 ... 45.8 26.6 51.9 51.9
35067 31-12-2020 24.1 24.1 23.3 ... 24.1 24.1 78.7 78.7
Now, I want to sum up the market volumes from df1 IF the values in a row of df2 are equal to the value in the "NO1 Up" column.
i.e., I am looking for a way to end up with a new DataFrame that would result in:
df3
Date NO1 Up NO1 Down NO2 Up ... DK1 Up DK1 Down DK2 Up DK2 Down SUM
35062 31-12-2020 54.0 0.0 214.0 ... 33.0 0.0 31.0 0.0 (54+214)
35063 31-12-2020 3.0 0.0 121.0 ... 125.0 0.0 21.0 0.0 (3+121)
35064 31-12-2020 0.0 -28.0 0.0 ... 0.0 -9.0 0.0 0.0 etc.
35065 31-12-2020 0.0 -83.0 0.0 ... 0.0 0.0 0.0 0.0
35066 31-12-2020 0.0 -80.0 0.0 ... 0.0 -55.0 0.0 0.0
35067 31-12-2020 0.0 -42.0 0.0 ... 79.0 0.0 23.0 0.0
... because it locates the area prices that are equal and sums the market volumes on those locations in the DataFrame.
I've been working on this:
market_volumes['sum'] = mFRR_price.eq(mFRR_price['NO1 Up'], axis=0).mul(mFRR_price['NO1 Up'], axis=0).sum(axis=1)
But it sums the values in df2 and puts them in df1. I need the POSITIONS from df2, but the values from df1.
import pandas as pd
df3['SUM'] = df3['NO1 Up'] + df3['NO2 Up']
You can use .loc and apply boolean indexing.
df1.loc[df2['NO1 Up'] == df2['NO2 Up'], 'SUM'] = df1['NO1 Up'] + df1['NO2 Up']
df1.loc[df2['NO1 Up'] != df2['NO2 Up'], 'SUM'] = 0
The first line goes down df2's index and checks whether the values in columns NO1 Up and NO2 Up are equal. It then creates a column called 'SUM'; the value of this new column depends on the outcome of the preceding boolean. If the boolean is true, then go to the SUM column and do the operation below:
= df1['NO1 Up'] + df1['NO2 Up']
Conversely, if the outcome is false, pandas will insert NaN into your SUM column.
Not sure if you are OK with NaN values; most people are not, so the second line of code is more or less the inverse of the first: if df2['NO1 Up'] != df2['NO2 Up'], insert the integer 0 into the df1 SUM column.
Again, there are probably other ways to accomplish what you want.
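For the general case in the question (summing volumes from every area whose price equals the 'NO1 Up' price), here is a possible sketch. It assumes that every '... Up' column of df2 also exists in df1 and that the row indices of the two frames line up:
up_cols = [c for c in df2.columns if c.endswith('Up')]
mask = df2[up_cols].eq(df2['NO1 Up'], axis=0)          # True where an area price equals the NO1 Up price
df1['SUM'] = df1[up_cols].where(mask, 0).sum(axis=1)   # sum df1's volumes at those positions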

How to scale all data within a dataframe but one column

My data look like:
cycles os1 os2 os3 sm1 sm2 sm3 sm4 sm5 sm6 sm7 sm8 sm9 sm10 sm11 sm12 sm13 sm14 sm15 sm16 sm17 sm18 sm19 sm20 sm21
0 1 -0.0007 -0.0004 100.0 518.67 641.82 1589.70 1400.60 14.62 21.61 554.36 2388.06 9046.19 1.3 47.47 521.66 2388.02 8138.62 8.4195 0.03 392 2388 100.0 39.06 23.4190
1 2 0.0019 -0.0003 100.0 518.67 642.15 1591.82 1403.14 14.62 21.61 553.75 2388.04 9044.07 1.3 47.49 522.28 2388.07 8131.49 8.4318 0.03 392 2388 100.0 39.00 23.4236
2 3 -0.0043 0.0003 100.0 518.67 642.35 1587.99 1404.20 14.62 21.61 554.26 2388.08 9052.94 1.3 47.27 522.42 2388.03 8133.23 8.4178 0.03 390 2388 100.0 38.95 23.3442
3 4 0.0007 0.0000 100.0 518.67 642.35 1582.79 1401.87 14.62 21.61 554.45 2388.11 9049.48 1.3 47.13 522.86 2388.08 8133.83 8.3682 0.03 392 2388 100.0 38.88 23.3739
4 5 -0.0019 -0.0002 100.0 518.67 642.37 1582.85 1406.22 14.62 21.61 554.00 2388.06 9055.15 1.3 47.28 522.19 2388.04 8133.80 8.4294 0.03 393 2388 100.0 38.90 23.4044
and I want to rescale the dataframe except for the cycles column.
I've tried:
# Scale features
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
for col in data[1:].columns:
    data[col] = sc.fit_transform(data[col].values.reshape(-1,1))
but it scales the whole dataframe anyway.
Help would be appreciated.
Thanks!
You can select all columns except the first with DataFrame.iloc: the first : selects all rows and 1: selects all columns except the first; then pass the result to fit_transform:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
data.iloc[:, 1:] = sc.fit_transform(data.iloc[:, 1:])
print (data)
cycles os1 os2 os3 sm1 sm2 sm3 sm4 sm5 \
0 1 0.074613 -1.128152 0.0 0.0 -1.847703 0.732430 -1.349237 0.0
1 2 1.287080 -0.725241 0.0 0.0 -0.276203 1.313986 -0.034171 0.0
2 3 -1.604187 1.692228 0.0 0.0 0.676221 0.263346 0.514636 0.0
3 4 0.727480 0.483494 0.0 0.0 0.676221 -1.163111 -0.691704 0.0
4 5 -0.484987 -0.322329 0.0 0.0 0.771464 -1.146651 1.560476 0.0
sm6 sm7 sm8 sm9 sm10 sm11 sm12 sm13 \
0 0.0 0.765578 -0.422577 -0.822800 0.0 1.050958 -1.607627 -1.209416
1 0.0 -1.617089 -1.267731 -1.339486 0.0 1.198981 -0.005169 0.950255
2 0.0 0.374977 0.422577 0.822312 0.0 -0.429265 0.356676 -0.777482
3 0.0 1.117119 1.690309 -0.020960 0.0 -1.465421 1.493904 1.382189
4 0.0 -0.640586 -0.422577 1.360934 0.0 -0.355254 -0.237784 -0.345547
sm14 sm15 sm16 sm17 sm18 sm19 sm20 sm21
0 1.866394 0.265372 0.0 0.204124 0.0 0.0 1.549015 0.867102
1 -1.140246 0.795254 0.0 0.204124 0.0 0.0 0.637830 1.020631
2 -0.406508 0.192136 0.0 -1.837117 0.0 0.0 -0.121491 -1.629404
3 -0.153495 -1.944623 0.0 0.204124 0.0 0.0 -1.184541 -0.638144
4 -0.166145 0.691862 0.0 1.224745 0.0 0.0 -0.880812 0.379816
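An equivalent approach, if you prefer selecting the columns by name rather than by position (a small sketch using the 'cycles' column name from the question):
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
cols = data.columns.drop('cycles')        # every column except 'cycles'
data[cols] = sc.fit_transform(data[cols])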

Pivot group by data

I'm trying to transpose and reshape grouped data.
Current group-by data:
MTD-Total Revenue YTD-Total Revenue MTD-Room Revenue YTD-Room Revenue MTD-Room Nights YTD-Room Nights MTD-ADR YTD-ADR MTD-OCC% YTD-OCC%
Market Group
Aff 0.0 0.0 2026136.99 21546922.96 857.0 8650.0 2457.02 2551.87 4.99 4.16
Air 0.0 0.0 2809312.53 32534587.15 925.0 9684.0 2392.08 3016.00 2.69 2.33
BAR 0.0 0.0 470866.23 8341596.95 131.0 2481.0 3189.75 3133.08 0.76 1.19
Cas 0.0 0.0 4801710.10 55466024.12 1652.0 18566.0 2365.23 2585.25 1.92 1.79
Com 0.0 0.0 3873151.63 43857524.55 1088.0 11980.0 2449.43 2632.57 6.34 5.76
Cor 0.0 0.0 7104841.79 88326080.23 2314.0 26836.0 1552.74 2919.07 4.14 3.97
Pro 0.0 0.0 335358.36 1907348.23 97.0 562.0 3457.30 3393.86 2.26 1.08
Soc 0.0 0.0 12706.96 82957.59 4.0 25.0 1588.37 3315.74 0.04 0.02
TA 0.0 0.0 1016565.12 15563472.77 416.0 6797.0 2412.55 2229.46 4.84 6.54
Wal 0.0 0.0 277267.66 3786378.41 68.0 812.0 4077.47 4663.03 1.58 1.56
Code I ran:
pd.DataFrame(df.values.reshape(-1,5))
df.reset_index().pivot('Market Group', 'MTD-Total Revenue', 'YTD-Total Revenue')
I also looked at how the data appears when simply transposed with df.T.
The answer to this would be:
df = pd.melt(df, id_vars=['Market Group'],
             value_vars=['MTD-Total Revenue', 'YTD-Total Revenue',
                         'MTD-Room Revenue', 'YTD-Room Revenue',
                         'MTD-Room Nights', 'YTD-Room Nights',
                         'MTD-ADR', 'YTD-ADR', 'MTD-OCC%', 'YTD-OCC%'])
This keeps the headers, unlike using unstack or stack.
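Note that in the grouped output shown above, Market Group is the index rather than a regular column, so melt may need a reset_index first (a small sketch under that assumption):
import pandas as pd
df = df.reset_index()  # turn the 'Market Group' index back into a column
df = pd.melt(df, id_vars=['Market Group'],
             value_vars=[c for c in df.columns if c != 'Market Group'])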

Pandas/Python: interpolation of multiple columns based on values specified for one reference column

df
Out[1]:
PRES HGHT TEMP DWPT RELH MIXR DRCT SKNT THTA THTE THTV
0 978.0 345 17.0 16.5 97 12.22 0 0 292.0 326.8 294.1
1 977.0 354 17.8 16.7 93 12.39 1 0 292.9 328.3 295.1
2 970.0 416 23.4 15.4 61 11.47 4 2 299.1 332.9 301.2
3 963.0 479 24.0 14.0 54 10.54 8 3 300.4 331.6 302.3
4 948.7 610 23.0 13.4 55 10.28 15 6 300.7 331.2 302.5
5 925.0 830 21.4 12.4 56 9.87 20 5 301.2 330.6 303.0
6 916.0 914 20.7 11.7 56 9.51 20 4 301.3 329.7 303.0
7 884.0 1219 18.2 9.2 56 8.31 60 4 301.8 326.7 303.3
8 853.1 1524 15.7 6.7 55 7.24 35 3 302.2 324.1 303.5
9 850.0 1555 15.4 6.4 55 7.14 20 2 302.3 323.9 303.6
10 822.8 1829 13.3 5.6 60 6.98 300 4 302.9 324.0 304.1
How do I interpolate the values of all the columns at specified PRES (pressure) values, say PRES=[950, 900, 875]? Is there an elegant, pandas-native way to do this?
The only way I can think of is to first insert an all-NaN row for each specified PRES value in a loop, then set PRES as the index and use pandas' native interpolate option:
df.interpolate(method='index', inplace=True)
Is there a more elegant solution?
Use your solution without the loop: reindex by the union of the original index values with the PRES list. Note this only works if all index values are unique:
PRES=[950, 900, 875]
df = df.set_index('PRES')
df = df.reindex(df.index.union(PRES)).sort_index(ascending=False).interpolate(method='index')
print (df)
HGHT TEMP DWPT RELH MIXR DRCT SKNT THTA THTE THTV
978.0 345.0 17.0 16.5 97.0 12.22 0.0 0.0 292.0 326.8 294.1
977.0 354.0 17.8 16.7 93.0 12.39 1.0 0.0 292.9 328.3 295.1
970.0 416.0 23.4 15.4 61.0 11.47 4.0 2.0 299.1 332.9 301.2
963.0 479.0 24.0 14.0 54.0 10.54 8.0 3.0 300.4 331.6 302.3
950.0 1829.0 13.3 5.6 60.0 6.98 300.0 4.0 302.9 324.0 304.1
948.7 610.0 23.0 13.4 55.0 10.28 15.0 6.0 300.7 331.2 302.5
925.0 830.0 21.4 12.4 56.0 9.87 20.0 5.0 301.2 330.6 303.0
916.0 914.0 20.7 11.7 56.0 9.51 20.0 4.0 301.3 329.7 303.0
900.0 1829.0 13.3 5.6 60.0 6.98 300.0 4.0 302.9 324.0 304.1
884.0 1219.0 18.2 9.2 56.0 8.31 60.0 4.0 301.8 326.7 303.3
875.0 1829.0 13.3 5.6 60.0 6.98 300.0 4.0 302.9 324.0 304.1
853.1 1524.0 15.7 6.7 55.0 7.24 35.0 3.0 302.2 324.1 303.5
850.0 1555.0 15.4 6.4 55.0 7.14 20.0 2.0 302.3 323.9 303.6
822.8 1829.0 13.3 5.6 60.0 6.98 300.0 4.0 302.9 324.0 304.1
If the PRES column can contain non-unique values, use concat with sort_index instead:
PRES=[950, 900, 875]
df = df.set_index('PRES')
df = (pd.concat([df, pd.DataFrame(index=PRES)])
.sort_index(ascending=False)
.interpolate(method='index'))
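If you only need the newly interpolated rows afterwards, you can select them back out of the result (a small sketch, assuming the unique-index variant above):
print(df.loc[PRES])  # just the rows at PRES = 950, 900 and 875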

Read ASCII-File with missing data fields with numpy.genfromtxt

My data file is like this:
abb
sdsdfmn
sfdf sdf
2011-12-05 11:00 1.0 9.0
2011-12-05 12:00 44.9 2.0
2011-12-05 13:00 66.8 4.2
2011-12-05 14:00 22.8 1.0 26.2 45.2 2.3
2011-12-05 15:00 45.7 2.0 45.0 45.6 1.4
2011-12-05 16:00 23.2 3.0 456.2 11.7 1.5
2011-12-05 17:00 67.4 4.0 999.1 45.8 0.9
2011-12-05 18:00 34.4 1.2
2011-12-05 19:00 12.4 4.2 345.1 11.1 7.6
I used numpy genfromtxt:
data = np.genfromtxt('data.txt', usecols=(0,1,3), skip_header=4, dtype=[('date','S10'),('hour','S5'),('myfloat','f8')])
The problem is that column 3 has some empty values (at the beginning and later on), so genfromtxt reads from the wrong column.
I tried the delimiter parameter, because all float columns have a fixed width (delimiter=[10,5,5]), but it also fails.
Is there a workaround?
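One possible workaround (a sketch of my own, not an answer from this thread): pandas.read_fwf infers fixed-width field boundaries and leaves NaN where trailing fields are missing, instead of shifting values into the wrong columns. The skiprows value and column names below are assumptions based on the sample shown:
import pandas as pd
data = pd.read_fwf('data.txt', skiprows=3, header=None,
                   names=['date', 'hour', 'f1', 'f2', 'f3', 'f4', 'f5'])
print(data[['date', 'hour', 'f2']])  # the fields genfromtxt's usecols=(0,1,3) targeted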
