Pandas groupby and calculated sum - python

I am currently translating some R scripts to Python, but I am struggling with the following line:
return(trackTable[, .(
  AVERAGE_WIND_COMPONENT = sum(TRACK_WIND_COMPONENT*GROUND_DIST, na.rm = T)/sum(GROUND_DIST, na.rm = T) # CHECK!!!!!
), by=KEY_COLUMN])
Now I tried to rewrite the R code in Python:
table['temp'] = (table['track_wind_component'] * table['ground_dist']) / table['ground_dist']
AVERAGE_WIND_COMPONENT = table.groupby(['KEY_COLUMN'])['temp'].sum()
AVERAGE_WIND_COMPONENT = pd.DataFrame({'KEY_COLUMN':AVERAGE_WIND_COMPONENT.index, 'AVERAGE_WIND_COMPONENT':AVERAGE_WIND_COMPONENT.values})
But my results for AVERAGE_WIND_COMPONENT are wrong... What did I translate incorrectly? It is probably the groupby, or the way I build my temp column.
Example df:
KEY_COLUMN track_wind_component ground_dist
0 xyz -0.000000 2.262407
1 xyz 0.000000 9.769840
2 xyz -135.378229 38.581616
3 xyz 11.971863 30.996997
4 xyz -78.208083 45.404430
5 xyz -88.718762 48.589553
6 xyz -118.302506 22.193426
7 xyz -71.033648 76.602917
8 xyz -68.369886 11.092901
9 xyz -65.706124 6.210328
10 xyz -60.822561 17.444752
11 xyz 39.630277 18.082869
12 xyz 102.477706 35.175366
13 xyz 43.061773 8.793499
14 xyz -71.036785 15.289568
15 xyz 65.246215 49.247986
16 xyz -29.249612 1.043781
17 xyz -25.848495 11.490416
18 xyz -11.223688 NaN
expected result for this KEY_COLUMN: -36.8273304

OK, your expected result makes sense now.
First, create a function that uses np.sum(); this is the equivalent of R's sum(value, na.rm = T):
import numpy as np
import pandas as pd

def my_agg(df):
    names = {
        'result': np.sum(df['track_wind_component'] * df['ground_dist']) / np.sum(df['ground_dist'])
    }
    return pd.Series(names, index=['result'])

df.groupby('KEY_COLUMN').apply(my_agg)
out:
result
KEY_COLUMN
xyz -36.827331
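If you need KEY_COLUMN back as a regular column (as in the last line of your own attempt), a small follow-up sketch:
result = df.groupby('KEY_COLUMN').apply(my_agg).reset_index()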
What was wrong with your code:
table['temp'] = (table['track_wind_component'] * table['ground_dist']) / table['ground_dist']
# this is just creating a column that is the exact same as
# table['track_wind_component'] because, for example, (x*y)/y = x
AVERAGE_WIND_COMPONENT = table.groupby(['KEY_COLUMN'])['temp'].sum()
# you are now essentially just grouping and summing the track_wind_column
What the R code is doing is taking the sum of (table['track_wind_component'] * table['ground_dist']) divided by the sum of (table['ground_dist']),
all grouped by the KEY_COLUMN.
The R code also ignores NaN values (na.rm = T), which is why I used np.sum().
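For what it's worth, the same weighted average can be computed without apply(); a minimal sketch, assuming the question's column names (Series.sum() skips NaN by default, matching na.rm = T):
# numerator and denominator of the weighted average, grouped by key
num = (table['track_wind_component'] * table['ground_dist']).groupby(table['KEY_COLUMN']).sum()
den = table.groupby('KEY_COLUMN')['ground_dist'].sum()
AVERAGE_WIND_COMPONENT = (num / den).rename('AVERAGE_WIND_COMPONENT').reset_index()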

Related

Process and return data from a group of a group

I have a pandas dataframe with an ID column, 2 categorical variables, and 2 numeric variables.
ID  Trimester  State  Tax  rate
45  T1         NY     20   0.25
23  T3         FL     34   0.30
35  T2         TX     45   0.60
I would like to get a new table of the form:
ID  Trimester  State  Tax  rate  Tax_per_state_per_trimester
45  T1         NY     20   0.25  H
23  T3         FL     34   0.30  L
35  T2         TX     45   0.60  M
where the new variable 'Tax_per_state_per_trimester' is a categorical variable representing the tertiles of the corresponding subgroup, with L = first tertile, M = second tertile, H = last tertile.
I understand I can do a double grouping with:
df.groupby(['State', 'Trimester'])
but I don't know where to go from there.
I guess apply or transform with the quantile function should prove useful, but how?
Can you take a look and see if this gives you the results you want?
import pandas as pd

df = pd.read_excel('Tax.xlsx')

def mx(tri, state):
    return df[(df['Trimester'].eq(tri)) & (df['State'].eq(state))] \
        .groupby(['Trimester', 'State'])['Tax'].apply(max)[0]

for i, v in df.iterrows():
    t = v['Tax'] / mx(v['Trimester'], v['State'])
    df.loc[i, 'Tax_per_state_per_trimester'] = 'L' if t < 1/3 else 'M' if t < 2/3 else 'H'
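A vectorized alternative sketch, assuming the same fraction-of-group-max cut-offs as above, replaces the row loop with groupby().transform():
# fraction of each row's (State, Trimester) group maximum, computed without a loop
frac = df['Tax'] / df.groupby(['State', 'Trimester'])['Tax'].transform('max')
df['Tax_per_state_per_trimester'] = pd.cut(
    frac, bins=[0, 1/3, 2/3, 1], labels=['L', 'M', 'H'], include_lowest=True
)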

Product scoring in pandas dataframe

I have a dataframe of product IDs. I would like to find the best product by scoring each one. For each variable, a higher value means a better product score, except Returns, where more returns mean a lower score. I also need to assign different weights to the variables ShippedRevenue and Returns, whose importance may be increased by 20 percent.
A scoring function could look like this:
Score = ShippedUnits + 1.2*ShippedRevenue + OrderedUnits - 1.2*Returns + View + Stock
where 0 <= Score <= 100
Please help. Thank you.
df_product = pd.DataFrame({
    'ProductId': ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10'],
    'ShippedUnits': [6, 8, 0, 4, 27, 3, 4, 14, 158, 96],
    'ShippedRevenue': [268, 1705, 1300, 950, 1700, 33380, 500, 2200, 21000, 24565],
    'OrderedUnits': [23, 78, 95, 52, 60, 76, 68, 92, 34, 76],
    'Returns': [0, 0, 6, 0, 2, 5, 6, 5, 2, 13],
    'View': [0, 655, 11, 378, 920, 12100, 75, 1394, 12368, 14356],
    'Stock': [24, 43, 65, 27, 87, 98, 798, 78, 99, 231]
})
df_product['score'] = (df_product['ShippedUnits'] + 1.2*df_product['ShippedRevenue']
                       + df_product['OrderedUnits'] - 1.2*df_product['Returns']
                       + df_product['View'] + df_product['Stock'])
df_product['score'] = ((df_product['score'] - df_product['score'].min())
                       / (df_product['score'].max() - df_product['score'].min()) * 100)
df_product
df["Score"] = df["ShippedUnits"] + df["OrderedUnits"] \
+ df["View"] + df["Stock"] \
+ 1.2 * df["ShippedRevenue"] \
- 1.2 * df["Returns"]
df["Norm1"] = df["Score"] / df["Score"].max() * 100
df["Norm2"] = df["Score"] / df["Score"].sum() * 100
df["Norm3"] = (df["Score"] - df["Score"].min()) / (df["Score"].max() - df["Score"].min()) * 100
>>> df[["ProductId", "Score", "Norm1", "Norm2", "Norm3"]]
ProductId Score Norm1 Norm2 Norm3
0 1 374.6 0.715883 0.250040 0.000000
1 2 2830.0 5.408298 1.888986 4.726249
2 3 1723.8 3.294284 1.150613 2.596993
3 4 1601.0 3.059606 1.068646 2.360622
4 5 3131.6 5.984673 2.090300 5.306781
5 6 52327.0 100.000000 34.927558 100.000000
6 7 1537.8 2.938827 1.026460 2.238973
7 8 4212.0 8.049382 2.811452 7.386377
8 9 37856.6 72.346208 25.268763 72.146811
9 10 44221.4 84.509718 29.517180 84.398026
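All three normalizations preserve the ranking; only Norm3 is guaranteed to span 0 to 100. A small usage sketch, assuming the Norm3 column computed above:
# product '6' has the highest score in the sample output above
best = df.loc[df["Norm3"].idxmax(), "ProductId"]
print(best)  # '6'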

get product of last 12 months using pandas or last 12 records in my df

I have a dataset of end-of-month values. I need to multiply the last 12 values and save the result in a 'val' column. How can I do that?
I tried to loop using shift and also Grouper, but it did not work.
My code:
filtered_df = df.copy()
filtered_df = filtered_df[filtered_df['monthly'].notnull()]
for index, row in filtered_df.iterrows():
    if index > 12:
        pre_1 = row.shift(1)
        pre_2 = row.shift(2)
        pre_3 = row.shift(3)
        pre_4 = row.shift(4)
        pre_5 = row.shift(5)
        pre_6 = row.shift(-6)
        pre_7 = row.shift(-7)
        pre_8 = row.shift(-8)
        pre_9 = row.shift(-9)
        pre_10 = row.shift(-10)
        pre_11 = row.shift(-11)
        pre_12 = row.shift(-12)
        all_vals = (pre_1['monthly'] * pre_2['monthly'] * pre_3['monthly'] * pre_4['monthly']
                    * pre_5['monthly'] * pre_6['monthly'] * pre_7['monthly'] * pre_8['monthly']
                    * pre_9['monthly'] * pre_10['monthly'] * pre_11['monthly'] * pre_12['monthly'])
        row['generic_momentum'] = all_vals
But I'm getting NaN values, and it is not picking the right column.
I also tried this but it is not working:
df.tail(12).prod()
Dataset
Date monthly val
31/01/11 0.959630357
28/02/11 0.939530957
31/03/11 1.024870166
31/05/11 0.956831905
30/06/11 1.06549785
30/09/11 0.903054795
31/10/11 1.027355404
30/11/11 0.893328025
31/01/12 1.015152156
29/02/12 1.05621569
30/04/12 1.116884715
31/05/12 0.878896927
31/07/12 0.950743984
31/08/12 1.094999121
31/10/12 0.94769417
30/11/12 1.073116682
31/12/12 0.986747164
31/01/13 0.975354237
28/02/13 0.888879072
30/04/13 0.940063889
31/05/13 1.017259688
31/07/13 0.990201439
30/09/13 1.018815133
31/10/13 1.088671085
31/12/13 1.104019842
31/01/14 0.989041096
28/02/14 1.017825485
31/03/14 0.960047355
30/04/14 1.064095477
30/06/14 1.023850957
31/07/14 1.08941545
30/09/14 1.065516629
31/10/14 0.984540626
31/12/14 1.023386988
28/02/15 1.150857956
31/03/15 1.01209752
30/04/15 1.00295515
30/06/15 1.043231635
31/07/15 1.042820448
31/08/15 1.241814907
30/09/15 1.014741935
30/11/15 0.980878108
31/12/15 0.995258408
29/02/16 1.0507026
31/03/16 1.033018209
31/05/16 0.931798992
30/06/16 1.032879184
31/08/16 0.881060764
30/09/16 1.000240668
30/11/16 0.849364675
31/01/17 1.075015059
28/02/17 0.933706879
31/03/17 1.036073194
31/05/17 1.203092255
30/06/17 0.956726321
31/07/17 1.010709024
31/08/17 1.102072394
31/10/17 0.99223153
30/11/17 1.088148242
31/01/18 0.982730721
28/02/18 1.102215081
IIUC: use a combination of pd.Series.rolling and np.prod
df['monthly'].rolling(12).apply(np.prod)
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
10 NaN
11 0.821766
12 0.814156
13 0.948878
14 0.877424
15 0.984058
16 0.911327
17 0.984289
....
An alternative is to use cumprod and shift: the ratio telescopes, so the cumulative product divided by the cumulative product 12 rows earlier is exactly the product of the last 12 rows.
df['monthly'].cumprod().pipe(lambda s: s / s.shift(12))
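A quick equivalence check, sketched against the same 'monthly' column (pandas aligns the two series on their index before subtracting):
import numpy as np

rolled = df['monthly'].rolling(12).apply(np.prod, raw=True)  # raw=True is a bit faster
ratio = df['monthly'].cumprod().pipe(lambda s: s / s.shift(12))
assert np.allclose((rolled - ratio).dropna(), 0)  # both give the trailing 12-row product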

Multiplying data within columns python

I've been working on this all morning and for the life of me cannot figure it out. I'm sure this is very basic, but I've become so frustrated my mind is being clouded. I'm attempting to calculate the total return of a portfolio of securities at each date (monthly).
The formula is (1 + r1) * (1 + r2) * ... * (1 + rt) - 1
Here is what I'm working with:
Adj_Returns = Adj_Close/Adj_Close.shift(1)-1
Adj_Returns['Risk Parity Portfolio'] = (Adj_Returns.loc['2003-01-31':]*Weights.shift(1)).sum(axis = 1)
Adj_Returns
SPY IYR LQD Risk Parity Portfolio
Date
2002-12-31 NaN NaN NaN 0.000000
2003-01-31 -0.019802 -0.014723 0.000774 -0.006840
2003-02-28 -0.013479 0.019342 0.015533 0.011701
2003-03-31 -0.001885 0.010015 0.001564 0.003556
2003-04-30 0.088985 0.045647 0.020696 0.036997
For example, with 2002-12-31 being base 100 for risk parity, I want 2003-01-31 to be 99.316 (100 * (1-0.006840)), 2003-02-28 to be 100.478 (99.316 * (1+ 0.011701)) so on and so forth.
Thanks!!
You want to use pd.DataFrame.cumprod
df.add(1).cumprod().sub(1).sum(1)
Consider the dataframe of returns df
np.random.seed([3,1415])
df = pd.DataFrame(np.random.normal(.025, .03, (10, 5)), columns=list('ABCDE'))
df
A B C D E
0 -0.038892 -0.013054 -0.034115 -0.042772 0.014521
1 0.024191 0.034487 0.035463 0.046461 0.048123
2 0.006754 0.035572 0.014424 0.012524 -0.002347
3 0.020724 0.047405 -0.020125 0.043341 0.037007
4 -0.003783 0.069827 0.014605 -0.019147 0.056897
5 0.056890 0.042756 0.033886 0.001758 0.049944
6 0.069609 0.032687 -0.001997 0.036253 0.009415
7 0.026503 0.053499 -0.006013 0.053447 0.047013
8 0.062084 0.029664 -0.015238 0.029886 0.062748
9 0.048341 0.065248 -0.024081 0.019139 0.028955
We can see the cumulative return or total return is
df.add(1).cumprod().sub(1)
A B C D E
0 -0.038892 -0.013054 -0.034115 -0.042772 0.014521
1 -0.015641 0.020983 0.000139 0.001702 0.063343
2 -0.008993 0.057301 0.014565 0.014247 0.060847
3 0.011544 0.107423 -0.005853 0.058206 0.100105
4 0.007717 0.184750 0.008666 0.037944 0.162699
5 0.065046 0.235405 0.042847 0.039769 0.220768
6 0.139183 0.275786 0.040764 0.077464 0.232261
7 0.169375 0.344039 0.034505 0.135051 0.290194
8 0.241974 0.383909 0.018742 0.168973 0.371151
9 0.302013 0.474207 -0.005791 0.191346 0.410852
Plot it
df.add(1).cumprod().sub(1).plot()
Add sum of returns to new column
df.assign(Portfolio=df.add(1).cumprod().sub(1).sum(1))
A B C D E Portfolio
0 -0.038892 -0.013054 -0.034115 -0.042772 0.014521 -0.114311
1 0.024191 0.034487 0.035463 0.046461 0.048123 0.070526
2 0.006754 0.035572 0.014424 0.012524 -0.002347 0.137967
3 0.020724 0.047405 -0.020125 0.043341 0.037007 0.271425
4 -0.003783 0.069827 0.014605 -0.019147 0.056897 0.401777
5 0.056890 0.042756 0.033886 0.001758 0.049944 0.603835
6 0.069609 0.032687 -0.001997 0.036253 0.009415 0.765459
7 0.026503 0.053499 -0.006013 0.053447 0.047013 0.973165
8 0.062084 0.029664 -0.015238 0.029886 0.062748 1.184749
9 0.048341 0.065248 -0.024081 0.019139 0.028955 1.372626
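To get the base-100 index described in the question, a minimal sketch assuming the Adj_Returns frame from above:
# 100 * cumulative product of (1 + r): 2002-12-31 stays at 100.000,
# 2003-01-31 becomes 99.316, 2003-02-28 becomes 100.478, and so on.
index_100 = 100 * Adj_Returns['Risk Parity Portfolio'].add(1).cumprod()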

transitioning from r to Python - dplyr-like operations in pandas

I'm used to using R. If I had this in R I would do something like this:
library(dplyr)
df = df %>%
  mutate(
    XYZ = sum(x + y + z),
    weekcheck = ifelse(week > 3 & X*2 > 4, 'yes', week), # multi-step if statement
    XYZ_plus_3 = XYZ + 3
  )
df = pd.DataFrame({
    'x': np.random.uniform(1., 168., 20),
    'y': np.random.uniform(7., 334., 20),
    'z': np.random.uniform(1.7, 20.7, 20),
    'month': [5, 6, 7, 8] * 5,
    'week': np.random.randint(1, 4, 20)
})
I know there's assign, but I can't figure out the syntax for chaining these operations together, particularly the ifelse sort of thing.
Can anyone break this down for me? Even if you don't know R, I think the code is fairly self-explanatory.
You'd need two assign calls for that and the syntax is not as pretty:
(df.assign(XYZ=df[['x', 'y', 'z']].sum(axis=1),
           weekcheck=np.where((df['week'] > 3) & (df['x']*2 > 4), 'yes', df['week']))
   .assign(XYZ_plus_3=lambda d: d['XYZ'] + 3))
Not sure if this is what you're looking for, but I would do it like this in pandas. In particular, I think np.where() is a direct analog to R's ifelse (I don't know R very well, though). There may be a similar way to do this in pure pandas, but I've always found np.where() to be the fastest and most general approach.
df['xyz'] = df.x + df.y + df.z
df['wcheck'] = np.where( (df.week>2) & (df.x*2>4), 'yes', df.week )
df['xyz_p3'] = df.xyz + 3
week x y z xyz wcheck xyz_p3
0 2 1.968759 31.537797 18.984273 52.490830 2 55.490830
1 1 108.809481 295.126414 14.250059 418.185954 1 421.185954
2 3 124.094087 201.229196 15.346794 340.670077 yes 343.670077
3 2 122.874717 110.675192 6.179610 239.729519 2 242.729519
4 1 74.909326 12.484076 4.921888 92.315290 1 95.315290
You could do some or all of this as a method chain (see the sketch below), although I don't see a particular advantage here beyond making the code a little more compact and clean (not that I'm knocking that!). Much of the difference is just three lines vs. "one line" spread across three lines.
YMMV, but a lot of this comes down to the specific example, and in this case I would just do it in three separate lines of pandas rather than figure out how to do it as a method chain with assign or pipe.
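For completeness, a sketch of the same three steps as one assign chain; since pandas 0.23, later keyword arguments can refer to columns created by earlier ones via a lambda:
out = df.assign(
    xyz=lambda d: d.x + d.y + d.z,
    wcheck=lambda d: np.where((d.week > 2) & (d.x * 2 > 4), 'yes', d.week),
    xyz_p3=lambda d: d.xyz + 3,  # references the xyz column created just above
)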
Here is how you'd do it with datar, a python package that ports dplyr and other packages into python, and follows their API design:
In [1]: from datar.all import *
In [2]: df = tibble(
   ...:     x=runif(20, 1., 168.),
   ...:     y=runif(20, 7., 334.),
   ...:     z=runif(20, 1.7, 20.7),
   ...:     month=[5, 6, 7, 8] * 5,
   ...:     week=rnorm(20, 1, 4)
   ...: )
In [3]: df
Out[3]:
x y z month week
0 122.186045 210.469468 3.685605 5 2.832896
1 165.584417 328.907586 8.535625 6 -0.277586
2 47.149510 205.991526 8.302771 7 -3.212263
3 88.110641 137.452398 11.920447 8 -3.307180
4 157.378195 215.928386 19.047386 5 0.442600
5 115.881867 122.972666 20.367191 6 -2.810770
6 70.939125 303.212096 2.864381 7 1.676704
7 124.173937 159.179588 16.231502 8 -1.431897
8 67.049824 266.658257 2.483528 5 -4.815040
9 165.531614 315.180892 13.855680 6 4.094581
10 59.077945 87.218260 10.638067 7 -0.204437
11 160.982998 320.093002 9.470513 8 -1.877375
12 23.520600 143.737008 1.989666 5 2.344435
13 26.028670 261.396529 19.844300 6 1.956208
14 100.008859 261.133030 15.947817 7 3.202203
15 102.298540 29.667462 4.470771 8 -4.747893
16 38.565169 239.578190 11.088213 5 0.268926
17 73.553130 49.714928 4.449677 6 -3.592172
18 74.467545 16.350189 8.195442 7 3.451417
19 162.439950 189.721896 7.729186 8 4.486240
In [4]: df >> rowwise() >> mutate(
   ...:     XYZ=sum(f.x + f.y + f.z),
   ...:     weekcheck=if_else((f.week > 3) & (f.x*2 > 4), 'yes', f.week),
   ...:     XYZ_plus_3=f.XYZ + 3
   ...: )
Out[4]:
x y z month week XYZ weekcheck XYZ_plus_3
0 122.186045 210.469468 3.685605 5 2.832896 336.341118 2.832896 339.341118
1 165.584417 328.907586 8.535625 6 -0.277586 503.027628 -0.277586 506.027628
2 47.149510 205.991526 8.302771 7 -3.212263 261.443807 -3.212263 264.443807
3 88.110641 137.452398 11.920447 8 -3.307180 237.483487 -3.30718 240.483487
4 157.378195 215.928386 19.047386 5 0.442600 392.353967 0.4426 395.353967
5 115.881867 122.972666 20.367191 6 -2.810770 259.221724 -2.81077 262.221724
6 70.939125 303.212096 2.864381 7 1.676704 377.015603 1.676704 380.015603
7 124.173937 159.179588 16.231502 8 -1.431897 299.585026 -1.431897 302.585026
8 67.049824 266.658257 2.483528 5 -4.815040 336.191610 -4.81504 339.191610
9 165.531614 315.180892 13.855680 6 4.094581 494.568187 yes 497.568187
10 59.077945 87.218260 10.638067 7 -0.204437 156.934272 -0.204437 159.934272
11 160.982998 320.093002 9.470513 8 -1.877375 490.546514 -1.877375 493.546514
12 23.520600 143.737008 1.989666 5 2.344435 169.247274 2.344435 172.247274
13 26.028670 261.396529 19.844300 6 1.956208 307.269499 1.956208 310.269499
14 100.008859 261.133030 15.947817 7 3.202203 377.089707 yes 380.089707
15 102.298540 29.667462 4.470771 8 -4.747893 136.436772 -4.747893 139.436772
16 38.565169 239.578190 11.088213 5 0.268926 289.231572 0.268926 292.231572
17 73.553130 49.714928 4.449677 6 -3.592172 127.717735 -3.592172 130.717735
18 74.467545 16.350189 8.195442 7 3.451417 99.013176 yes 102.013176
19 162.439950 189.721896 7.729186 8 4.486240 359.891031 yes 362.891031
[Rowwise: []]
I am the author of the package. Feel free to submit issues or ask me questions about using it.
