I am trying to compute a 756-day rolling Beta for stocks, calculated as the covariance of each stock's return with the index return, divided by the variance of the index return.
When I run it for a DataFrame with a single stock, without the groupby argument, it works and creates the Beta column. However, I need to do it for every stock in my DataFrame, and groupby was a solution I read about here on Stack Overflow, but I have not been able to make it work yet.
This is the line that causes the error
df2['Beta 756d'] = df2.groupby('CODNEG').apply(df2['Retorno_Ação'].rolling(window=756,center=False).cov(df2['Retorno_Ibov']) / df2['Retorno_Ibov'].rolling(window=756,center=False).var())
--
This error comes up when the code gets to the line above
TypeError: 'Series' objects are mutable, thus they cannot be hashed
--
Here is an example of df2
TIPREG DATE CODBDI CODNEG TPMERC NOMRES ESPECI PRAZOT MODREF PREABE PREMAX PREMIN PREMED PREULT PREOFC PREOFV TOTNEG QUATOT VOLTOT PREEXE INDOPC DATVEN FATCOT PTOEXE CODISI DISMES Retorno_Ação Retorno_Ibov
0 1 1995-01-02 2 ACE 3 10 ACESITA ON *INT R$ 6300 6300 6300 6300 6300 6300 6500 1 200000 1260000 0 0 99991231 1000 0 ACESACON 119 NaN NaN
105 1 1995-01-02 2 PET 3 10 PETROBRAS ON * R$ 6400 6400 6250 6287 6250 6250 6750 2 40000 251500 0 0 99991231 1000 0 PETRACON 132 NaN NaN
106 1 1995-01-02 2 PET 4 10 PETROBRAS PN * R$ 10700 11000 10400 10599 10500 10500 10650 234 31210000 330805170 0 0 99991231 1000 0 PETRACPN 133 NaN NaN
107 1 1995-01-02 2 BRD 4 10 PETROBRAS BR PN * R$ 4600 4600 4540 4591 4540 4333 4500 13 18200000 83566000 0 0 99991231 1000 0 BRDTACPN 102 NaN NaN
108 1 1995-01-02 2 PTN 4 10 PETTENATI PN * R$ 5189 5189 5189 5189 5189 4700 5280 1 5000000 25945000 0 0 99991231 1000 0 BRPTNTACNPR3 21 NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1826575 1 2020-08-19 2 HGTX3 10 CIA HERING ON NM R$ 1576 1598 1547 1575 1575 1574 1575 14582 4640200 7309217700 0 0 99991231 1 0 BRHGTXACNOR9 151 0.005105 0.000788
1826576 1 2020-08-19 2 HOME34 10 HOME DEPOT DRN R$ 77798 78612 77798 78413 78612 55000 0 2 1720 134870760 0 0 99991231 1 0 BRHOMEBDR002 133 0.005719 0.000788
1826577 1 2020-08-19 2 HONB34 10 HONEYWELL DRN ED R$ 86347 86347 86347 86347 86347 0 0 1 100 8634700 0 0 99991231 1 0 BRHONBBDR006 127 0.085375 0.000788
1826579 1 2020-08-19 2 HYPE3 10 HYPERA ON NM R$ 3248 3265 3167 3199 3175 3175 3177 15332 3082700 9864018200 0 0 99991231 1 0 BRHYPEACNOR0 122 -0.034367 0.000788
1826517 1 2020-08-19 2 EMAE4 10 EMAE PN R$ 2869 2954 2831 2888 2952 2832 2953 10 1100 3177600 0 0 99991231 1 0 BREMAEACNPR1 114 0.040903 0.000788
I am using the last two columns (Retorno_Ação, Retorno_Ibov) to calculate the Cov and the Var to generate the Beta.
Can anyone tell me what is causing the error?
This line works fine when the DF has only one stock:
df2['Beta 756d'] = df2['Retorno_Ação'].rolling(window=756,center=False).cov(df2['Retorno_Ibov']) / df2['Retorno_Ibov'].rolling(window=756,center=False).var()
--
The error happens when I use df2['Beta 756d'] = df2.groupby('CODNEG').apply(
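For clarity, here is a sketch of the per-group calculation I am trying to express, with the rolling covariance and variance computed inside each CODNEG group via a function passed to apply (not my original code, just an illustration of the intent; group_keys=False keeps the original index so the result can be assigned back):

def rolling_beta(g, window=756):
    # g is the sub-DataFrame for a single CODNEG
    cov = g['Retorno_Ação'].rolling(window=window, center=False).cov(g['Retorno_Ibov'])
    var = g['Retorno_Ibov'].rolling(window=window, center=False).var()
    return cov / var

df2['Beta 756d'] = df2.groupby('CODNEG', group_keys=False).apply(rolling_beta)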
How can I merge and sum the columns with the same name?
So the output should be one column named Canada, containing the sum of the four columns named Canada.
Country/Region Brazil Canada Canada Canada Canada
Week 1 0 3 0 0 0
Week 2 0 17 0 0 0
Week 3 0 21 0 0 0
Week 4 0 21 0 0 0
Week 5 0 23 0 0 0
Week 6 0 80 0 5 0
Week 7 0 194 0 20 0
Week 8 12 702 3 199 20
Week 9 182 2679 16 2395 260
Week 10 737 8711 80 17928 892
Week 11 1674 25497 153 48195 1597
Week 12 2923 46392 175 85563 2003
Week 13 4516 76095 182 122431 2180
Week 14 6002 105386 183 163539 2431
Week 15 6751 127713 189 210409 2995
Week 16 7081 147716 189 258188 3845
From its current state, this should give the outcome you're looking for:
df = df.set_index('Country/Region') # optional
df.groupby(df.columns, axis=1).sum() # Stolen from Scott Boston as it's a superior method.
Output:
index Brazil Canada
Country/Region
Week 1 0 3
Week 2 0 17
Week 3 0 21
Week 4 0 21
Week 5 0 23
Week 6 0 85
Week 7 0 214
Week 8 12 924
Week 9 182 5350
Week 10 737 27611
Week 11 1674 75442
Week 12 2923 134133
Week 13 4516 200888
Week 14 6002 271539
Week 15 6751 341306
Week 16 7081 409938
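If your pandas version warns that grouping on axis=1 is deprecated (it is in newer releases), a sketch of the same idea via a transpose would be (assuming Country/Region has already been set as the index, as above):

summed = df.T.groupby(level=0).sum().T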
I found your dataset interesting; here's how I would clean it up from step 1:
import numpy as np
import pandas as pd

df = pd.read_csv('file.csv')
df = df.set_index(['Province/State', 'Country/Region', 'Lat', 'Long']).stack().reset_index()
df.columns = ['Province/State', 'Country/Region', 'Lat', 'Long', 'date', 'value']
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date')
df = df.pivot_table(index=df.index, columns='Country/Region', values='value', aggfunc=np.sum)
print(df)
Output:
Country/Region Afghanistan Albania Algeria Andorra Angola ... West Bank and Gaza Western Sahara Yemen Zambia Zimbabwe
date ...
2020-01-22 0 0 0 0 0 ... 0 0 0 0 0
2020-01-23 0 0 0 0 0 ... 0 0 0 0 0
2020-01-24 0 0 0 0 0 ... 0 0 0 0 0
2020-01-25 0 0 0 0 0 ... 0 0 0 0 0
2020-01-26 0 0 0 0 0 ... 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ...
2020-07-30 36542 5197 29831 922 1109 ... 11548 10 1726 5555 3092
2020-07-31 36675 5276 30394 925 1148 ... 11837 10 1728 5963 3169
2020-08-01 36710 5396 30950 925 1164 ... 12160 10 1730 6228 3659
2020-08-02 36710 5519 31465 925 1199 ... 12297 10 1734 6347 3921
2020-08-03 36747 5620 31972 937 1280 ... 12541 10 1734 6580 4075
If you now want to do weekly aggregations, it's as simple as:
print(df.resample('w').sum())
Output:
Country/Region Afghanistan Albania Algeria Andorra Angola ... West Bank and Gaza Western Sahara Yemen Zambia Zimbabwe
date ...
2020-01-26 0 0 0 0 0 ... 0 0 0 0 0
2020-02-02 0 0 0 0 0 ... 0 0 0 0 0
2020-02-09 0 0 0 0 0 ... 0 0 0 0 0
2020-02-16 0 0 0 0 0 ... 0 0 0 0 0
2020-02-23 0 0 0 0 0 ... 0 0 0 0 0
2020-03-01 7 0 6 0 0 ... 0 0 0 0 0
2020-03-08 10 0 85 7 0 ... 43 0 0 0 0
2020-03-15 57 160 195 7 0 ... 209 0 0 0 0
2020-03-22 175 464 705 409 5 ... 309 0 0 11 7
2020-03-29 632 1142 2537 1618 29 ... 559 0 0 113 31
2020-04-05 1783 2000 6875 2970 62 ... 1178 4 0 262 59
2020-04-12 3401 2864 11629 4057 128 ... 1847 30 3 279 84
2020-04-19 5838 3603 16062 4764 143 ... 2081 42 7 356 154
2020-04-26 8918 4606 21211 5087 174 ... 2353 42 7 541 200
2020-05-03 15149 5391 27943 5214 208 ... 2432 42 41 738 244
2020-05-10 25286 5871 36315 5265 274 ... 2607 42 203 1260 241
2020-05-17 39634 6321 45122 5317 327 ... 2632 42 632 3894 274
2020-05-24 61342 6798 54185 5332 402 ... 2869 45 1321 5991 354
2020-05-31 91885 7517 62849 5344 536 ... 3073 63 1932 7125 894
2020-06-07 126442 8378 68842 5868 609 ... 3221 63 3060 7623 1694
2020-06-14 159822 9689 74147 5967 827 ... 3396 63 4236 8836 2335
2020-06-21 191378 12463 79737 5981 1142 ... 4466 63 6322 9905 3089
2020-06-28 210487 15349 87615 5985 1522 ... 10242 70 7360 10512 3813
2020-07-05 224560 18707 102918 5985 2186 ... 21897 70 8450 11322 4426
2020-07-12 237087 22399 124588 5985 2940 ... 36949 70 9489 13002 6200
2020-07-19 245264 26845 149611 6098 4279 ... 52323 70 10855 16350 9058
2020-07-26 250970 31255 178605 6237 5919 ... 68154 70 11571 26749 14933
2020-08-02 255739 36370 208457 6429 7648 ... 80685 70 12023 38896 22241
2020-08-09 36747 5620 31972 937 1280 ... 12541 10 1734 6580 4075
Try:
np.random.seed(0)
df = pd.DataFrame(np.random.randint(0,100,(20,5)), columns=[*'ZAABC'])
df.groupby(df.columns, axis=1, sort=False).sum()
Output:
Z A B C
0 44 111 67 67
1 9 104 36 87
2 70 176 12 58
3 65 126 46 88
4 81 62 77 72
5 9 100 69 79
6 47 146 99 88
7 49 48 19 14
8 39 97 9 57
9 32 105 23 35
10 75 83 34 0
11 0 89 5 38
12 17 83 42 58
13 31 66 41 57
14 35 57 82 91
15 0 113 53 12
16 42 159 68 6
17 68 50 76 52
18 78 35 99 58
19 23 92 85 48
You can try a transpose and groupby, e.g. something like the following.
df_T = df.transpose()
df_T.groupby(df_T.index).sum()['Canada']
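If you want the whole table back rather than just the Canada column, the same pattern (using the df_T above) could be:

df_T.groupby(df_T.index).sum().transpose()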
Here's a way to do it:
df.columns = [(col + str(i)) if col.startswith('Canada') else col for i, col in enumerate(df.columns)]
df = df.assign(Canada=df.filter(like='Canada').sum(axis=1)).drop(columns=[x for x in df.columns if x.startswith('Canada') and x != 'Canada'])
First we rename the columns starting with Canada by appending their integer position, which ensures they are no longer duplicates.
Then we use sum() to add across columns like Canada, put the result in a new column named Canada, and drop the columns that were originally named Canada.
Full test code is:
import pandas as pd
df = pd.DataFrame(
    columns=[x.strip() for x in 'Brazil Canada Canada Canada Canada'.split()],
    index=['Week ' + str(i) for i in range(1, 17)],
    data=[[i] * 5 for i in range(1, 17)])
df.columns.names=['Country/Region']
print(df)
df.columns = [(col + str(i)) if col.startswith('Canada') else col for i, col in enumerate(df.columns)]
df = df.assign(Canada=df.filter(like='Canada').sum(axis=1)).drop(columns=[x for x in df.columns if x.startswith('Canada') and x != 'Canada'])
print(df)
Output:
Country/Region Brazil Canada Canada Canada Canada
Week 1 1 1 1 1 1
Week 2 2 2 2 2 2
Week 3 3 3 3 3 3
Week 4 4 4 4 4 4
Week 5 5 5 5 5 5
Week 6 6 6 6 6 6
Week 7 7 7 7 7 7
Week 8 8 8 8 8 8
Week 9 9 9 9 9 9
Week 10 10 10 10 10 10
Week 11 11 11 11 11 11
Week 12 12 12 12 12 12
Week 13 13 13 13 13 13
Week 14 14 14 14 14 14
Week 15 15 15 15 15 15
Week 16 16 16 16 16 16
Brazil Canada
Week 1 1 4
Week 2 2 8
Week 3 3 12
Week 4 4 16
Week 5 5 20
Week 6 6 24
Week 7 7 28
Week 8 8 32
Week 9 9 36
Week 10 10 40
Week 11 11 44
Week 12 12 48
Week 13 13 52
Week 14 14 56
Week 15 15 60
Week 16 16 64
I have the following DataFrame:
id x y timestamp sensorTime
1 32 30 1031 2002
1 4 105 1035 2005
1 8 110 1050 2006
2 18 10 1500 3600
2 40 20 1550 3610
2 80 10 1450 3620
....
import pandas as pd
import numpy as np
df = pd.DataFrame(np.array([[1,1,1,2,2,2], [32,4,8,18,40,80], [30,105,110,10,20,10], [1031,1035,1050,1500,1550,1450], [2002, 2005, 2006, 3600, 3610, 3620]])).T
df.columns = ['id', 'x', 'y', 'timestamp', 'sensorTime']
For each group (grouped by id), I would like to add the differences of sensorTime to the group's first value of timestamp. Something like the following:
start = df.iloc[0]['timestamp']
df['sensorTime'] -= df.iloc[0]['sensorTime']
df['sensorTime'] += start
But I would like to do this for each id group separately.
The resulting DataFrame should be:
id x y timestamp sensorTime
1 32 30 1031 1031
1 4 105 1035 1034
1 8 110 1050 1035
2 18 10 1500 1500
2 40 20 1550 1510
2 80 10 1450 1520
....
How can this operation be done per group?
df
id x y timestamp sensorTime
0 1 32 30 1031 2002
1 1 4 105 1035 2005
2 1 8 110 1050 2006
3 2 18 10 1500 3600
4 2 40 20 1550 3610
5 2 80 10 1450 3620
You can group by id and then pass both timestamp and sensorTime. Then you can use diff to get the difference of sensorTime. The first value would be NaN and you can replace it with the first value of timestamp of that group. Then you can simply do cumsum to get the desired output.
def func(x):
    diff = x['sensorTime'].diff()
    diff.iloc[0] = x['timestamp'].iloc[0]
    return diff.cumsum().to_frame()
df['sensorTime'] = df.groupby('id')[['timestamp', 'sensorTime']].apply(func)
df
id x y timestamp sensorTime
0 1 32 30 1031 1031.0
1 1 4 105 1035 1034.0
2 1 8 110 1050 1035.0
3 2 18 10 1500 1500.0
4 2 40 20 1550 1510.0
5 2 80 10 1450 1520.0
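If you prefer to avoid apply, a minimal transform-based sketch (same column names, just an alternative way to express it) gives the same result:

first_ts = df.groupby('id')['timestamp'].transform('first')
first_st = df.groupby('id')['sensorTime'].transform('first')
df['sensorTime'] = df['sensorTime'] - first_st + first_ts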
You could run a groupby twice: first to get the difference in sensorTime, then to do the cumulative sum:
box = df.groupby("id").sensorTime.transform("diff")
df.assign(
    new_sensorTime=np.where(box.isna(), df.timestamp, box),
    new=lambda x: x.groupby("id")["new_sensorTime"].cumsum(),
).drop(columns="new_sensorTime")
id x y timestamp sensorTime new
0 1 32 30 1031 2002 1031.0
1 1 4 105 1035 2005 1034.0
2 1 8 110 1050 2006 1035.0
3 2 18 10 1500 3600 1500.0
4 2 40 20 1550 3610 1510.0
5 2 80 10 1450 3620 1520.0
I want to estimate the strategy I have made:
buy - where K_Class is 1
sell - where K_Class is 0
All prices refer to the Close column at that time.
For example:
Suppose I start with an amount of 10000. The first time I buy is 2017/03/13 and the first time I sell is 2017/03/17. The second time I buy is 2017/03/20 and the second time I sell is 2017/03/22.
My question: by the end, how do I calculate the amount of money I have?
Time Close K_Class
0 2017/03/06 31.72 0
1 2017/03/08 33.99 0
2 2017/03/09 32.02 0
3 2017/03/10 30.66 0
4 2017/03/13 30.94 1
5 2017/03/15 32.56 1
6 2017/03/17 33.31 0
7 2017/03/20 34.07 1
8 2017/03/22 34.40 0
9 2017/03/24 32.98 1
10 2017/03/27 33.26 0
11 2017/03/28 31.60 0
12 2017/03/29 30.36 0
13 2017/03/30 28.83 0
14 2017/04/11 27.01 0
15 2017/04/12 24.31 0
16 2017/04/14 24.22 0
17 2017/04/17 21.80 0
18 2017/04/18 21.20 1
19 2017/04/19 23.32 1
20 2017/04/20 24.43 0
21 2017/04/24 23.85 1
22 2017/04/26 23.97 1
23 2017/04/27 24.31 1
24 2017/04/28 23.50 1
25 2017/05/02 22.57 1
26 2017/05/03 22.67 1
27 2017/05/04 22.11 1
28 2017/05/05 21.26 1
29 2017/05/08 19.37 1
.. ... ... ...
275 2018/08/01 13.38 0
276 2018/08/03 12.49 0
277 2018/08/06 12.50 0
278 2018/08/07 12.78 0
279 2018/08/09 12.93 0
280 2018/08/10 13.15 0
281 2018/08/13 13.14 1
282 2018/08/14 13.15 0
283 2018/08/15 12.80 0
284 2018/08/17 12.29 0
285 2018/08/21 12.39 0
286 2018/08/22 12.15 0
287 2018/08/23 12.27 0
288 2018/08/24 12.31 0
289 2018/08/27 12.47 0
290 2018/08/29 12.31 0
291 2018/08/30 12.13 0
292 2018/08/31 11.69 0
293 2018/09/03 11.60 1
294 2018/09/04 11.65 0
295 2018/09/05 11.45 0
296 2018/09/07 11.42 0
297 2018/09/10 10.71 0
298 2018/09/11 10.76 1
299 2018/09/12 10.74 0
300 2018/09/13 10.85 1
301 2018/09/14 10.79 0
302 2018/09/18 10.58 1
303 2018/09/19 10.65 1
304 2018/09/21 10.73 1
You can start with this:
df = pd.DataFrame({'price':np.arange(10), 'class':np.random.randint(2, size=10)})
df['diff'] = -1 * df['class'].diff()
df.loc[0,['diff']] = -1 * df.loc[0,['class']].values
df['money'] = df['price']*df['diff']
Here 'diff' represents the buy and sell action (-1 for buy and +1 for sell). Its product with the price gives the change in the money you have. Sum it up and add your initial money to get your final amount.
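For completeness, a minimal sketch of that final step under the same one-unit-per-signal simplification (the 10000 initial amount is taken from the question):

initial_money = 10000
final_money = initial_money + df['money'].sum()   # ignores the value of any position still held at the end
print(final_money)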
df['diff'] = df['K_Class'].diff()
stock_sell = 0
current_amount = 10000
for n in range(0, df.index.size - 1):
    print(n)
    if df.iloc[n, 10] == 1:     # position 10 is assumed to be the 'diff' column: 1 means a buy signal
        stock_sell = current_amount / df.iloc[n, 4]   # position 4 is assumed to be the 'Close' column; stock_sell holds the shares bought
    if df.iloc[n, 10] == -1:    # -1 means a sell signal
        current_amount = stock_sell * df.iloc[n, 4]
print(current_amount)
I want to estimate a trading strategy, given the amount I invest in a particular stock. Basically, when I see K_Class is 1, I buy; when I see K_Class is 0, I sell. To keep it simple, we can ignore the open, high, and low values and just use the close price for the estimate.
We do want to iterate over the whole Series, following 1 = buy and 0 = sell, no matter whether it turns out to be right or wrong.
I have a pandas DataFrame with a Series called K_Class, a boolean that is just 1 (buy) and 0 (sell).
From the first day K_Class appears as 1, I buy; if the next day it is 0, I sell immediately at the close price.
How can I write a for loop to test the overall invested money and invest time (using pandas and Python techniques)?
Please feel free to add more variables.
So far I have:
invest_amount = 10000
stock_owned = 10000 / p1  # p1 is the close price on the first day K_Class appears as 1
invest_time = 0
Time Close K_Class
0 2017/03/06 31.72 0
1 2017/03/08 33.99 0
2 2017/03/09 32.02 0
3 2017/03/10 30.66 0
4 2017/03/13 30.94 1
5 2017/03/15 32.56 1
6 2017/03/17 33.31 0
7 2017/03/20 34.07 1
8 2017/03/22 34.40 0
9 2017/03/24 32.98 1
10 2017/03/27 33.26 0
11 2017/03/28 31.60 0
12 2017/03/29 30.36 0
13 2017/03/30 28.83 0
14 2017/04/11 27.01 0
15 2017/04/12 24.31 0
16 2017/04/14 24.22 0
17 2017/04/17 21.80 0
18 2017/04/18 21.20 1
19 2017/04/19 23.32 1
20 2017/04/20 24.43 0
21 2017/04/24 23.85 1
22 2017/04/26 23.97 1
23 2017/04/27 24.31 1
24 2017/04/28 23.50 1
25 2017/05/02 22.57 1
26 2017/05/03 22.67 1
27 2017/05/04 22.11 1
28 2017/05/05 21.26 1
29 2017/05/08 19.37 1
.. ... ... ...
275 2018/08/01 13.38 0
276 2018/08/03 12.49 0
277 2018/08/06 12.50 0
278 2018/08/07 12.78 0
279 2018/08/09 12.93 0
280 2018/08/10 13.15 0
281 2018/08/13 13.14 1
282 2018/08/14 13.15 0
283 2018/08/15 12.80 0
284 2018/08/17 12.29 0
285 2018/08/21 12.39 0
286 2018/08/22 12.15 0
287 2018/08/23 12.27 0
288 2018/08/24 12.31 0
289 2018/08/27 12.47 0
290 2018/08/29 12.31 0
291 2018/08/30 12.13 0
292 2018/08/31 11.69 0
293 2018/09/03 11.60 1
294 2018/09/04 11.65 0
295 2018/09/05 11.45 0
296 2018/09/07 11.42 0
297 2018/09/10 10.71 0
298 2018/09/11 10.76 1
299 2018/09/12 10.74 0
300 2018/09/13 10.85 1
301 2018/09/14 10.79 0
302 2018/09/18 10.58 1
303 2018/09/19 10.65 1
304 2018/09/21 10.73 1
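To make the goal concrete, here is a rough sketch of the loop I have in mind (my own illustration, assuming the Close and K_Class columns above and full reinvestment on every buy):

invest_amount = 10000.0
shares = 0.0
invest_time = 0
holding = False
for _, row in df.iterrows():
    if row['K_Class'] == 1 and not holding:
        shares = invest_amount / row['Close']   # buy at the close price
        holding = True
        invest_time += 1
    elif row['K_Class'] == 0 and holding:
        invest_amount = shares * row['Close']   # sell at the close price
        holding = False
print(invest_amount, invest_time)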
Given the following dataframe df:
app platform uuid minutes
0 1 0 a696ccf9-22cb-428b-adee-95c9a97a4581 67
1 2 0 8e17a2eb-f0ee-49ae-b8c2-c9f9926aa56d 1
2 2 1 40AD6CD1-4A7B-48DD-8815-1829C093A95C 13
3 1 0 26c1022a-7a8e-42a2-b7cc-bea6bffa7a6f 2
4 2 0 34271596-eebb-4423-b890-dc3761ed37ca 8
5 3 1 C57D0F52-B565-4322-85D2-C2798F7CA6FF 16
6 2 0 245501ec2e39cb782bab1fb02d7813b7 1
7 3 1 DE6E4714-5A3C-4C80-BD81-EAACB2364DF0 30
8 3 0 f88eb774-fdf3-4d1d-a91d-0b4ab95cf36e 10
9 2 0 9c08c860-7a6d-4810-a5c3-f3af2a3fcf66 470
10 3 1 19fdaedfd0dbdaf6a7a6b49619f11a19 3
11 3 1 AAF1CFF7-4564-4C79-B2D8-F0AAF9C9971B 58
12 2 0 4eb1024b-c293-42a4-95a2-31b20c3b524b 24
13 3 1 8E0B0BE3-8553-4F38-9837-6C907E01F84C 7
14 3 1 E8B2849C-F050-4DCD-B311-5D57015466AE 465
15 2 0 ec7fedb6-b118-424a-babe-b8ffad579685 266
16 1 0 7e302dcb-ceaf-406c-a9e5-66933d921064 184
17 2 0 f786528ded200c9f553dd3a5e9e9bb2d 10
18 3 1 1E291633-AF27-4DFB-8DA4-4A5B63F175CF 13
19 2 0 953a525c-97e0-4c2f-90e0-dfebde3ec20d 2408
I'll group it:
y=df.groupby(['app','platform','uuid']).sum().reset_index().sort(['app','platform','minutes'],ascending=[1,1,0]).set_index(['app','platform','uuid'])
minutes
app platform uuid
1 0 7e302dcb-ceaf-406c-a9e5-66933d921064 184
a696ccf9-22cb-428b-adee-95c9a97a4581 67
26c1022a-7a8e-42a2-b7cc-bea6bffa7a6f 2
2 0 953a525c-97e0-4c2f-90e0-dfebde3ec20d 2408
9c08c860-7a6d-4810-a5c3-f3af2a3fcf66 470
ec7fedb6-b118-424a-babe-b8ffad579685 266
4eb1024b-c293-42a4-95a2-31b20c3b524b 24
f786528ded200c9f553dd3a5e9e9bb2d 10
34271596-eebb-4423-b890-dc3761ed37ca 8
245501ec2e39cb782bab1fb02d7813b7 1
8e17a2eb-f0ee-49ae-b8c2-c9f9926aa56d 1
1 40AD6CD1-4A7B-48DD-8815-1829C093A95C 13
3 0 f88eb774-fdf3-4d1d-a91d-0b4ab95cf36e 10
1 E8B2849C-F050-4DCD-B311-5D57015466AE 465
AAF1CFF7-4564-4C79-B2D8-F0AAF9C9971B 58
DE6E4714-5A3C-4C80-BD81-EAACB2364DF0 30
C57D0F52-B565-4322-85D2-C2798F7CA6FF 16
1E291633-AF27-4DFB-8DA4-4A5B63F175CF 13
8E0B0BE3-8553-4F38-9837-6C907E01F84C 7
19fdaedfd0dbdaf6a7a6b49619f11a19 3
That gives the minutes per uuid in descending order.
Now I will take the cumulative sum of minutes within each app/platform group:
y.groupby(level=[0,1]).cumsum()
app platform uuid
1 0 7e302dcb-ceaf-406c-a9e5-66933d921064 184
a696ccf9-22cb-428b-adee-95c9a97a4581 251
26c1022a-7a8e-42a2-b7cc-bea6bffa7a6f 253
2 0 953a525c-97e0-4c2f-90e0-dfebde3ec20d 2408
9c08c860-7a6d-4810-a5c3-f3af2a3fcf66 2878
ec7fedb6-b118-424a-babe-b8ffad579685 3144
4eb1024b-c293-42a4-95a2-31b20c3b524b 3168
f786528ded200c9f553dd3a5e9e9bb2d 3178
34271596-eebb-4423-b890-dc3761ed37ca 3186
245501ec2e39cb782bab1fb02d7813b7 3187
8e17a2eb-f0ee-49ae-b8c2-c9f9926aa56d 3188
1 40AD6CD1-4A7B-48DD-8815-1829C093A95C 13
3 0 f88eb774-fdf3-4d1d-a91d-0b4ab95cf36e 10
1 E8B2849C-F050-4DCD-B311-5D57015466AE 465
AAF1CFF7-4564-4C79-B2D8-F0AAF9C9971B 523
DE6E4714-5A3C-4C80-BD81-EAACB2364DF0 553
C57D0F52-B565-4322-85D2-C2798F7CA6FF 569
1E291633-AF27-4DFB-8DA4-4A5B63F175CF 582
8E0B0BE3-8553-4F38-9837-6C907E01F84C 589
19fdaedfd0dbdaf6a7a6b49619f11a19 592
My question is: how can I get the percentage against the total cumulative sum, per group, i.e., something like this:
app platform uuid
1 0 7e302dcb-ceaf-406c-a9e5-66933d921064 184 0.26
a696ccf9-22cb-428b-adee-95c9a97a4581 251 0.36
26c1022a-7a8e-42a2-b7cc-bea6bffa7a6f 253 0.36
...
...
...
It's not clear how you came up with 0.26 and 0.36 in your desired output, but assuming those are just dummy numbers, to get a running % of the total for each group, you could do this:
y['cumsum'] = y.groupby(level=[0,1]).cumsum()
y['running_pct'] = y.groupby(level=[0,1])['cumsum'].transform(lambda x: x / x.iloc[-1])
That should give the right output.
In [398]: y['running_pct'].head()
Out[398]:
app platform uuid
1 0 7e302dcb-ceaf-406c-a9e5-66933d921064 0.727273
a696ccf9-22cb-428b-adee-95c9a97a4581 0.992095
26c1022a-7a8e-42a2-b7cc-bea6bffa7a6f 1.000000
2 0 953a525c-97e0-4c2f-90e0-dfebde3ec20d 0.755332
9c08c860-7a6d-4810-a5c3-f3af2a3fcf66 0.902760
Name: running_pct, dtype: float64
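As a quick sanity check (using only the numbers already shown above for app 1, platform 0, whose group total is 253 minutes):

print(184 / 253)   # ≈ 0.727273, matching the first running_pct value
print(253 / 253)   # 1.0 for the last uuid in that group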
EDIT:
Per the comments, if you're looking to wring out a little more performance, this will be faster as of version 0.14.1
y['cumsum'] = y.groupby(level=[0,1])['minutes'].transform('cumsum')
y['running_pct'] = y['cumsum'] / y.groupby(level=[0,1])['minutes'].transform('sum')
And as #Jeff notes, in 0.15.0 this may be faster yet.
y['running_pct'] = y['cumsum'] / y.groupby(level=[0,1])['minutes'].transform('last')