How to do Stock Trading Back Test Using Pandas and Basic Iteration? - python

I want to estimate a trading strategy, given the amount I invest in a particular stock. Basically, when I see "K_Class" is 1, I buy; when I see "K_Class" is 0, I sell. To keep it simple, we can ignore the open, high, and low values and just use the close price for the estimate.
We do want to iterate over the whole Series, following 1 = buy, 0 = sell, regardless of whether the signal is right or wrong.
I have a pandas DataFrame with a Series called "K_Class", a boolean: 1 (buy) and 0 (sell).
From the first day 'K_Class' appears as 1, I buy; if the next day it is 0, I sell immediately at the close price.
How can I write a for loop to work out the final invested money and the number of times I invested (using pandas and Python techniques)?
Please feel free to add more variables.
So far I have:
invest_amount = 10000
stock_owned = 10000 / p1  # p1 = the close price on the first day K_Class is 1
invest_time = 0
Time Close K_Class
0 2017/03/06 31.72 0
1 2017/03/08 33.99 0
2 2017/03/09 32.02 0
3 2017/03/10 30.66 0
4 2017/03/13 30.94 1
5 2017/03/15 32.56 1
6 2017/03/17 33.31 0
7 2017/03/20 34.07 1
8 2017/03/22 34.40 0
9 2017/03/24 32.98 1
10 2017/03/27 33.26 0
11 2017/03/28 31.60 0
12 2017/03/29 30.36 0
13 2017/03/30 28.83 0
14 2017/04/11 27.01 0
15 2017/04/12 24.31 0
16 2017/04/14 24.22 0
17 2017/04/17 21.80 0
18 2017/04/18 21.20 1
19 2017/04/19 23.32 1
20 2017/04/20 24.43 0
21 2017/04/24 23.85 1
22 2017/04/26 23.97 1
23 2017/04/27 24.31 1
24 2017/04/28 23.50 1
25 2017/05/02 22.57 1
26 2017/05/03 22.67 1
27 2017/05/04 22.11 1
28 2017/05/05 21.26 1
29 2017/05/08 19.37 1
.. ... ... ...
275 2018/08/01 13.38 0
276 2018/08/03 12.49 0
277 2018/08/06 12.50 0
278 2018/08/07 12.78 0
279 2018/08/09 12.93 0
280 2018/08/10 13.15 0
281 2018/08/13 13.14 1
282 2018/08/14 13.15 0
283 2018/08/15 12.80 0
284 2018/08/17 12.29 0
285 2018/08/21 12.39 0
286 2018/08/22 12.15 0
287 2018/08/23 12.27 0
288 2018/08/24 12.31 0
289 2018/08/27 12.47 0
290 2018/08/29 12.31 0
291 2018/08/30 12.13 0
292 2018/08/31 11.69 0
293 2018/09/03 11.60 1
294 2018/09/04 11.65 0
295 2018/09/05 11.45 0
296 2018/09/07 11.42 0
297 2018/09/10 10.71 0
298 2018/09/11 10.76 1
299 2018/09/12 10.74 0
300 2018/09/13 10.85 1
301 2018/09/14 10.79 0
302 2018/09/18 10.58 1
303 2018/09/19 10.65 1
304 2018/09/21 10.73 1
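A minimal sketch of the kind of loop being asked for (assuming the DataFrame above is named df, every buy spends the whole cash balance, and an open position is valued at the last close):
invest_amount = 10000.0   # cash on hand
stock_owned = 0.0         # shares currently held
invest_time = 0           # completed buy/sell round trips
holding = False

for _, row in df.iterrows():
    if row['K_Class'] == 1 and not holding:
        stock_owned = invest_amount / row['Close']   # buy at the close
        invest_amount = 0.0
        holding = True
    elif row['K_Class'] == 0 and holding:
        invest_amount = stock_owned * row['Close']   # sell at the close
        stock_owned = 0.0
        holding = False
        invest_time += 1

if holding:   # still invested after the last row
    invest_amount = stock_owned * df['Close'].iloc[-1]

print(invest_amount, invest_time)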

Related

Get Row in Other Table

I have a dataframe 'df'. Using the validation data validData, I want to compute the response rate (Florence = 1/Yes) for rfm_aboveavg (the RFM combinations whose response rates are above the overall response rate). The response rate considers 0/No and 1/Yes, so it is rfm_crosstab[1] / rfm_crosstab['All'].
Using the results from the validation data, I want to display only the rows that also appear in the training-data output, matched on the RFM column. How do I do this?
Data: 'df'
Seq# ID# Gender M R F FirstPurch ChildBks YouthBks CookBks ... ItalCook ItalAtlas ItalArt Florence Related Purchase Mcode Rcode Fcode Yes_Florence No_Florence
0 1 25 1 297 14 2 22 0 1 1 ... 0 0 0 0 0 5 4 2 0 1
1 2 29 0 128 8 2 10 0 0 0 ... 0 0 0 0 0 4 3 2 0 1
2 3 46 1 138 22 7 56 2 1 2 ... 1 0 0 0 2 4 4 3 0 1
3 4 47 1 228 2 1 2 0 0 0 ... 0 0 0 0 0 5 1 1 0 1
4 5 51 1 257 10 1 10 0 0 0 ... 0 0 0 0 0 5 3 1 0 1
My code: Crosstab for training data trainData
import pandas as pd
from sklearn.model_selection import train_test_split

trainData, validData = train_test_split(df, test_size=0.4, random_state=1)
# Response rate for training data as a whole
responseRate = (sum(trainData.Florence == 1) / sum(trainData.Florence == 0)) * 100
# Response rate for RFM categories
# RFM: Combine R, F, M categories into one category
trainData['RFM'] = trainData['Mcode'].astype(str) + trainData['Rcode'].astype(str) + trainData['Fcode'].astype(str)
rfm_crosstab = pd.crosstab(index = [trainData['RFM']], columns = trainData['Florence'], margins = True)
rfm_crosstab['Percentage of 1/Yes'] = 100 * (rfm_crosstab[1] / rfm_crosstab['All'])
# RFM combinations response rates above the overall response
rfm_aboveavg = rfm_crosstab['Percentage of 1/Yes'] > responseRate
rfm_crosstab[rfm_aboveavg]
Output: Training data
Florence 0 1 All Percentage of 1/Yes
RFM
121 3 2 5 40.000000
131 9 1 10 10.000000
212 1 2 3 66.666667
221 6 3 9 33.333333
222 6 1 7 14.285714
313 2 1 3 33.333333
321 17 3 20 15.000000
322 20 4 24 16.666667
323 2 1 3 33.333333
341 61 10 71 14.084507
343 17 2 19 10.526316
411 12 3 15 20.000000
422 26 5 31 16.129032
423 32 8 40 20.000000
441 96 12 108 11.111111
511 19 4 23 17.391304
513 44 8 52 15.384615
521 24 5 29 17.241379
523 74 16 90 17.777778
533 177 28 205 13.658537
My code: Crosstab for validation data validData
# Response rate for RFM categories
# RFM: Combine R, F, M categories into one category
validData['RFM'] = validData['Mcode'].astype(str) + validData['Rcode'].astype(str) + validData['Fcode'].astype(str)
rfm_crosstab1 = pd.crosstab(index = [validData['RFM']], columns = validData['Florence'], margins = True)
rfm_crosstab1['Percentage of 1/Yes'] = 100 * (rfm_crosstab1[1] / rfm_crosstab1['All'])
rfm_crosstab1
Output: Validation data
Florence 0 1 All Percentage of 1/Yes
RFM
131 3 1 4 25.000000
141 8 0 8 0.000000
211 2 1 3 33.333333
212 2 0 2 0.000000
213 0 1 1 100.000000
221 5 0 5 0.000000
222 2 0 2 0.000000
231 21 1 22 4.545455
232 3 0 3 0.000000
233 1 0 1 0.000000
241 11 1 12 8.333333
242 8 0 8 0.000000
243 2 0 2 0.000000
311 7 0 7 0.000000
312 8 0 8 0.000000
313 1 0 1 0.000000
321 12 0 12 0.000000
322 13 0 13 0.000000
323 4 1 5 20.000000
331 19 1 20 5.000000
332 25 2 27 7.407407
333 11 1 12 8.333333
341 36 2 38 5.263158
342 30 2 32 6.250000
343 12 0 12 0.000000
411 8 2 10 20.000000
412 7 0 7 0.000000
413 13 1 14 7.142857
421 21 2 23 8.695652
422 30 1 31 3.225806
423 26 1 27 3.703704
431 51 3 54 5.555556
432 42 7 49 14.285714
433 41 5 46 10.869565
441 68 2 70 2.857143
442 78 3 81 3.703704
443 70 5 75 6.666667
511 17 0 17 0.000000
512 13 1 14 7.142857
513 26 6 32 18.750000
521 19 1 20 5.000000
522 25 6 31 19.354839
523 50 6 56 10.714286
531 66 3 69 4.347826
532 65 3 68 4.411765
533 128 24 152 15.789474
541 86 7 93 7.526882
542 100 6 106 5.660377
543 178 17 195 8.717949
All 1474 126 1600 7.875000
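One way to do the final filtering step (a sketch, assuming the rfm_crosstab, rfm_aboveavg and rfm_crosstab1 objects built above) is to keep only the validation rows whose RFM code also appears in the filtered training output:
# RFM codes that made it into the filtered training output
train_rfm = rfm_crosstab[rfm_aboveavg].index
# Restrict the validation crosstab to those same RFM combinations
rfm_crosstab1[rfm_crosstab1.index.isin(train_rfm)]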

How to join/merge and sum columns with the same name

How can I merge and sum the columns with the same name?
The output should be one column named Canada that is the sum of the four columns named Canada.
Country/Region Brazil Canada Canada Canada Canada
Week 1 0 3 0 0 0
Week 2 0 17 0 0 0
Week 3 0 21 0 0 0
Week 4 0 21 0 0 0
Week 5 0 23 0 0 0
Week 6 0 80 0 5 0
Week 7 0 194 0 20 0
Week 8 12 702 3 199 20
Week 9 182 2679 16 2395 260
Week 10 737 8711 80 17928 892
Week 11 1674 25497 153 48195 1597
Week 12 2923 46392 175 85563 2003
Week 13 4516 76095 182 122431 2180
Week 14 6002 105386 183 163539 2431
Week 15 6751 127713 189 210409 2995
Week 16 7081 147716 189 258188 3845
From its current state, this should give the outcome you're looking for:
df = df.set_index('Country/Region') # optional
df.groupby(df.columns, axis=1).sum() # Stolen from Scott Boston as it's a superior method.
Output:
index Brazil Canada
Country/Region
Week 1 0 3
Week 2 0 17
Week 3 0 21
Week 4 0 21
Week 5 0 23
Week 6 0 85
Week 7 0 214
Week 8 12 924
Week 9 182 5350
Week 10 737 27611
Week 11 1674 75442
Week 12 2923 134133
Week 13 4516 200888
Week 14 6002 271539
Week 15 6751 341306
Week 16 7081 409938
I found your dataset interesting, here's how I would clean it up from step 1:
import numpy as np
import pandas as pd

df = pd.read_csv('file.csv')
df = df.set_index(['Province/State', 'Country/Region', 'Lat', 'Long']).stack().reset_index()
df.columns = ['Province/State', 'Country/Region', 'Lat', 'Long', 'date', 'value']
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date')
df = df.pivot_table(index=df.index, columns='Country/Region', values='value', aggfunc=np.sum)
print(df)
Output:
Country/Region Afghanistan Albania Algeria Andorra Angola ... West Bank and Gaza Western Sahara Yemen Zambia Zimbabwe
date ...
2020-01-22 0 0 0 0 0 ... 0 0 0 0 0
2020-01-23 0 0 0 0 0 ... 0 0 0 0 0
2020-01-24 0 0 0 0 0 ... 0 0 0 0 0
2020-01-25 0 0 0 0 0 ... 0 0 0 0 0
2020-01-26 0 0 0 0 0 ... 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ...
2020-07-30 36542 5197 29831 922 1109 ... 11548 10 1726 5555 3092
2020-07-31 36675 5276 30394 925 1148 ... 11837 10 1728 5963 3169
2020-08-01 36710 5396 30950 925 1164 ... 12160 10 1730 6228 3659
2020-08-02 36710 5519 31465 925 1199 ... 12297 10 1734 6347 3921
2020-08-03 36747 5620 31972 937 1280 ... 12541 10 1734 6580 4075
If you now want to do weekly aggregations, it's as simple as:
print(df.resample('w').sum())
Output:
Country/Region Afghanistan Albania Algeria Andorra Angola ... West Bank and Gaza Western Sahara Yemen Zambia Zimbabwe
date ...
2020-01-26 0 0 0 0 0 ... 0 0 0 0 0
2020-02-02 0 0 0 0 0 ... 0 0 0 0 0
2020-02-09 0 0 0 0 0 ... 0 0 0 0 0
2020-02-16 0 0 0 0 0 ... 0 0 0 0 0
2020-02-23 0 0 0 0 0 ... 0 0 0 0 0
2020-03-01 7 0 6 0 0 ... 0 0 0 0 0
2020-03-08 10 0 85 7 0 ... 43 0 0 0 0
2020-03-15 57 160 195 7 0 ... 209 0 0 0 0
2020-03-22 175 464 705 409 5 ... 309 0 0 11 7
2020-03-29 632 1142 2537 1618 29 ... 559 0 0 113 31
2020-04-05 1783 2000 6875 2970 62 ... 1178 4 0 262 59
2020-04-12 3401 2864 11629 4057 128 ... 1847 30 3 279 84
2020-04-19 5838 3603 16062 4764 143 ... 2081 42 7 356 154
2020-04-26 8918 4606 21211 5087 174 ... 2353 42 7 541 200
2020-05-03 15149 5391 27943 5214 208 ... 2432 42 41 738 244
2020-05-10 25286 5871 36315 5265 274 ... 2607 42 203 1260 241
2020-05-17 39634 6321 45122 5317 327 ... 2632 42 632 3894 274
2020-05-24 61342 6798 54185 5332 402 ... 2869 45 1321 5991 354
2020-05-31 91885 7517 62849 5344 536 ... 3073 63 1932 7125 894
2020-06-07 126442 8378 68842 5868 609 ... 3221 63 3060 7623 1694
2020-06-14 159822 9689 74147 5967 827 ... 3396 63 4236 8836 2335
2020-06-21 191378 12463 79737 5981 1142 ... 4466 63 6322 9905 3089
2020-06-28 210487 15349 87615 5985 1522 ... 10242 70 7360 10512 3813
2020-07-05 224560 18707 102918 5985 2186 ... 21897 70 8450 11322 4426
2020-07-12 237087 22399 124588 5985 2940 ... 36949 70 9489 13002 6200
2020-07-19 245264 26845 149611 6098 4279 ... 52323 70 10855 16350 9058
2020-07-26 250970 31255 178605 6237 5919 ... 68154 70 11571 26749 14933
2020-08-02 255739 36370 208457 6429 7648 ... 80685 70 12023 38896 22241
2020-08-09 36747 5620 31972 937 1280 ... 12541 10 1734 6580 4075
Try:
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 100, (20, 5)), columns=[*'ZAABC'])
df.groupby(df.columns, axis=1, sort=False).sum()
Output:
Z A B C
0 44 111 67 67
1 9 104 36 87
2 70 176 12 58
3 65 126 46 88
4 81 62 77 72
5 9 100 69 79
6 47 146 99 88
7 49 48 19 14
8 39 97 9 57
9 32 105 23 35
10 75 83 34 0
11 0 89 5 38
12 17 83 42 58
13 31 66 41 57
14 35 57 82 91
15 0 113 53 12
16 42 159 68 6
17 68 50 76 52
18 78 35 99 58
19 23 92 85 48
You can try a transpose and groupby, e.g. something similar to the below.
df_T = df.transpose()
df_T.groupby(df_T.index).sum()['Canada']
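To get the whole summed table back in its original orientation, the grouped result can simply be transposed again (a sketch, reusing the df_T from above):
result = df_T.groupby(df_T.index).sum().transpose()
print(result)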
Here's a way to do it:
df.columns = [(col + str(i)) if col.startswith('Canada') else col for i, col in enumerate(df.columns)]
df = df.assign(Canada=df.filter(like='Canada').sum(axis=1)).drop(columns=[x for x in df.columns if x.startswith('Canada') and x != 'Canada'])
First we rename the columns starting with Canada by appending their integer position, which ensures they are no longer duplicates.
Then we use sum() to add across columns like Canada, put the result in a new column named Canada, and drop the columns that were originally named Canada.
Full test code is:
import pandas as pd
df = pd.DataFrame(
    columns=[x.strip() for x in 'Brazil Canada Canada Canada Canada'.split()],
    index=['Week ' + str(i) for i in range(1, 17)],
    data=[[i] * 5 for i in range(1, 17)])
df.columns.names=['Country/Region']
print(df)
df.columns = [(col + str(i)) if col.startswith('Canada') else col for i, col in enumerate(df.columns)]
df = df.assign(Canada=df.filter(like='Canada').sum(axis=1)).drop(columns=[x for x in df.columns if x.startswith('Canada') and x != 'Canada'])
print(df)
Output:
Country/Region Brazil Canada Canada Canada Canada
Week 1 1 1 1 1 1
Week 2 2 2 2 2 2
Week 3 3 3 3 3 3
Week 4 4 4 4 4 4
Week 5 5 5 5 5 5
Week 6 6 6 6 6 6
Week 7 7 7 7 7 7
Week 8 8 8 8 8 8
Week 9 9 9 9 9 9
Week 10 10 10 10 10 10
Week 11 11 11 11 11 11
Week 12 12 12 12 12 12
Week 13 13 13 13 13 13
Week 14 14 14 14 14 14
Week 15 15 15 15 15 15
Week 16 16 16 16 16 16
Brazil Canada
Week 1 1 4
Week 2 2 8
Week 3 3 12
Week 4 4 16
Week 5 5 20
Week 6 6 24
Week 7 7 28
Week 8 8 32
Week 9 9 36
Week 10 10 40
Week 11 11 44
Week 12 12 48
Week 13 13 52
Week 14 14 56
Week 15 15 60
Week 16 16 64

How to write this iteration?

I want to estimate the strategy I made:
buy where K_Class is 1
sell where K_Class is 0
all prices refer to the Close column at that time
For example:
Suppose I have an amount of money of 10000; the first time I buy is 2017/03/13 and the first time I sell is 2017/03/17. The second time I buy is 2017/03/20 and the second time I sell is 2017/03/22.
My question: by the end, how do I calculate the amount of money I have?
Time Close K_Class
0 2017/03/06 31.72 0
1 2017/03/08 33.99 0
2 2017/03/09 32.02 0
3 2017/03/10 30.66 0
4 2017/03/13 30.94 1
5 2017/03/15 32.56 1
6 2017/03/17 33.31 0
7 2017/03/20 34.07 1
8 2017/03/22 34.40 0
9 2017/03/24 32.98 1
10 2017/03/27 33.26 0
11 2017/03/28 31.60 0
12 2017/03/29 30.36 0
13 2017/03/30 28.83 0
14 2017/04/11 27.01 0
15 2017/04/12 24.31 0
16 2017/04/14 24.22 0
17 2017/04/17 21.80 0
18 2017/04/18 21.20 1
19 2017/04/19 23.32 1
20 2017/04/20 24.43 0
21 2017/04/24 23.85 1
22 2017/04/26 23.97 1
23 2017/04/27 24.31 1
24 2017/04/28 23.50 1
25 2017/05/02 22.57 1
26 2017/05/03 22.67 1
27 2017/05/04 22.11 1
28 2017/05/05 21.26 1
29 2017/05/08 19.37 1
.. ... ... ...
275 2018/08/01 13.38 0
276 2018/08/03 12.49 0
277 2018/08/06 12.50 0
278 2018/08/07 12.78 0
279 2018/08/09 12.93 0
280 2018/08/10 13.15 0
281 2018/08/13 13.14 1
282 2018/08/14 13.15 0
283 2018/08/15 12.80 0
284 2018/08/17 12.29 0
285 2018/08/21 12.39 0
286 2018/08/22 12.15 0
287 2018/08/23 12.27 0
288 2018/08/24 12.31 0
289 2018/08/27 12.47 0
290 2018/08/29 12.31 0
291 2018/08/30 12.13 0
292 2018/08/31 11.69 0
293 2018/09/03 11.60 1
294 2018/09/04 11.65 0
295 2018/09/05 11.45 0
296 2018/09/07 11.42 0
297 2018/09/10 10.71 0
298 2018/09/11 10.76 1
299 2018/09/12 10.74 0
300 2018/09/13 10.85 1
301 2018/09/14 10.79 0
302 2018/09/18 10.58 1
303 2018/09/19 10.65 1
304 2018/09/21 10.73 1
You can start with this:
import numpy as np
import pandas as pd

df = pd.DataFrame({'price': np.arange(10), 'class': np.random.randint(2, size=10)})
df['diff'] = -1 * df['class'].diff()
df.loc[0, ['diff']] = -1 * df.loc[0, ['class']].values
df['money'] = df['price'] * df['diff']
so 'diff' represents the buy and sell action (-1 for buy and +1 for sell). Its product with the price gives the change in the money you have. Sum it up and add your initial money, and you get your final money.
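For example, totalling it up could look like this (a sketch, assuming a starting amount of 10000):
initial_money = 10000
final_money = initial_money + df['money'].sum()
print(final_money)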
Applied to your actual DataFrame, that looks like:
df['diff'] = df['K_Class'].diff()
stock_sell = 0
current_amount = 10000
for n in range(df.index.size):
    print(n)
    if df['diff'].iloc[n] == 1:       # 0 -> 1: buy at the close
        stock_sell = current_amount / df['Close'].iloc[n]
    if df['diff'].iloc[n] == -1:      # 1 -> 0: sell at the close
        current_amount = stock_sell * df['Close'].iloc[n]
print(current_amount)

group by within group by in pandas

Consider the following dataset:
min 5-min a
0 0 800
0 0 801
1 0 802
1 0 803
1 0 804
2 0 805
2 0 805
2 0 810
3 0 801
3 0 802
3 0 803
4 0 804
4 0 805
5 1 806
5 1 800
5 1 890
6 1 890
6 1 880
6 1 800
7 1 804
7 1 806
8 1 801
9 1 800
9 1 900
10 1 770
10 1 803
10 1 811
I need to calculate the std of a within each group defined by the min column, and then calculate the mean of those values within each group of the 5-min column.
I do not know how to find the borders of the 5-min groups after calculating the std.
How should I save the data so that I know which std belongs to which 5-min group?
data.groupby('min').a.std()
I would appreciate any help.
Taskos' answer is great, but I wasn't sure if you needed the data pushed back into the dataframe or not. Assuming you want to add the new columns to the parent after each groupby operation, I've opted to do that for you as follows:
import pandas as pd

df = your_df

# First we create the standard deviation column
def add_std(grp):
    grp['stdevs'] = grp['a'].std()
    return grp

df = df.groupby('min').apply(add_std)

# Next we create the 5 minute mean column
def add_meandev(grp):
    grp['meandev'] = grp['stdevs'].mean()
    return grp

print(df.groupby('5-min').apply(add_meandev))
This can be done more elegantly by chaining etc but I have opted to lay it out like this so that the underlying process is more visible to you.
The final output from this will look like the following:
min 5-min a stdevs meandev
0 0 0 800 0.707107 1.345283
1 0 0 801 0.707107 1.345283
2 1 0 802 1.000000 1.345283
3 1 0 803 1.000000 1.345283
4 1 0 804 1.000000 1.345283
5 2 0 805 2.886751 1.345283
6 2 0 805 2.886751 1.345283
7 2 0 810 2.886751 1.345283
8 3 0 801 1.000000 1.345283
9 3 0 802 1.000000 1.345283
10 3 0 803 1.000000 1.345283
11 4 0 804 0.707107 1.345283
12 4 0 805 0.707107 1.345283
13 5 1 806 50.318983 39.107147
14 5 1 800 50.318983 39.107147
15 5 1 890 50.318983 39.107147
16 6 1 890 49.328829 39.107147
17 6 1 880 49.328829 39.107147
18 6 1 800 49.328829 39.107147
19 7 1 804 1.414214 39.107147
20 7 1 806 1.414214 39.107147
21 8 1 801 NaN 39.107147
22 9 1 800 70.710678 39.107147
23 9 1 900 70.710678 39.107147
24 10 1 770 21.733231 39.107147
25 10 1 803 21.733231 39.107147
26 10 1 811 21.733231 39.107147
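The same two columns can also be attached more compactly with transform instead of apply (a sketch, assuming the same 'min', '5-min' and 'a' column names):
df['stdevs'] = df.groupby('min')['a'].transform('std')
df['meandev'] = df.groupby('5-min')['stdevs'].transform('mean')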
Not 100% clear on what you are asking... but I think this is what you need:
data.groupby(['min','5-min']).std().groupby('5-min').mean()
This takes the standard deviations calculated per 'min' group and then averages them within each 5-min group.

Python pandas groupby with cumsum and percentage

Given the following dataframe df:
app platform uuid minutes
0 1 0 a696ccf9-22cb-428b-adee-95c9a97a4581 67
1 2 0 8e17a2eb-f0ee-49ae-b8c2-c9f9926aa56d 1
2 2 1 40AD6CD1-4A7B-48DD-8815-1829C093A95C 13
3 1 0 26c1022a-7a8e-42a2-b7cc-bea6bffa7a6f 2
4 2 0 34271596-eebb-4423-b890-dc3761ed37ca 8
5 3 1 C57D0F52-B565-4322-85D2-C2798F7CA6FF 16
6 2 0 245501ec2e39cb782bab1fb02d7813b7 1
7 3 1 DE6E4714-5A3C-4C80-BD81-EAACB2364DF0 30
8 3 0 f88eb774-fdf3-4d1d-a91d-0b4ab95cf36e 10
9 2 0 9c08c860-7a6d-4810-a5c3-f3af2a3fcf66 470
10 3 1 19fdaedfd0dbdaf6a7a6b49619f11a19 3
11 3 1 AAF1CFF7-4564-4C79-B2D8-F0AAF9C9971B 58
12 2 0 4eb1024b-c293-42a4-95a2-31b20c3b524b 24
13 3 1 8E0B0BE3-8553-4F38-9837-6C907E01F84C 7
14 3 1 E8B2849C-F050-4DCD-B311-5D57015466AE 465
15 2 0 ec7fedb6-b118-424a-babe-b8ffad579685 266
16 1 0 7e302dcb-ceaf-406c-a9e5-66933d921064 184
17 2 0 f786528ded200c9f553dd3a5e9e9bb2d 10
18 3 1 1E291633-AF27-4DFB-8DA4-4A5B63F175CF 13
19 2 0 953a525c-97e0-4c2f-90e0-dfebde3ec20d 2408
I'll group it:
y=df.groupby(['app','platform','uuid']).sum().reset_index().sort(['app','platform','minutes'],ascending=[1,1,0]).set_index(['app','platform','uuid'])
minutes
app platform uuid
1 0 7e302dcb-ceaf-406c-a9e5-66933d921064 184
a696ccf9-22cb-428b-adee-95c9a97a4581 67
26c1022a-7a8e-42a2-b7cc-bea6bffa7a6f 2
2 0 953a525c-97e0-4c2f-90e0-dfebde3ec20d 2408
9c08c860-7a6d-4810-a5c3-f3af2a3fcf66 470
ec7fedb6-b118-424a-babe-b8ffad579685 266
4eb1024b-c293-42a4-95a2-31b20c3b524b 24
f786528ded200c9f553dd3a5e9e9bb2d 10
34271596-eebb-4423-b890-dc3761ed37ca 8
245501ec2e39cb782bab1fb02d7813b7 1
8e17a2eb-f0ee-49ae-b8c2-c9f9926aa56d 1
1 40AD6CD1-4A7B-48DD-8815-1829C093A95C 13
3 0 f88eb774-fdf3-4d1d-a91d-0b4ab95cf36e 10
1 E8B2849C-F050-4DCD-B311-5D57015466AE 465
AAF1CFF7-4564-4C79-B2D8-F0AAF9C9971B 58
DE6E4714-5A3C-4C80-BD81-EAACB2364DF0 30
C57D0F52-B565-4322-85D2-C2798F7CA6FF 16
1E291633-AF27-4DFB-8DA4-4A5B63F175CF 13
8E0B0BE3-8553-4F38-9837-6C907E01F84C 7
19fdaedfd0dbdaf6a7a6b49619f11a19 3
So now I have the minutes per uuid, within each app/platform, in descending order.
Now I compute the cumulative sum of minutes within each app/platform group:
y.groupby(level=[0,1]).cumsum()
app platform uuid
1 0 7e302dcb-ceaf-406c-a9e5-66933d921064 184
a696ccf9-22cb-428b-adee-95c9a97a4581 251
26c1022a-7a8e-42a2-b7cc-bea6bffa7a6f 253
2 0 953a525c-97e0-4c2f-90e0-dfebde3ec20d 2408
9c08c860-7a6d-4810-a5c3-f3af2a3fcf66 2878
ec7fedb6-b118-424a-babe-b8ffad579685 3144
4eb1024b-c293-42a4-95a2-31b20c3b524b 3168
f786528ded200c9f553dd3a5e9e9bb2d 3178
34271596-eebb-4423-b890-dc3761ed37ca 3186
245501ec2e39cb782bab1fb02d7813b7 3187
8e17a2eb-f0ee-49ae-b8c2-c9f9926aa56d 3188
1 40AD6CD1-4A7B-48DD-8815-1829C093A95C 13
3 0 f88eb774-fdf3-4d1d-a91d-0b4ab95cf36e 10
1 E8B2849C-F050-4DCD-B311-5D57015466AE 465
AAF1CFF7-4564-4C79-B2D8-F0AAF9C9971B 523
DE6E4714-5A3C-4C80-BD81-EAACB2364DF0 553
C57D0F52-B565-4322-85D2-C2798F7CA6FF 569
1E291633-AF27-4DFB-8DA4-4A5B63F175CF 582
8E0B0BE3-8553-4F38-9837-6C907E01F84C 589
19fdaedfd0dbdaf6a7a6b49619f11a19 592
My question is: how can I get the percent against the total cumulative sum, per group, i.e. something like this:
app platform uuid
1 0 7e302dcb-ceaf-406c-a9e5-66933d921064 184 0.26
a696ccf9-22cb-428b-adee-95c9a97a4581 251 0.36
26c1022a-7a8e-42a2-b7cc-bea6bffa7a6f 253 0.36
...
...
...
It's not clear how you came up with 0.26 and 0.36 in your desired output, but assuming those are just dummy numbers, to get a running % of total for each group, you could do this:
y['cumsum'] = y.groupby(level=[0,1]).cumsum()
y['running_pct'] = y.groupby(level=[0,1])['cumsum'].transform(lambda x: x / x.iloc[-1])
Should give the right output.
In [398]: y['running_pct'].head()
Out[398]:
app platform uuid
1 0 7e302dcb-ceaf-406c-a9e5-66933d921064 0.727273
a696ccf9-22cb-428b-adee-95c9a97a4581 0.992095
26c1022a-7a8e-42a2-b7cc-bea6bffa7a6f 1.000000
2 0 953a525c-97e0-4c2f-90e0-dfebde3ec20d 0.755332
9c08c860-7a6d-4810-a5c3-f3af2a3fcf66 0.902760
Name: running_pct, dtype: float64
EDIT:
Per the comments, if you're looking to wring out a little more performance, this will be faster as of version 0.14.1
y['cumsum'] = y.groupby(level=[0,1])['minutes'].transform('cumsum')
y['running_pct'] = y['cumsum'] / y.groupby(level=[0,1])['minutes'].transform('sum')
And as #Jeff notes, in 0.15.0 this may be faster yet.
y['running_pct'] = y['cumsum'] / y.groupby(level=[0,1])['minutes'].transform('last')
