group by within group by in pandas - python

Consider the following dataset:
min 5-min a
0 0 800
0 0 801
1 0 802
1 0 803
1 0 804
2 0 805
2 0 805
2 0 810
3 0 801
3 0 802
3 0 803
4 0 804
4 0 805
5 1 806
5 1 800
5 1 890
6 1 890
6 1 880
6 1 800
7 1 804
7 1 806
8 1 801
9 1 800
9 1 900
10 1 770
10 1 803
10 1 811
I need to calculate the std of a within each group based on the minute, and then calculate the mean of those resulting values within each group of 5 min.
I do not know how to find the border of each 5-min group after calculating the std.
How should I save the data so that I know which std belongs to which group of 5 min?
data.groupby('min').a.std()
I would appreciate any help.

Tasko's answer is great, but I wasn't sure whether you needed the data pushed back into the DataFrame or not. Assuming what you want is to add the new columns to the parent DataFrame after each groupby operation, I've opted to do that for you as follows:
import pandas as pd

df = your_df

# First we create the standard deviation column
def add_std(grp):
    grp['stdevs'] = grp['a'].std()
    return grp

df = df.groupby('min').apply(add_std)

# Next we create the 5-minute mean column
def add_meandev(grp):
    grp['meandev'] = grp['stdevs'].mean()
    return grp

print(df.groupby('5-min').apply(add_meandev))
This can be done more elegantly by chaining, etc. (a compact version is sketched after the output below), but I have opted to lay it out like this so that the underlying process is more visible to you.
The final output from this will look like the following:
min 5-min a stdevs meandev
0 0 0 800 0.707107 1.345283
1 0 0 801 0.707107 1.345283
2 1 0 802 1.000000 1.345283
3 1 0 803 1.000000 1.345283
4 1 0 804 1.000000 1.345283
5 2 0 805 2.886751 1.345283
6 2 0 805 2.886751 1.345283
7 2 0 810 2.886751 1.345283
8 3 0 801 1.000000 1.345283
9 3 0 802 1.000000 1.345283
10 3 0 803 1.000000 1.345283
11 4 0 804 0.707107 1.345283
12 4 0 805 0.707107 1.345283
13 5 1 806 50.318983 39.107147
14 5 1 800 50.318983 39.107147
15 5 1 890 50.318983 39.107147
16 6 1 890 49.328829 39.107147
17 6 1 880 49.328829 39.107147
18 6 1 800 49.328829 39.107147
19 7 1 804 1.414214 39.107147
20 7 1 806 1.414214 39.107147
21 8 1 801 NaN 39.107147
22 9 1 800 70.710678 39.107147
23 9 1 900 70.710678 39.107147
24 10 1 770 21.733231 39.107147
25 10 1 803 21.733231 39.107147
26 10 1 811 21.733231 39.107147
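As mentioned above, the same two steps can be written more compactly with groupby().transform(), which broadcasts each group statistic back onto the rows. A sketch, assuming the same 'min', '5-min', and 'a' columns as in the sample data:

import pandas as pd

df = your_df

# std of 'a' within each minute, broadcast to every row of that minute
df['stdevs'] = df.groupby('min')['a'].transform('std')

# mean of those per-minute stds within each 5-minute block, broadcast to every row
df['meandev'] = df.groupby('5-min')['stdevs'].transform('mean')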

Not 100% clear on what you are asking... but I think this is what you need:
data.groupby(['min','5-min']).std().groupby('5-min').mean()
This finds the standard deviation of a within each 'min' group and then takes the mean of those values within each '5-min' group.
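If you want to keep track of which std belongs to which 5-min group, you can include '5-min' in the group keys so it stays in the index of the intermediate result. A sketch, assuming the column names from the sample data:

# std of 'a' within each minute, indexed by (5-min, min)
std_per_min = data.groupby(['5-min', 'min'])['a'].std()

# mean of those stds within each 5-min block
mean_per_5min = std_per_min.groupby(level='5-min').mean()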

Related

Get Row in Other Table

I have a DataFrame 'df'. Using the validation data validData, I want to compute the response rate (Florence = 1/Yes) using rfm_aboveavg (the RFM combinations whose response rates are above the overall response rate). The response rate considers 0/No and 1/Yes, so it is rfm_crosstab[1] / rfm_crosstab['All'].
Using the results from the validation data, I want to display only the rows whose RFM values also appear in the training data output. How do I do this?
Data: 'df'
Seq# ID# Gender M R F FirstPurch ChildBks YouthBks CookBks ... ItalCook ItalAtlas ItalArt Florence Related Purchase Mcode Rcode Fcode Yes_Florence No_Florence
0 1 25 1 297 14 2 22 0 1 1 ... 0 0 0 0 0 5 4 2 0 1
1 2 29 0 128 8 2 10 0 0 0 ... 0 0 0 0 0 4 3 2 0 1
2 3 46 1 138 22 7 56 2 1 2 ... 1 0 0 0 2 4 4 3 0 1
3 4 47 1 228 2 1 2 0 0 0 ... 0 0 0 0 0 5 1 1 0 1
4 5 51 1 257 10 1 10 0 0 0 ... 0 0 0 0 0 5 3 1 0 1
My code: Crosstab for training data trainData
from sklearn.model_selection import train_test_split
import pandas as pd

trainData, validData = train_test_split(df, test_size=0.4, random_state=1)

# Response rate for training data as a whole
responseRate = (sum(trainData.Florence == 1) / sum(trainData.Florence == 0)) * 100

# Response rate for RFM categories
# RFM: combine R, F, M categories into one category
trainData['RFM'] = trainData['Mcode'].astype(str) + trainData['Rcode'].astype(str) + trainData['Fcode'].astype(str)
rfm_crosstab = pd.crosstab(index=[trainData['RFM']], columns=trainData['Florence'], margins=True)
rfm_crosstab['Percentage of 1/Yes'] = 100 * (rfm_crosstab[1] / rfm_crosstab['All'])

# RFM combinations with response rates above the overall response rate
rfm_aboveavg = rfm_crosstab['Percentage of 1/Yes'] > responseRate
rfm_crosstab[rfm_aboveavg]
Output: Training data
Florence 0 1 All Percentage of 1/Yes
RFM
121 3 2 5 40.000000
131 9 1 10 10.000000
212 1 2 3 66.666667
221 6 3 9 33.333333
222 6 1 7 14.285714
313 2 1 3 33.333333
321 17 3 20 15.000000
322 20 4 24 16.666667
323 2 1 3 33.333333
341 61 10 71 14.084507
343 17 2 19 10.526316
411 12 3 15 20.000000
422 26 5 31 16.129032
423 32 8 40 20.000000
441 96 12 108 11.111111
511 19 4 23 17.391304
513 44 8 52 15.384615
521 24 5 29 17.241379
523 74 16 90 17.777778
533 177 28 205 13.658537
My code: Crosstab for validation data validData
# Response rate for RFM categories
# RFM: Combine R, F, M categories into one category
validData['RFM'] = validData['Mcode'].astype(str) + validData['Rcode'].astype(str) + validData['Fcode'].astype(str)
rfm_crosstab1 = pd.crosstab(index = [validData['RFM']], columns = validData['Florence'], margins = True)
rfm_crosstab1['Percentage of 1/Yes'] = 100 * (rfm_crosstab1[1] / rfm_crosstab1['All'])
rfm_crosstab1
Output: Validation data
Florence 0 1 All Percentage of 1/Yes
RFM
131 3 1 4 25.000000
141 8 0 8 0.000000
211 2 1 3 33.333333
212 2 0 2 0.000000
213 0 1 1 100.000000
221 5 0 5 0.000000
222 2 0 2 0.000000
231 21 1 22 4.545455
232 3 0 3 0.000000
233 1 0 1 0.000000
241 11 1 12 8.333333
242 8 0 8 0.000000
243 2 0 2 0.000000
311 7 0 7 0.000000
312 8 0 8 0.000000
313 1 0 1 0.000000
321 12 0 12 0.000000
322 13 0 13 0.000000
323 4 1 5 20.000000
331 19 1 20 5.000000
332 25 2 27 7.407407
333 11 1 12 8.333333
341 36 2 38 5.263158
342 30 2 32 6.250000
343 12 0 12 0.000000
411 8 2 10 20.000000
412 7 0 7 0.000000
413 13 1 14 7.142857
421 21 2 23 8.695652
422 30 1 31 3.225806
423 26 1 27 3.703704
431 51 3 54 5.555556
432 42 7 49 14.285714
433 41 5 46 10.869565
441 68 2 70 2.857143
442 78 3 81 3.703704
443 70 5 75 6.666667
511 17 0 17 0.000000
512 13 1 14 7.142857
513 26 6 32 18.750000
521 19 1 20 5.000000
522 25 6 31 19.354839
523 50 6 56 10.714286
531 66 3 69 4.347826
532 65 3 68 4.411765
533 128 24 152 15.789474
541 86 7 93 7.526882
542 100 6 106 5.660377
543 178 17 195 8.717949
All 1474 126 1600 7.875000
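One way to do the final filtering (a sketch, not from the original thread, reusing the crosstab variables defined above) is to keep only the validation rows whose RFM index values appear in the filtered training output:

# RFM combinations that were above average in the training data
train_rfm = rfm_crosstab[rfm_aboveavg].index

# keep only those RFM rows in the validation crosstab
# (the 'All' margins row matches too; drop it if you don't want it)
rfm_crosstab1[rfm_crosstab1.index.isin(train_rfm)]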

Transform each group in a DataFrame

I have the following DataFrame:
id x y timestamp sensorTime
1 32 30 1031 2002
1 4 105 1035 2005
1 8 110 1050 2006
2 18 10 1500 3600
2 40 20 1550 3610
2 80 10 1450 3620
....
import pandas as pd
import numpy as np
df = pd.DataFrame(np.array([[1, 1, 1, 2, 2, 2],
                            [32, 4, 8, 18, 40, 80],
                            [30, 105, 110, 10, 20, 10],
                            [1031, 1035, 1050, 1500, 1550, 1450],
                            [2002, 2005, 2006, 3600, 3610, 3620]])).T
df.columns = ['id', 'x', 'y', 'timestamp', 'sensorTime']
For each group grouped by id I would like to add the differences of the sensorTime to the first value of timestamp. Something like the following:
start = df.iloc[0]['timestamp']
df['sensorTime'] -= df.iloc[0]['sensorTime']
df['sensorTime'] += start
But I would like to do this for each id group separately.
The resulting DataFrame should be:
id x y timestamp sensorTime
1 32 30 1031 1031
1 4 105 1035 1034
1 8 110 1050 1035
2 18 10 1500 1500
2 40 20 1550 1510
2 80 10 1450 1520
....
How can this operation be done per group?
df
id x y timestamp sensorTime
0 1 32 30 1031 2002
1 1 4 105 1035 2005
2 1 8 110 1050 2006
3 2 18 10 1500 3600
4 2 40 20 1550 3610
5 2 80 10 1450 3620
You can group by id and select both timestamp and sensorTime. Then you can use diff to get the difference of sensorTime. The first value will be NaN, and you can replace it with the first value of timestamp for that group. Then you can simply take the cumsum to get the desired output.
def func(x):
    # difference between consecutive sensorTime values within the group
    diff = x['sensorTime'].diff()
    # seed the first value with the group's first timestamp
    diff.iloc[0] = x['timestamp'].iloc[0]
    return diff.cumsum().to_frame()

df['sensorTime'] = df.groupby('id')[['timestamp', 'sensorTime']].apply(func)
df
id x y timestamp sensorTime
0 1 32 30 1031 1031.0
1 1 4 105 1035 1034.0
2 1 8 110 1050 1035.0
3 2 18 10 1500 1500.0
4 2 40 20 1550 1510.0
5 2 80 10 1450 1520.0
You could run a groupby twice: first to get the difference in sensorTime, then to do the cumulative sum:
box = df.groupby("id").sensorTime.transform("diff")
df.assign(
    new_sensorTime=np.where(box.isna(), df.timestamp, box),
    new=lambda x: x.groupby("id")["new_sensorTime"].cumsum(),
).drop(columns="new_sensorTime")
id x y timestamp sensorTime new
0 1 32 30 1031 2002 1031.0
1 1 4 105 1035 2005 1034.0
2 1 8 110 1050 2006 1035.0
3 2 18 10 1500 3600 1500.0
4 2 40 20 1550 3610 1510.0
5 2 80 10 1450 3620 1520.0
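A shorter alternative, sketched here under the same assumptions (the df from the question), avoids diff/cumsum entirely: since the cumulative sum of the differences is just the offset from the group's first sensorTime, you can add that offset to the group's first timestamp directly.

first_ts = df.groupby('id')['timestamp'].transform('first')
first_st = df.groupby('id')['sensorTime'].transform('first')

# first timestamp of the group plus the offset of sensorTime from its first value
df['sensorTime'] = first_ts + (df['sensorTime'] - first_st)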

How to do Stock Trading Back Test Using Pandas and Basic Iteration?

I want to evaluate a trading strategy, given the amount I invest in a particular stock. Basically, when I see "K-Class" is 1, I buy; when I see "K-Class" is 0, I sell. To keep it simple enough, we can ignore the open, high, and low values and just use the close price.
We do want to iterate over the whole Series, following 1 = buy, 0 = sell, regardless of whether the signal is right or wrong.
I have a pandas DataFrame with a Series called "K_Class", a boolean, say 1 (buy) and 0 (sell).
From the first day 'K_Class' appears as 1, I buy; if the next day it is 0, I sell immediately at the close price.
How can I write a for loop to work out the overall invested money and invested time? (using pandas and Python techniques)
Please feel free to add more variables.
I have:
invest_amount = 10000
stock_owned = 10000 / p1  # p1 = the close price on the first day K_Class is 1
invest_time = 0
Time Close K_Class
0 2017/03/06 31.72 0
1 2017/03/08 33.99 0
2 2017/03/09 32.02 0
3 2017/03/10 30.66 0
4 2017/03/13 30.94 1
5 2017/03/15 32.56 1
6 2017/03/17 33.31 0
7 2017/03/20 34.07 1
8 2017/03/22 34.40 0
9 2017/03/24 32.98 1
10 2017/03/27 33.26 0
11 2017/03/28 31.60 0
12 2017/03/29 30.36 0
13 2017/03/30 28.83 0
14 2017/04/11 27.01 0
15 2017/04/12 24.31 0
16 2017/04/14 24.22 0
17 2017/04/17 21.80 0
18 2017/04/18 21.20 1
19 2017/04/19 23.32 1
20 2017/04/20 24.43 0
21 2017/04/24 23.85 1
22 2017/04/26 23.97 1
23 2017/04/27 24.31 1
24 2017/04/28 23.50 1
25 2017/05/02 22.57 1
26 2017/05/03 22.67 1
27 2017/05/04 22.11 1
28 2017/05/05 21.26 1
29 2017/05/08 19.37 1
.. ... ... ...
275 2018/08/01 13.38 0
276 2018/08/03 12.49 0
277 2018/08/06 12.50 0
278 2018/08/07 12.78 0
279 2018/08/09 12.93 0
280 2018/08/10 13.15 0
281 2018/08/13 13.14 1
282 2018/08/14 13.15 0
283 2018/08/15 12.80 0
284 2018/08/17 12.29 0
285 2018/08/21 12.39 0
286 2018/08/22 12.15 0
287 2018/08/23 12.27 0
288 2018/08/24 12.31 0
289 2018/08/27 12.47 0
290 2018/08/29 12.31 0
291 2018/08/30 12.13 0
292 2018/08/31 11.69 0
293 2018/09/03 11.60 1
294 2018/09/04 11.65 0
295 2018/09/05 11.45 0
296 2018/09/07 11.42 0
297 2018/09/10 10.71 0
298 2018/09/11 10.76 1
299 2018/09/12 10.74 0
300 2018/09/13 10.85 1
301 2018/09/14 10.79 0
302 2018/09/18 10.58 1
303 2018/09/19 10.65 1
304 2018/09/21 10.73 1
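The original thread has no answer, but a minimal loop-based sketch could look like the following. It assumes the DataFrame above is called df with Close and K_Class columns, buys with the full cash balance on a 0 -> 1 signal, sells everything on a 1 -> 0 signal, and ignores fees; the variable names are illustrative only:

cash = 10000.0    # invest_amount
shares = 0.0      # shares currently held
invest_time = 0   # number of rows (days) spent holding the stock
trades = 0

for _, row in df.iterrows():
    price, signal = row['Close'], row['K_Class']
    if signal == 1 and shares == 0:
        # buy with all available cash at the close price
        shares = cash / price
        cash = 0.0
        trades += 1
    elif signal == 0 and shares > 0:
        # sell the whole position at the close price
        cash = shares * price
        shares = 0.0
        trades += 1
    if shares > 0:
        invest_time += 1

final_value = cash + shares * df['Close'].iloc[-1]
print(final_value, invest_time, trades)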

exclude row for rolling mean calculation in pandas

I am looking for a pandas way to solve this. I have a DataFrame:
df
A RM
0 384 NaN
1 376 380.0
2 399 387.5
3 333 366.0
4 393 363.0
5 323 358.0
6 510 416.5
7 426 468.0
8 352 389.0
I want to check whether the value in df['A'] is greater than the previous RM value; if so, the new Status column should be 0, otherwise 1:
A RM Status
0 384 NaN 0
1 376 380.0 1
2 399 387.5 0
3 333 366.0 1
4 393 363.0 0
5 323 358.0 1
6 510 416.5 0
7 426 468.0 0
8 352 389.0 1
I suppose I need to use shift with numpy.where, but I am not getting the desired result.
import pandas as pd
import numpy as np

df = pd.DataFrame([384, 376, 399, 333, 393, 323, 510, 426, 352], columns=['A'])
df['RM'] = df['A'].rolling(window=2, center=False).mean()
df['Status'] = np.where((df.A > df.RM.shift(1).rolling(window=2, center=False).mean()), 0, 1)
Finally, I apply the rolling mean only over the rows where Status is 1:
df['AverageMean'] = df[df['Status'] == 1]['A'].rolling(window=2, center=False).mean()
Just a simple shift:
df['Status'] = (df.A <= df.RM.fillna(9999).shift()).astype(int)
df
Out[347]:
A RM Status
0 384 NaN 0
1 376 380.0 1
2 399 387.5 0
3 333 366.0 1
4 393 363.0 0
5 323 358.0 1
6 510 416.5 0
7 426 468.0 0
8 352 389.0 1
I assume that when you compare with NaN, the result should always be 1:
df['Status'] = (df.A < df.RM.fillna(df.A.max() + 1).shift(1)).astype(int)
A RM Status
0 384 NaN 0
1 376 380.0 1
2 399 387.5 0
3 333 366.0 1
4 393 363.0 0
5 323 358.0 1
6 510 416.5 0
7 426 468.0 0
8 352 389.0 1

Python pandas groupby with cumsum and percentage

Given the following dataframe df:
app platform uuid minutes
0 1 0 a696ccf9-22cb-428b-adee-95c9a97a4581 67
1 2 0 8e17a2eb-f0ee-49ae-b8c2-c9f9926aa56d 1
2 2 1 40AD6CD1-4A7B-48DD-8815-1829C093A95C 13
3 1 0 26c1022a-7a8e-42a2-b7cc-bea6bffa7a6f 2
4 2 0 34271596-eebb-4423-b890-dc3761ed37ca 8
5 3 1 C57D0F52-B565-4322-85D2-C2798F7CA6FF 16
6 2 0 245501ec2e39cb782bab1fb02d7813b7 1
7 3 1 DE6E4714-5A3C-4C80-BD81-EAACB2364DF0 30
8 3 0 f88eb774-fdf3-4d1d-a91d-0b4ab95cf36e 10
9 2 0 9c08c860-7a6d-4810-a5c3-f3af2a3fcf66 470
10 3 1 19fdaedfd0dbdaf6a7a6b49619f11a19 3
11 3 1 AAF1CFF7-4564-4C79-B2D8-F0AAF9C9971B 58
12 2 0 4eb1024b-c293-42a4-95a2-31b20c3b524b 24
13 3 1 8E0B0BE3-8553-4F38-9837-6C907E01F84C 7
14 3 1 E8B2849C-F050-4DCD-B311-5D57015466AE 465
15 2 0 ec7fedb6-b118-424a-babe-b8ffad579685 266
16 1 0 7e302dcb-ceaf-406c-a9e5-66933d921064 184
17 2 0 f786528ded200c9f553dd3a5e9e9bb2d 10
18 3 1 1E291633-AF27-4DFB-8DA4-4A5B63F175CF 13
19 2 0 953a525c-97e0-4c2f-90e0-dfebde3ec20d 2408
I'll group it:
y=df.groupby(['app','platform','uuid']).sum().reset_index().sort(['app','platform','minutes'],ascending=[1,1,0]).set_index(['app','platform','uuid'])
minutes
app platform uuid
1 0 7e302dcb-ceaf-406c-a9e5-66933d921064 184
a696ccf9-22cb-428b-adee-95c9a97a4581 67
26c1022a-7a8e-42a2-b7cc-bea6bffa7a6f 2
2 0 953a525c-97e0-4c2f-90e0-dfebde3ec20d 2408
9c08c860-7a6d-4810-a5c3-f3af2a3fcf66 470
ec7fedb6-b118-424a-babe-b8ffad579685 266
4eb1024b-c293-42a4-95a2-31b20c3b524b 24
f786528ded200c9f553dd3a5e9e9bb2d 10
34271596-eebb-4423-b890-dc3761ed37ca 8
245501ec2e39cb782bab1fb02d7813b7 1
8e17a2eb-f0ee-49ae-b8c2-c9f9926aa56d 1
1 40AD6CD1-4A7B-48DD-8815-1829C093A95C 13
3 0 f88eb774-fdf3-4d1d-a91d-0b4ab95cf36e 10
1 E8B2849C-F050-4DCD-B311-5D57015466AE 465
AAF1CFF7-4564-4C79-B2D8-F0AAF9C9971B 58
DE6E4714-5A3C-4C80-BD81-EAACB2364DF0 30
C57D0F52-B565-4322-85D2-C2798F7CA6FF 16
1E291633-AF27-4DFB-8DA4-4A5B63F175CF 13
8E0B0BE3-8553-4F38-9837-6C907E01F84C 7
19fdaedfd0dbdaf6a7a6b49619f11a19 3
So now I have the minutes per uuid in descending order.
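Note that DataFrame.sort() used above was removed in later pandas versions; an equivalent grouping with the current API would be (a sketch):

y = (df.groupby(['app', 'platform', 'uuid']).sum()
       .reset_index()
       .sort_values(['app', 'platform', 'minutes'], ascending=[True, True, False])
       .set_index(['app', 'platform', 'uuid']))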
Now, I will take the cumulative sum of the minutes within each app/platform group:
y.groupby(level=[0,1]).cumsum()
app platform uuid
1 0 7e302dcb-ceaf-406c-a9e5-66933d921064 184
a696ccf9-22cb-428b-adee-95c9a97a4581 251
26c1022a-7a8e-42a2-b7cc-bea6bffa7a6f 253
2 0 953a525c-97e0-4c2f-90e0-dfebde3ec20d 2408
9c08c860-7a6d-4810-a5c3-f3af2a3fcf66 2878
ec7fedb6-b118-424a-babe-b8ffad579685 3144
4eb1024b-c293-42a4-95a2-31b20c3b524b 3168
f786528ded200c9f553dd3a5e9e9bb2d 3178
34271596-eebb-4423-b890-dc3761ed37ca 3186
245501ec2e39cb782bab1fb02d7813b7 3187
8e17a2eb-f0ee-49ae-b8c2-c9f9926aa56d 3188
1 40AD6CD1-4A7B-48DD-8815-1829C093A95C 13
3 0 f88eb774-fdf3-4d1d-a91d-0b4ab95cf36e 10
1 E8B2849C-F050-4DCD-B311-5D57015466AE 465
AAF1CFF7-4564-4C79-B2D8-F0AAF9C9971B 523
DE6E4714-5A3C-4C80-BD81-EAACB2364DF0 553
C57D0F52-B565-4322-85D2-C2798F7CA6FF 569
1E291633-AF27-4DFB-8DA4-4A5B63F175CF 582
8E0B0BE3-8553-4F38-9837-6C907E01F84C 589
19fdaedfd0dbdaf6a7a6b49619f11a19 592
My question is: how can I get the percentage against the total cumulative sum, per group, i.e., something like this:
app platform uuid
1 0 7e302dcb-ceaf-406c-a9e5-66933d921064 184 0.26
a696ccf9-22cb-428b-adee-95c9a97a4581 251 0.36
26c1022a-7a8e-42a2-b7cc-bea6bffa7a6f 253 0.36
...
...
...
It's not clear how you came up with 0.26 and 0.36 in your desired output - but assuming those are just dummy numbers, to get a running % of total for each group, you could do this:
y['cumsum'] = y.groupby(level=[0,1]).cumsum()
y['running_pct'] = y.groupby(level=[0,1])['cumsum'].transform(lambda x: x / x.iloc[-1])
Should give the right output.
In [398]: y['running_pct'].head()
Out[398]:
app platform uuid
1 0 7e302dcb-ceaf-406c-a9e5-66933d921064 0.727273
a696ccf9-22cb-428b-adee-95c9a97a4581 0.992095
26c1022a-7a8e-42a2-b7cc-bea6bffa7a6f 1.000000
2 0 953a525c-97e0-4c2f-90e0-dfebde3ec20d 0.755332
9c08c860-7a6d-4810-a5c3-f3af2a3fcf66 0.902760
Name: running_pct, dtype: float64
EDIT:
Per the comments, if you're looking to wring out a little more performance, this will be faster as of version 0.14.1
y['cumsum'] = y.groupby(level=[0,1])['minutes'].transform('cumsum')
y['running_pct'] = y['cumsum'] / y.groupby(level=[0,1])['minutes'].transform('sum')
And as #Jeff notes, in 0.15.0 this may be faster yet.
y['running_pct'] = y['cumsum'] / y.groupby(level=[0,1])['minutes'].transform('last')
