exclude row for rolling mean calculation in pandas - python

I am looking for a pandas way to solve this. I have a DataFrame:
df
A RM
0 384 NaN
1 376 380.0
2 399 387.5
3 333 366.0
4 393 363.0
5 323 358.0
6 510 416.5
7 426 468.0
8 352 389.0
If the value in df['A'] is greater than the previous row's RM value, the new column Status should be 0; otherwise it should be 1:
A RM Status
0 384 NaN 0
1 376 380.0 1
2 399 387.5 0
3 333 366.0 1
4 393 363.0 0
5 323 358.0 1
6 510 416.5 0
7 426 468.0 0
8 352 389.0 1
I suppose I need to use shift with numpy.where, but I am not getting the desired result.
import pandas as pd
import numpy as np
df=pd.DataFrame([384,376,399,333,393,323,510,426,352], columns=['A'])
df['RM']=df['A'].rolling(window=2,center=False).mean()
df['Status'] = np.where((df.A > df.RM.shift(1).rolling(window=2,center=False).mean()) , 0, 1)
Finally, I apply the rolling mean only to the rows where Status is 1:
df['AverageMean']=df[df['Status'] == 1]['A'].rolling(window=2,center=False).mean()

Just a simple shift:
df['Status']=(df.A<=df.RM.fillna(9999).shift()).astype(int)
df
Out[347]:
A RM Status
0 384 NaN 0
1 376 380.0 1
2 399 387.5 0
3 333 366.0 1
4 393 363.0 0
5 323 358.0 1
6 510 416.5 0
7 426 468.0 0
8 352 389.0 1

I assume that a comparison against a NaN RM value should always give 1:
df['Status'] = (df.A < df.RM.fillna(df.A.max()+1).shift(1)).astype(int)
A RM Status
0 384 NaN 0
1 376 380.0 1
2 399 387.5 0
3 333 366.0 1
4 393 363.0 0
5 323 358.0 1
6 510 416.5 0
7 426 468.0 0
8 352 389.0 1
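Putting the pieces together for the title question, here is a minimal sketch that assumes the goal is a rolling mean of A over only the rows where Status is 1. It uses np.inf as the sentinel instead of 9999, which is a choice made for this sketch and not something taken from the answers above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [384, 376, 399, 333, 393, 323, 510, 426, 352]})
df["RM"] = df["A"].rolling(window=2).mean()

# Previous row's RM. A missing RM is replaced by +inf so the comparison is True,
# while row 0 (no previous row at all) stays NaN and compares False.
prev_rm = df["RM"].fillna(np.inf).shift()
df["Status"] = (df["A"] <= prev_rm).astype(int)

# Rolling mean over the Status == 1 rows only. Assignment aligns on the index,
# so the excluded rows are left as NaN.
df["AverageMean"] = df.loc[df["Status"] == 1, "A"].rolling(window=2).mean()
print(df)
```

Because the filtered Series keeps its original index labels, the window of 2 spans the two most recent Status == 1 rows regardless of how many excluded rows sit between them.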

Related

Get Row in Other Table

I have a dataframe 'df'. Using the validation data validData, I want to compute the response rate (Florence = 1/Yes) using the rfm_aboveavg (RFM combinations response rates above the overall response). Response rate is given by considering 0/No and 1/Yes, so it would be rfm_crosstab[1] / rfm_crosstab['All'].
Using the results from the validation data, I want to only display the rows that are also shown in the training data output by the RFM column. How do I do this?
Data: 'df'
Seq# ID# Gender M R F FirstPurch ChildBks YouthBks CookBks ... ItalCook ItalAtlas ItalArt Florence Related Purchase Mcode Rcode Fcode Yes_Florence No_Florence
0 1 25 1 297 14 2 22 0 1 1 ... 0 0 0 0 0 5 4 2 0 1
1 2 29 0 128 8 2 10 0 0 0 ... 0 0 0 0 0 4 3 2 0 1
2 3 46 1 138 22 7 56 2 1 2 ... 1 0 0 0 2 4 4 3 0 1
3 4 47 1 228 2 1 2 0 0 0 ... 0 0 0 0 0 5 1 1 0 1
4 5 51 1 257 10 1 10 0 0 0 ... 0 0 0 0 0 5 3 1 0 1
My code: Crosstab for training data trainData
trainData, validData = train_test_split(df, test_size=0.4, random_state=1)
# Response rate for training data as a whole
responseRate = (sum(trainData.Florence == 1) / sum(trainData.Florence == 0)) * 100
# Response rate for RFM categories
# RFM: Combine R, F, M categories into one category
trainData['RFM'] = trainData['Mcode'].astype(str) + trainData['Rcode'].astype(str) + trainData['Fcode'].astype(str)
rfm_crosstab = pd.crosstab(index = [trainData['RFM']], columns = trainData['Florence'], margins = True)
rfm_crosstab['Percentage of 1/Yes'] = 100 * (rfm_crosstab[1] / rfm_crosstab['All'])
# RFM combinations response rates above the overall response
rfm_aboveavg = rfm_crosstab['Percentage of 1/Yes'] > responseRate
rfm_crosstab[rfm_aboveavg]
Output: Training data
Florence 0 1 All Percentage of 1/Yes
RFM
121 3 2 5 40.000000
131 9 1 10 10.000000
212 1 2 3 66.666667
221 6 3 9 33.333333
222 6 1 7 14.285714
313 2 1 3 33.333333
321 17 3 20 15.000000
322 20 4 24 16.666667
323 2 1 3 33.333333
341 61 10 71 14.084507
343 17 2 19 10.526316
411 12 3 15 20.000000
422 26 5 31 16.129032
423 32 8 40 20.000000
441 96 12 108 11.111111
511 19 4 23 17.391304
513 44 8 52 15.384615
521 24 5 29 17.241379
523 74 16 90 17.777778
533 177 28 205 13.658537
My code: Crosstab for validation data validData
# Response rate for RFM categories
# RFM: Combine R, F, M categories into one category
validData['RFM'] = validData['Mcode'].astype(str) + validData['Rcode'].astype(str) + validData['Fcode'].astype(str)
rfm_crosstab1 = pd.crosstab(index = [validData['RFM']], columns = validData['Florence'], margins = True)
rfm_crosstab1['Percentage of 1/Yes'] = 100 * (rfm_crosstab1[1] / rfm_crosstab1['All'])
rfm_crosstab1
Output: Validation data
Florence 0 1 All Percentage of 1/Yes
RFM
131 3 1 4 25.000000
141 8 0 8 0.000000
211 2 1 3 33.333333
212 2 0 2 0.000000
213 0 1 1 100.000000
221 5 0 5 0.000000
222 2 0 2 0.000000
231 21 1 22 4.545455
232 3 0 3 0.000000
233 1 0 1 0.000000
241 11 1 12 8.333333
242 8 0 8 0.000000
243 2 0 2 0.000000
311 7 0 7 0.000000
312 8 0 8 0.000000
313 1 0 1 0.000000
321 12 0 12 0.000000
322 13 0 13 0.000000
323 4 1 5 20.000000
331 19 1 20 5.000000
332 25 2 27 7.407407
333 11 1 12 8.333333
341 36 2 38 5.263158
342 30 2 32 6.250000
343 12 0 12 0.000000
411 8 2 10 20.000000
412 7 0 7 0.000000
413 13 1 14 7.142857
421 21 2 23 8.695652
422 30 1 31 3.225806
423 26 1 27 3.703704
431 51 3 54 5.555556
432 42 7 49 14.285714
433 41 5 46 10.869565
441 68 2 70 2.857143
442 78 3 81 3.703704
443 70 5 75 6.666667
511 17 0 17 0.000000
512 13 1 14 7.142857
513 26 6 32 18.750000
521 19 1 20 5.000000
522 25 6 31 19.354839
523 50 6 56 10.714286
531 66 3 69 4.347826
532 65 3 68 4.411765
533 128 24 152 15.789474
541 86 7 93 7.526882
542 100 6 106 5.660377
543 178 17 195 8.717949
All 1474 126 1600 7.875000
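One possible way to keep only the validation rows whose RFM code also appears in the above-average training output is to filter the validation crosstab by index membership. The sketch below uses tiny made-up frames (train, valid) in place of trainData and validData, a hypothetical helper rfm_rates, and takes the overall response rate as the share of 1s among all rows; all of these are assumptions for illustration.

```python
import pandas as pd

# Made-up stand-ins for trainData / validData (only the columns that matter here).
train = pd.DataFrame({
    "RFM": ["121", "121", "131", "131", "131", "212", "212", "212", "212", "212"],
    "Florence": [1, 0, 1, 0, 0, 0, 0, 0, 0, 0],
})
valid = pd.DataFrame({
    "RFM": ["131", "131", "131", "131", "212", "212", "313"],
    "Florence": [1, 0, 0, 0, 0, 0, 1],
})

def rfm_rates(frame):
    """Crosstab of RFM vs Florence with a response-rate column (hypothetical helper)."""
    tab = pd.crosstab(index=frame["RFM"], columns=frame["Florence"], margins=True)
    tab["Percentage of 1/Yes"] = 100 * tab[1] / tab["All"]
    return tab

train_tab = rfm_rates(train)
valid_tab = rfm_rates(valid)

# Overall training response rate, taken here as 1/Yes over all rows (an assumption).
overall = 100 * train["Florence"].mean()

# RFM codes that beat the overall rate in training; the margins row is dropped.
above = train_tab.index[train_tab["Percentage of 1/Yes"] > overall].drop("All", errors="ignore")

# Validation rows restricted to those RFM codes.
valid_above = valid_tab[valid_tab.index.isin(above)]
print(valid_above)
```

With the question's own objects, the last step would correspond to something like rfm_crosstab1[rfm_crosstab1.index.isin(rfm_crosstab[rfm_aboveavg].index)]. RFM codes that occur only in the training data are simply absent from the result.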

How to write this iteration?

I want to evaluate a trading strategy:
buy when K_Class is 1
sell when K_Class is 0
All prices refer to the Close column on that date.
For example:
Suppose I start with 10000. The first buy is on 2017/03/13 and the first sell is on 2017/03/17. The second buy is on 2017/03/20 and the second sell is on 2017/03/22.
My question: how do I calculate the amount of money I have at the end?
Time Close K_Class
0 2017/03/06 31.72 0
1 2017/03/08 33.99 0
2 2017/03/09 32.02 0
3 2017/03/10 30.66 0
4 2017/03/13 30.94 1
5 2017/03/15 32.56 1
6 2017/03/17 33.31 0
7 2017/03/20 34.07 1
8 2017/03/22 34.40 0
9 2017/03/24 32.98 1
10 2017/03/27 33.26 0
11 2017/03/28 31.60 0
12 2017/03/29 30.36 0
13 2017/03/30 28.83 0
14 2017/04/11 27.01 0
15 2017/04/12 24.31 0
16 2017/04/14 24.22 0
17 2017/04/17 21.80 0
18 2017/04/18 21.20 1
19 2017/04/19 23.32 1
20 2017/04/20 24.43 0
21 2017/04/24 23.85 1
22 2017/04/26 23.97 1
23 2017/04/27 24.31 1
24 2017/04/28 23.50 1
25 2017/05/02 22.57 1
26 2017/05/03 22.67 1
27 2017/05/04 22.11 1
28 2017/05/05 21.26 1
29 2017/05/08 19.37 1
.. ... ... ...
275 2018/08/01 13.38 0
276 2018/08/03 12.49 0
277 2018/08/06 12.50 0
278 2018/08/07 12.78 0
279 2018/08/09 12.93 0
280 2018/08/10 13.15 0
281 2018/08/13 13.14 1
282 2018/08/14 13.15 0
283 2018/08/15 12.80 0
284 2018/08/17 12.29 0
285 2018/08/21 12.39 0
286 2018/08/22 12.15 0
287 2018/08/23 12.27 0
288 2018/08/24 12.31 0
289 2018/08/27 12.47 0
290 2018/08/29 12.31 0
291 2018/08/30 12.13 0
292 2018/08/31 11.69 0
293 2018/09/03 11.60 1
294 2018/09/04 11.65 0
295 2018/09/05 11.45 0
296 2018/09/07 11.42 0
297 2018/09/10 10.71 0
298 2018/09/11 10.76 1
299 2018/09/12 10.74 0
300 2018/09/13 10.85 1
301 2018/09/14 10.79 0
302 2018/09/18 10.58 1
303 2018/09/19 10.65 1
304 2018/09/21 10.73 1
You can start with this:
df = pd.DataFrame({'price':np.arange(10), 'class':np.random.randint(2, size=10)})
df['diff'] = -1 * df['class'].diff()
df.loc[0,['diff']] = -1 * df.loc[0,['class']].values
df['money'] = df['price']*df['diff']
Here 'diff' represents the buy and sell actions (-1 for buy and +1 for sell). Multiplying it by the price gives the change in your money for each action (this treats every trade as one unit of stock). Sum these changes and add your initial money to get your final money.
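df['diff'] = df['K_Class'].diff()
stock_sell = 0
current_amount = 10000
for n in range(0, df.index.size-1):
    # diff == 1 means K_Class went 0 -> 1: buy with all the money at the close
    if df['diff'].iloc[n] == 1:
        stock_sell = current_amount/df['Close'].iloc[n]
    # diff == -1 means K_Class went 1 -> 0: sell everything at the close
    if df['diff'].iloc[n] == -1:
        current_amount = stock_sell*df['Close'].iloc[n]
print(current_amount)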

How to do Stock Trading Back Test Using Pandas and Basic Iteration?

I want to evaluate a trading strategy, given the amount I invest in a particular stock. Basically, when I see "K-Class" is 1, I buy; when I see "K-Class" is 0, I sell. To keep it simple enough, we can ignore the open, high and low values and just use the close price.
We do want to iterate over the whole Series, following 1=buy, 0=sell, whether or not the signal turns out to be right.
I have a pandas DataFrame with a Series called "K-Class", a boolean: 1 (buy) and 0 (sell).
From the first day "K-Class" is 1, I buy; if the next day is 0, I sell immediately at the close price.
How can I write a for loop to compute the final amount of money and the time invested (using pandas and Python techniques)?
Please feel free to add more variables.
So far I have:
invest_amount = 10000
stock_owned = 10000/ p1 #the first day appears 1, return the close price
invest_time = 0
Time Close K_Class
0 2017/03/06 31.72 0
1 2017/03/08 33.99 0
2 2017/03/09 32.02 0
3 2017/03/10 30.66 0
4 2017/03/13 30.94 1
5 2017/03/15 32.56 1
6 2017/03/17 33.31 0
7 2017/03/20 34.07 1
8 2017/03/22 34.40 0
9 2017/03/24 32.98 1
10 2017/03/27 33.26 0
11 2017/03/28 31.60 0
12 2017/03/29 30.36 0
13 2017/03/30 28.83 0
14 2017/04/11 27.01 0
15 2017/04/12 24.31 0
16 2017/04/14 24.22 0
17 2017/04/17 21.80 0
18 2017/04/18 21.20 1
19 2017/04/19 23.32 1
20 2017/04/20 24.43 0
21 2017/04/24 23.85 1
22 2017/04/26 23.97 1
23 2017/04/27 24.31 1
24 2017/04/28 23.50 1
25 2017/05/02 22.57 1
26 2017/05/03 22.67 1
27 2017/05/04 22.11 1
28 2017/05/05 21.26 1
29 2017/05/08 19.37 1
.. ... ... ...
275 2018/08/01 13.38 0
276 2018/08/03 12.49 0
277 2018/08/06 12.50 0
278 2018/08/07 12.78 0
279 2018/08/09 12.93 0
280 2018/08/10 13.15 0
281 2018/08/13 13.14 1
282 2018/08/14 13.15 0
283 2018/08/15 12.80 0
284 2018/08/17 12.29 0
285 2018/08/21 12.39 0
286 2018/08/22 12.15 0
287 2018/08/23 12.27 0
288 2018/08/24 12.31 0
289 2018/08/27 12.47 0
290 2018/08/29 12.31 0
291 2018/08/30 12.13 0
292 2018/08/31 11.69 0
293 2018/09/03 11.60 1
294 2018/09/04 11.65 0
295 2018/09/05 11.45 0
296 2018/09/07 11.42 0
297 2018/09/10 10.71 0
298 2018/09/11 10.76 1
299 2018/09/12 10.74 0
300 2018/09/13 10.85 1
301 2018/09/14 10.79 0
302 2018/09/18 10.58 1
303 2018/09/19 10.65 1
304 2018/09/21 10.73 1
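A minimal sketch of such a loop, under stated assumptions: all cash goes into the stock on a buy, everything is sold on a sell, repeated 1s while already holding (or 0s while in cash) do nothing, and there are no fees. The function name backtest, the return values, and counting "invest time" as the number of rows on which a position is held after the close are choices made for this sketch. The sample frame holds the first 11 rows of the table above.

```python
import pandas as pd

sample = pd.DataFrame({
    "Time": ["2017/03/06", "2017/03/08", "2017/03/09", "2017/03/10",
             "2017/03/13", "2017/03/15", "2017/03/17", "2017/03/20",
             "2017/03/22", "2017/03/24", "2017/03/27"],
    "Close": [31.72, 33.99, 32.02, 30.66, 30.94, 32.56,
              33.31, 34.07, 34.40, 32.98, 33.26],
    "K_Class": [0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0],
})

def backtest(frame, start_cash=10000.0):
    """All-in / all-out backtest on the Close price (hypothetical helper)."""
    cash, shares = start_cash, 0.0
    round_trips = 0      # completed buy -> sell pairs
    days_invested = 0    # rows on which a position is held after the close
    for close, signal in zip(frame["Close"], frame["K_Class"]):
        if signal == 1 and shares == 0.0:
            shares, cash = cash / close, 0.0      # buy with all cash
        elif signal == 0 and shares > 0.0:
            cash, shares = shares * close, 0.0    # sell everything
            round_trips += 1
        if shares > 0.0:
            days_invested += 1
    # Value any still-open position at the last close.
    final_value = cash + shares * frame["Close"].iloc[-1]
    return final_value, round_trips, days_invested

final_value, round_trips, days_invested = backtest(sample)
print(final_value, round_trips, days_invested)
```

On this sample the buys fall on 2017/03/13, 2017/03/20 and 2017/03/24, matching the dates in the example above. Passing the full DataFrame instead of sample runs the same loop over every row.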

group by within group by in pandas

Consider the following dataset:
min 5-min a
0 0 800
0 0 801
1 0 802
1 0 803
1 0 804
2 0 805
2 0 805
2 0 810
3 0 801
3 0 802
3 0 803
4 0 804
4 0 805
5 1 806
5 1 800
5 1 890
6 1 890
6 1 880
6 1 800
7 1 804
7 1 806
8 1 801
9 1 800
9 1 900
10 1 770
10 1 803
10 1 811
I need to calculate the std of a for each group based on the minute, and then calculate the mean of the resulting values within each 5-minute group.
I do not know how to find the boundaries of the 5-minute groups after calculating the std.
How should I store the data so that I know which std belongs to which 5-minute group?
data.groupby('min').a.std()
I would appreciate any help.
Taskos's answer is great, but I wasn't sure whether you needed the results pushed back into the DataFrame. Assuming you want to add the new columns to the parent DataFrame after each groupby operation, I've opted to do that as follows:
import pandas as pd
df = your_df  # placeholder: your own DataFrame
# First we create the standard deviation column
def add_std(grp):
    grp['stdevs'] = grp['a'].std()
    return grp
# group_keys=False keeps the original flat index
df = df.groupby('min', group_keys=False).apply(add_std)
# Next we create the 5 minute mean column
def add_meandev(grp):
    grp['meandev'] = grp['stdevs'].mean()
    return grp
print(df.groupby('5-min', group_keys=False).apply(add_meandev))
This can be done more elegantly by chaining, but I have laid it out like this so that the underlying process is more visible.
The final output from this will look like the following:
min 5-min a stdevs meandev
0 0 0 800 0.707107 1.345283
1 0 0 801 0.707107 1.345283
2 1 0 802 1.000000 1.345283
3 1 0 803 1.000000 1.345283
4 1 0 804 1.000000 1.345283
5 2 0 805 2.886751 1.345283
6 2 0 805 2.886751 1.345283
7 2 0 810 2.886751 1.345283
8 3 0 801 1.000000 1.345283
9 3 0 802 1.000000 1.345283
10 3 0 803 1.000000 1.345283
11 4 0 804 0.707107 1.345283
12 4 0 805 0.707107 1.345283
13 5 1 806 50.318983 39.107147
14 5 1 800 50.318983 39.107147
15 5 1 890 50.318983 39.107147
16 6 1 890 49.328829 39.107147
17 6 1 880 49.328829 39.107147
18 6 1 800 49.328829 39.107147
19 7 1 804 1.414214 39.107147
20 7 1 806 1.414214 39.107147
21 8 1 801 NaN 39.107147
22 9 1 800 70.710678 39.107147
23 9 1 900 70.710678 39.107147
24 10 1 770 21.733231 39.107147
25 10 1 803 21.733231 39.107147
26 10 1 811 21.733231 39.107147
Not 100% clear on what you are asking... but I think this is what you need:
data.groupby(['min','5-min']).std().groupby('5-min').mean()
This computes the standard deviation within each 'min' group, and then takes the mean of those standard deviations within each '5-min' group.
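A minimal sketch of this two-step computation on the question's data, assuming the goal is the plain mean of the per-minute standard deviations within each 5-min group. The variable names per_minute and per_block are made up for this sketch.

```python
import pandas as pd

data = pd.DataFrame({
    "min": [0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4,
            5, 5, 5, 6, 6, 6, 7, 7, 8, 9, 9, 10, 10, 10],
    "5-min": [0] * 13 + [1] * 14,
    "a": [800, 801, 802, 803, 804, 805, 805, 810, 801, 802, 803, 804, 805,
          806, 800, 890, 890, 880, 800, 804, 806, 801, 800, 900, 770, 803, 811],
})

# Step 1: one standard deviation per minute. Keeping '5-min' in the group keys
# records which 5-minute block each minute belongs to.
per_minute = data.groupby(["5-min", "min"])["a"].std()

# Step 2: average those per-minute values inside each 5-minute block.
# A minute with a single row has std NaN, and mean() skips NaN by default.
per_block = per_minute.groupby(level="5-min").mean()

print(per_minute)
print(per_block)
```

This gives every minute equal weight. The meandev column in the other answer averages over rows instead, so minutes with more rows count more and the two results differ slightly.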

Splitting the header into multiple headers in DataFrame

I have a DataFrame whose header I need to split into multiple header rows.
The DataFrame looks like this:
gene ALL_ID_1 AML_ID_1 AML_ID_2 AML_ID_3 AML_ID_4 AML_ID_5 Stroma_ID_1 Stroma_ID_2 Stroma_ID_3 Stroma_ID_4 Stroma_ID_5 Stroma_CR_Pat_4 Stroma_CR_Pat_5 Stroma_CR_Pat_6 Stroma_CR_Pat_7 Stroma_CR_Pat_8
ENSG 8 1 11 5 10 0 628 542 767 578 462 680 513 968 415 623
ENSG 0 0 1 0 0 0 0 28 1 3 0 1 4 0 0 0
ENSG 661 1418 2580 6817 14727 5968 9 3 5 9 2 9 3 3 5 1
ENSG 20 315 212 8 790 471 1283 2042 1175 2839 1110 857 1880 1526 2262 2624
ENSG 11 26 24 9 11 2 649 532 953 463 468 878 587 245 722 484
And I want the above header to be split as follows:
network ID ID REL
node B_ALL AML Stroma
hemi 1 1 2 3 4 5 1 2 3 4 5 6 7 8 9 10
ENSG 8 1 11 5 10 0 628 542 767 578 462 680 513 968 415 623
ENSG 0 0 1 0 0 0 0 28 1 3 0 1 4 0 0 0
ENSG 661 1418 2580 6817 14727 5968 9 3 5 9 2 9 3 3 5 1
ENSG 20 315 212 8 790 471 1283 2042 1175 2839 1110 857 1880 1526 2262 2624
ENSG 11 26 24 9 11 2 649 532 953 463 468 878 587 245 722 484
Any help is greatly appreciated.
This is probably not the best minimal example; very few people have the subject knowledge to understand what network, node and hemi mean in your context.
You just need to create a MultiIndex and replace your column index with it.
There are 3 rules in your example:
1. Whenever 'Stroma' is found, the column belongs to REL; otherwise it belongs to ID.
2. node is the first field of the original column name.
3. hemi is the last field of the original column name.
Then, just code away:
In [110]:
df.columns = pd.MultiIndex.from_tuples(zip(np.where(df.columns.str.find('Stroma')!=-1, 'REL', 'ID'),
                                           df.columns.map(lambda x: x.split('_')[0]),
                                           df.columns.map(lambda x: x.split('_')[-1])),
                                       names=['network', 'node', 'hemi'])
print(df)
network ID REL \
node ALL AML Stroma
hemi 1 1 2 3 4 5 1 2 3 4 5
gene
ENSG 8 1 11 5 10 0 628 542 767 578 462
ENSG 0 0 1 0 0 0 0 28 1 3 0
ENSG 661 1418 2580 6817 14727 5968 9 3 5 9 2
ENSG 20 315 212 8 790 471 1283 2042 1175 2839 1110
ENSG 11 26 24 9 11 2 649 532 953 463 468
network
node
hemi 4 5 6 7 8
gene
ENSG 680 513 968 415 623
ENSG 1 4 0 0 0
ENSG 9 3 3 5 1
ENSG 857 1880 1526 2262 2624
ENSG 878 587 245 722 484
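The same three rules can be sketched in a self-contained form, assuming gene is the index (as in the printed output above) and using a six-column subset of the question's data. Building the levels with list comprehensions and MultiIndex.from_arrays is a substitution made for this sketch, not the answer's exact code.

```python
import pandas as pd

# Subset of the question's columns and its first two rows; 'gene' is the index.
df = pd.DataFrame(
    [[8, 1, 11, 628, 542, 680],
     [0, 0, 1, 0, 28, 1]],
    index=pd.Index(["ENSG", "ENSG"], name="gene"),
    columns=["ALL_ID_1", "AML_ID_1", "AML_ID_2",
             "Stroma_ID_1", "Stroma_ID_2", "Stroma_CR_Pat_4"],
)

names = list(df.columns)
df.columns = pd.MultiIndex.from_arrays(
    [
        ["REL" if "Stroma" in n else "ID" for n in names],  # rule 1: network
        [n.split("_")[0] for n in names],                   # rule 2: node
        [n.split("_")[-1] for n in names],                  # rule 3: hemi
    ],
    names=["network", "node", "hemi"],
)
print(df)
```

Once the columns are a MultiIndex, df["REL"] or df["ID"]["AML"] selects a whole block of columns at once.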
