I have a dataframe with depth and other value columns:
data = {'Depth': [1.0, 1.0, 1.5, 2.0, 2.5, 2.5, 3.0, 3.5, 4.0, 4.0, 5.0, 5.5, 6.0],
'Value1':[44, 46, 221, 12, 47, 44, 67, 90, 100, 111, 112, 120, 122],
'Value2': [55, 65, 76, 45, 55, 58, 23, 12, 32, 20, 22, 26, 36]}
df = pd.DataFrame(data)
As you can see, there are sometimes repetitions in Depth.
I'd like to be able to group by intervals and average over them.
For example, given a list of interval widths:
intervals = [1.0, 2.0]
I'd like to break the data set up on those intervals and average each value column (Value1, Value2) per interval, to get:
    Depth  Value1  Value2  Avg1_1  Avg2_1  Avg1_2  Avg2_2
0     1.0      44      55   80.75   60.25    78.2       .
1     1.0      46      65   80.75   60.25    78.2       .
2     1.5     221      76   80.75   60.25    78.2       .
3     2.0      12      45   80.75   60.25    78.2
4     2.5      47      55   52.67       .    78.2
5     2.5      44      58   52.67       .    78.2
6     3.0      67      23   52.67       .    78.2
7     3.5      90      12  100.33            78.2
8     4.0     100      32  100.33            78.2
9     4.0     111      20  100.33            78.2
10    5.0     112      22     112       .
11    5.5     120      26     121       .
12    6.0     122      36     121       .
Where Avg1_ is the average of Value1 over every interval of width 1.0 (which includes 1.0-2.0, 2.5-3.0, etc.).
Is there an easy way to do this using groupby in a loop?
You can accomplish this with the dataframe's apply method, using a boolean mask to select the rows whose Depth falls within the window (e.g. up to depth + 1.0 or depth + 2.0) and averaging their values.
df['avg1_1'] = df.apply(lambda x: df.loc[df['Depth'] <= x['Depth'] + 1.0, 'Value1'].mean(), axis=1)
df['avg2_1'] = df.apply(lambda x: df.loc[df['Depth'] <= x['Depth'] + 1.0, 'Value2'].mean(), axis=1)
df['avg1_2'] = df.apply(lambda x: df.loc[df['Depth'] <= x['Depth'] + 2.0, 'Value1'].mean(), axis=1)
df['avg2_2'] = df.apply(lambda x: df.loc[df['Depth'] <= x['Depth'] + 2.0, 'Value2'].mean(), axis=1)
This would return:
    Depth  Value1  Value2     avg1_1     avg2_1     avg1_2     avg2_2
0     1.0      44      55  80.750000  60.250000  68.714286  53.857143
1     1.0      46      65  80.750000  60.250000  68.714286  53.857143
2     1.5     221      76  69.000000  59.000000  71.375000  48.625000
3     2.0      12      45  68.714286  53.857143  78.200000  44.100000
4     2.5      47      55  71.375000  48.625000  78.200000  44.100000
5     2.5      44      58  71.375000  48.625000  78.200000  44.100000
6     3.0      67      23  78.200000  44.100000  81.272727  42.090909
7     3.5      90      12  78.200000  44.100000  84.500000  40.750000
8     4.0     100      32  81.272727  42.090909  87.384615  40.384615
9     4.0     111      20  81.272727  42.090909  87.384615  40.384615
10    5.0     112      22  87.384615  40.384615  87.384615  40.384615
11    5.5     120      26  87.384615  40.384615  87.384615  40.384615
12    6.0     122      36  87.384615  40.384615  87.384615  40.384615
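If the goal is averages over fixed, non-overlapping depth bins (as in the desired output above) rather than the sliding windows used here, a vectorized alternative is to bin Depth with pd.cut and broadcast each bin's mean back onto its rows with groupby().transform(). This is a sketch, not the apply-based approach above; the 1.0-wide bin edges are an assumption matching the question's first interval width.

```python
import numpy as np
import pandas as pd

data = {'Depth': [1.0, 1.0, 1.5, 2.0, 2.5, 2.5, 3.0, 3.5, 4.0, 4.0, 5.0, 5.5, 6.0],
        'Value1': [44, 46, 221, 12, 47, 44, 67, 90, 100, 111, 112, 120, 122],
        'Value2': [55, 65, 76, 45, 55, 58, 23, 12, 32, 20, 22, 26, 36]}
df = pd.DataFrame(data)

# 1.0-wide bins: [1.0, 2.0], (2.0, 3.0], (3.0, 4.0], (4.0, 5.0], (5.0, 6.0]
bins = pd.cut(df['Depth'], bins=np.arange(1.0, 7.0), include_lowest=True)

# broadcast each bin's mean back onto every row in that bin
df['Avg1_1'] = df.groupby(bins)['Value1'].transform('mean')
df['Avg2_1'] = df.groupby(bins)['Value2'].transform('mean')
```

This reproduces the Avg1_1 column of the desired output (80.75 for the first four rows, 52.67 for depths 2.5 to 3.0, and so on); the 2.0-wide columns would use a different set of bin edges.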
I have a dataframe which holds three batteries' charging and discharging sequences:
Battery 1 Battery 2 Battery 3
0 32 3 -1
1 21 11 -31
2 23 27 63
3 12 -22 -22
4 -21 22 44
5 -66 6 66
6 -12 32 -52
7 -45 -45 -4
8 45 -55 -77
9 66 66 96
10 99 -39 -69
11 88 99 48
If the number is negative the battery is charging, and if it is positive it is discharging. So I summed all the battery columns per row and then tried to extract the charging and discharging sequences.
import pandas as pd
dic1 = {
'Battery 1': [32,21,23,12,-21,-66,-12,-45,45,66,99,88],
'Battery 2': [3,11,27,-22,22,6,32,-45,-55,66,-39,99],
'Battery 3': [-1,-31,63,-22,44,66,-52,-4,-77,96,-69,48]
}
df = pd.DataFrame(dic1)
bess = df.filter(like='Battery').sum(axis=1) # Adding all batteries
charging = bess[bess<=0].fillna(0) #Charging
discharging = bess[bess>0].fillna(0) #Discharging
bess['charging'] = charging #creating new column for charging
bess['discharging'] = discharging #creating new column for discharging
print(bess)
Expected output:
bess charging discharging
0 34 0.0 34.0
1 1 0.0 1.0
2 113 0.0 113.0
3 -32 -32.0 0.0
4 45 0.0 45.0
5 6 0.0 6.0
6 -32 -32.0 0.0
7 -94 -94.0 0.0
8 -87 -87.0 0.0
9 228 0.0 228.0
10 -9 -9.0 0.0
11 235 0.0 235.0
but instead this fillna is not filling in 0 values (the boolean mask drops the non-matching rows entirely, so there are no NaNs to fill) and it gives this output:
bess charging discharging
0 34 34
1 1 1
2 113 113
3 -32 -32
4 45 45
5 6 6
6 -32 -32
7 -94 -94
8 -87 -87
9 228 228
10 -9 -9
11 235 235
Change the lines here to use reindex:
charging = bess[bess<=0].reindex(df.index,fill_value=0) #Charging
discharging = bess[bess>0].reindex(df.index,fill_value=0) #Discharging
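Put together, a minimal runnable version of this fix could look like the following. Note that bess is a Series, so a new DataFrame (here named out, an arbitrary choice) is used to hold the columns:

```python
import pandas as pd

dic1 = {
    'Battery 1': [32, 21, 23, 12, -21, -66, -12, -45, 45, 66, 99, 88],
    'Battery 2': [3, 11, 27, -22, 22, 6, 32, -45, -55, 66, -39, 99],
    'Battery 3': [-1, -31, 63, -22, 44, 66, -52, -4, -77, 96, -69, 48],
}
df = pd.DataFrame(dic1)

bess = df.filter(like='Battery').sum(axis=1)  # total per row
out = pd.DataFrame({'bess': bess})
# reindex restores the rows dropped by the boolean mask and fills them with 0
out['charging'] = bess[bess <= 0].reindex(df.index, fill_value=0)
out['discharging'] = bess[bess > 0].reindex(df.index, fill_value=0)
```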
Here is a way using clip:
df = df.assign(bess=df.sum(axis=1),
               charging=df.sum(axis=1).clip(upper=0),
               discharging=df.sum(axis=1).clip(lower=0))
I have used seaborn's titanic dataset as a proxy for my very large dataset, and created the chart and summary data from it.
The following code runs without any errors:
import seaborn as sns
import pandas as pd
import numpy as np
sns.set_theme(style="darkgrid")
# Load the example Titanic dataset
df = sns.load_dataset("titanic")
# split fare into decile groups and order them
df['fare_grp'] = pd.qcut(df['fare'], q=10,labels=None, retbins=False, precision=0).astype(str)
df.groupby(['fare_grp'],dropna=False).size()
df['fare_grp_num'] = pd.qcut(df['fare'], q=10,labels=False, retbins=False, precision=0).astype(str)
df.groupby(['fare_grp_num'],dropna=False).size()
df['fare_ord_grp'] = df['fare_grp_num'] + ' ' +df['fare_grp']
df['fare_ord_grp']
# set variables
target = 'survived'
ydim = 'fare_ord_grp'
xdim = 'embark_town'
#del [result]
non_events = pd.DataFrame(df[df[target]==0].groupby([ydim,xdim],as_index=False, dropna=False)[target].count()).rename(columns={target: 'non_events'})
non_events[xdim]=non_events[xdim].replace(np.nan, 'Missing', regex=True)
non_events[ydim]=non_events[ydim].replace(np.nan, 'Missing', regex=True)
non_events_total = pd.DataFrame(df[df[target]==0].groupby([xdim],dropna=False,as_index=False)[target].count()).rename(columns={target: 'non_events_total_by_xdim'}).replace(np.nan, 'Missing', regex=True)
events = pd.DataFrame(df[df[target]==1].groupby([ydim,xdim],as_index=False, dropna=False)[target].count()).rename(columns={target: 'events'})
events[xdim]=events[xdim].replace(np.nan, 'Missing', regex=True)
events[ydim]=events[ydim].replace(np.nan, 'Missing', regex=True)
events_total = pd.DataFrame(df[df[target]==1].groupby([xdim],dropna=False,as_index=False)[target].count()).rename(columns={target: 'events_total_by_xdim'}).replace(np.nan, 'Missing', regex=True)
grand_total = pd.DataFrame(df.groupby([xdim],dropna=False,as_index=False)[target].count()).rename(columns={target: 'total_by_xdim'}).replace(np.nan, 'Missing', regex=True)
grand_total=grand_total.merge(non_events_total, how='left', on=xdim).merge(events_total, how='left', on=xdim)
result = pd.merge(non_events, events, how="outer",on=[ydim,xdim])
result['total'] = result['non_events'].fillna(0) + result['events'].fillna(0)
result[xdim] = result[xdim].replace(np.nan, 'Missing', regex=True)
result = pd.merge(result, grand_total, how="left",on=[xdim])
result['survival rate %'] = round(result['events']/result['total']*100,2)
result['% event dist by xdim'] = round(result['events']/result['events_total_by_xdim']*100,2)
result['% non-event dist by xdim'] = round(result['non_events']/result['non_events_total_by_xdim']*100,2)
result['% total dist by xdim'] = round(result['total']/result['total_by_xdim']*100,2)
display(result)
value_name1 = "% dist by " + str(xdim)
dfl = pd.melt(result, id_vars=[ydim, xdim],value_vars =['% total dist by xdim'], var_name = 'Type',value_name=value_name1).drop(columns='Type')
dfl2 = dfl.pivot(index=ydim, columns=xdim, values=value_name1)
print(dfl2)
title1 = "% dist by " + str(xdim)
ax=dfl2.T.plot(kind='bar', stacked=True, rot=1, figsize=(8, 8), title=title1)
ax.set_xticklabels(ax.get_xticklabels(), rotation=45)
ax.legend(bbox_to_anchor=(1.0, 1.0),title = 'Fare Range')
ax.set_ylabel('% Dist')
for p in ax.patches:
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy()
    ax.text(x+width/2, y+height/2, '{:.0f}%'.format(height), horizontalalignment='center', verticalalignment='center')
It produces the following stacked percent bar chart, which shows the % of total distribution by embark town.
I also want to show the survival rate along with the %distribution in each block. For example, for Queenstown, fare range 1 (7.6, 7.9], the % total distribution is 56%. I want to display the survival rate 37.21% as (56%, 37.21%). I am not able to figure it out. Kindly offer any suggestions. Thanks.
Here is the output summary table for reference:
       fare_ord_grp  embark_town  non_events  events  total  total_by_xdim  non_events_total_by_xdim  events_total_by_xdim  survival rate %  % event dist by xdim  % non-event dist by xdim  % total dist by xdim
0      0 (-0.1,7.6]    Cherbourg          22       7     29            168                        75                    93            24.14                  7.53                     29.33                 17.26
1      0 (-0.1,7.6]   Queenstown           4     NaN      4             77                        47                    30              NaN                   NaN                      8.51                  5.19
2      0 (-0.1,7.6]  Southampton          53       6     59            644                       427                   217            10.17                  2.76                     12.41                  9.16
3       1 (7.6,7.9]   Queenstown          27      16     43             77                        47                    30            37.21                 53.33                     57.45                 55.84
4       1 (7.6,7.9]  Southampton          34      10     44            644                       427                   217            22.73                  4.61                      7.96                  6.83
5         2 (7.9,8]    Cherbourg           4       1      5            168                        75                    93            20.00                  1.08                      5.33                  2.98
6         2 (7.9,8]  Southampton          83      13     96            644                       427                   217            13.54                  5.99                     19.44                 14.91
7      3 (8.0,10.5]    Cherbourg           2       1      3            168                        75                    93            33.33                  1.08                      2.67                  1.79
8      3 (8.0,10.5]   Queenstown           2     NaN      2             77                        47                    30              NaN                   NaN                      4.26                  2.60
9      3 (8.0,10.5]  Southampton          56      17     73            644                       427                   217            23.29                  7.83                     13.11                 11.34
10    4 (10.5,14.5]    Cherbourg           7       8     15            168                        75                    93            53.33                  8.60                      9.33                  8.93
11    4 (10.5,14.5]   Queenstown           1       2      3             77                        47                    30            66.67                  6.67                      2.13                  3.90
12    4 (10.5,14.5]  Southampton          40      26     66            644                       427                   217            39.39                 11.98                      9.37                 10.25
13    5 (14.5,21.7]    Cherbourg           9      10     19            168                        75                    93            52.63                 10.75                     12.00                 11.31
14    5 (14.5,21.7]   Queenstown           5       3      8             77                        47                    30            37.50                 10.00                     10.64                 10.39
15    5 (14.5,21.7]  Southampton          37      24     61            644                       427                   217            39.34                 11.06                      8.67                  9.47
16      6 (21.7,27]    Cherbourg           1       4      5            168                        75                    93            80.00                  4.30                      1.33                  2.98
17      6 (21.7,27]   Queenstown           2       3      5             77                        47                    30            60.00                 10.00                      4.26                  6.49
18      6 (21.7,27]  Southampton          40      39     79            644                       427                   217            49.37                 17.97                      9.37                 12.27
19    7 (27.0,39.7]    Cherbourg          14      10     24            168                        75                    93            41.67                 10.75                     18.67                 14.29
20    7 (27.0,39.7]   Queenstown           5     NaN      5             77                        47                    30              NaN                   NaN                     10.64                  6.49
21    7 (27.0,39.7]  Southampton          38      24     62            644                       427                   217            38.71                 11.06                      8.90                  9.63
22      8 (39.7,78]    Cherbourg           5      19     24            168                        75                    93            79.17                 20.43                      6.67                 14.29
23      8 (39.7,78]  Southampton          37      28     65            644                       427                   217            43.08                 12.90                      8.67                 10.09
24   9 (78.0,512.3]    Cherbourg          11      33     44            168                        75                    93            75.00                 35.48                     14.67                 26.19
25   9 (78.0,512.3]   Queenstown           1       1      2             77                        47                    30            50.00                  3.33                      2.13                  2.60
26   9 (78.0,512.3]  Southampton           9      30     39            644                       427                   217            76.92                 13.82                      2.11                  6.06
27        2 (7.9,8]   Queenstown         NaN       5      5             77                        47                    30           100.00                 16.67                       NaN                  6.49
28   9 (78.0,512.3]      Missing         NaN       2      2              2                       NaN                     2           100.00                100.00                       NaN                100.00
dfl2.T is being plotted, but 'survival rate %' is in result. As such, the indices for the values from dfl2.T do not correspond with 'survival rate %'.
Because the values in result['% total dist by xdim'] are not unique, we can't use a dict of matched key-values.
Create a corresponding pivoted DataFrame for 'survival rate %', and then flatten it. All of the values will be in the same order as the '% total dist by xdim' values from dfl2.T. As such, they can be indexed.
With respect to dfl2.T, the plot API plots in column order, which means .flatten(order='F') must be used to flatten the array in the correct order to be indexed.
# create a corresponding pivoted dataframe for survival rate %
dfl3 = pd.melt(result, id_vars=[ydim, xdim],value_vars =['survival rate %'], var_name = 'Type',value_name=value_name1).drop(columns='Type')
dfl4 = dfl3.pivot(index=ydim, columns=xdim, values=value_name1)
# flatten dfl4.T in column order
dfl4_flattened = dfl4.T.to_numpy().flatten(order='F')
for i, p in enumerate(ax.patches):
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy()
    # only print values when height is not 0
    if height != 0:
        # create the text string
        text = f'{height:.0f}%, {dfl4_flattened[i]:.0f}%'
        # annotate the bar segments
        ax.text(x+width/2, y+height/2, text, horizontalalignment='center', verticalalignment='center')
Notes
Here we can see dfl2.T and dfl4.T
# dfl2.T
fare_ord_grp 0 (-0.1, 7.6] 1 (7.6, 7.9] 2 (7.9, 8.0] 3 (8.0, 10.5] 4 (10.5, 14.5] 5 (14.5, 21.7] 6 (21.7, 27.0] 7 (27.0, 39.7] 8 (39.7, 78.0] 9 (78.0, 512.3]
embark_town
Cherbourg 17.26 NaN 2.98 1.79 8.93 11.31 2.98 14.29 14.29 26.19
Missing NaN NaN NaN NaN NaN NaN NaN NaN NaN 100.00
Queenstown 5.19 55.84 6.49 2.60 3.90 10.39 6.49 6.49 NaN 2.60
Southampton 9.16 6.83 14.91 11.34 10.25 9.47 12.27 9.63 10.09 6.06
# dfl4.T
fare_ord_grp 0 (-0.1, 7.6] 1 (7.6, 7.9] 2 (7.9, 8.0] 3 (8.0, 10.5] 4 (10.5, 14.5] 5 (14.5, 21.7] 6 (21.7, 27.0] 7 (27.0, 39.7] 8 (39.7, 78.0] 9 (78.0, 512.3]
embark_town
Cherbourg 24.14 NaN 20.00 33.33 53.33 52.63 80.00 41.67 79.17 75.00
Missing NaN NaN NaN NaN NaN NaN NaN NaN NaN 100.00
Queenstown NaN 37.21 100.00 NaN 66.67 37.50 60.00 NaN NaN 50.00
Southampton 10.17 22.73 13.54 23.29 39.39 39.34 49.37 38.71 43.08 76.92
I have a python dataframe and some columns refer to repeated samples as below:
In [3]: df = pd.DataFrame(
...: [[89, 89, 12, 34, 32],
...: [788, 25, 55, 65, 55],
...: [588, 23, 58, 8, 55],
...: [25, 14, 45, 123, 58]],
...: columns = ['sample1','sample2.1','sample2.2','sample3','sample4'],
...: )
In [4]: df
sample1 sample2.1 sample2.2 sample3 sample4
0 89 89 12 34 32
1 788 25 55 65 55
2 588 23 58 8 55
3 25 14 45 123 58
for the repeated samples, sample2.1 and sample2.2, I want to remain with an average of the two, i.e
sample1 sample2_averaged sample3 sample4
0 89 50.5 34 32
1 788 40.0 65 55
2 588 40.5 8 55
3 25 29.5 123 58
I am thinking of using regex, but I have never used them with pandas dataframes.
You can group by columns if you provide axis=1, e.g.:
>>> df.groupby(df.columns.str.replace(r'\..+', '', regex=True), axis=1).mean()
sample1 sample2 sample3 sample4
0 89.0 50.5 34.0 32.0
1 788.0 40.0 65.0 55.0
2 588.0 40.5 8.0 55.0
3 25.0 29.5 123.0 58.0
Pandas columns and indices can use the pandas.Series.str string accessor methods, including regex.
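A small, self-contained illustration of the same idea (the transpose route below avoids axis=1 grouping, which is deprecated in recent pandas; the column names mirror the question's sample frame):

```python
import pandas as pd

df = pd.DataFrame(
    [[89, 89, 12, 34, 32],
     [788, 25, 55, 65, 55]],
    columns=['sample1', 'sample2.1', 'sample2.2', 'sample3', 'sample4'],
)

# strip the ".n" replicate suffix from each column name via the .str accessor
base = df.columns.str.replace(r'\..+', '', regex=True)

# group the transposed frame by the base names, average, transpose back
out = df.T.groupby(base).mean().T
```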
I would do:
(df.T.groupby(df.columns.str.extract(r'^([^.]+)')[0].values)
   .mean().T
)
Output:
sample1 sample2 sample3 sample4
0 89.0 50.5 34.0 32.0
1 788.0 40.0 65.0 55.0
2 588.0 40.5 8.0 55.0
3 25.0 29.5 123.0 58.0
Try:
import re
from itertools import groupby
res = pd.DataFrame(index=df.index, columns=[])
for k, v in groupby(df.columns, key=lambda el: re.sub(r"\.[^.]+$", "", el)):
    v = list(v)
    if len(v) == 1:
        res[k] = df[v[0]]
    else:
        res[k] = df[v].mean(axis=1)
Outputs:
>>> res
sample1 sample2 sample3 sample4
0 89 50.5 34 32
1 788 40.0 65 55
2 588 40.5 8 55
3 25 29.5 123 58
I would like to apply a function to one pandas dataframe column which does the following task:
I have a cycle counter that starts from a value but sometimes restarts.
I would like to have the counter continue and increase its value.
The function I use at the moment is the following one:
Code
import pandas as pd
d = {'Cycle':[100,100,100,100,101,101,101,102,102,102,102,102,102,103,103,103,100,100,100,100,101,101,101,101]}
df = pd.DataFrame(data=d)
df.loc[:,'counter'] = df['Cycle'].to_numpy()
df.loc[:,'counter'] = df['counter'].rolling(2).apply(lambda x: x[0] if (x[0] == x[1]) else x[0]+1, raw=True)
print(df)
Output
Cycle counter
0 100 NaN
1 100 100.0
2 100 100.0
3 100 100.0
4 101 101.0
5 101 101.0
6 101 101.0
7 102 102.0
8 102 102.0
9 102 102.0
10 102 102.0
11 102 102.0
12 102 102.0
13 103 103.0
14 103 103.0
15 103 103.0
16 100 104.0
17 100 100.0
18 100 100.0
19 100 100.0
20 101 101.0
21 101 101.0
22 101 101.0
23 101 101.0
My goal is to get a dataframe similar to this one:
Cycle counter
0 100 NaN
1 100 100.0
2 100 100.0
3 100 100.0
4 101 101.0
5 101 101.0
6 101 101.0
7 102 102.0
8 102 102.0
9 102 102.0
10 102 102.0
11 102 102.0
12 102 102.0
13 103 103.0
14 103 103.0
15 103 103.0
16 100 104.0
17 100 104.0
18 100 104.0
19 100 104.0
20 101 105.0
21 101 105.0
22 101 105.0
23 101 105.0
How do I use the rolling function with one overlap?
Do you have any recommendation to reach my goal?
Best regards,
Matteo
Another approach would be to identify the points in the Cycle column where the value changes using .diff(). Then at those points increment from the original initial cycle value and merge to the original dataframe forward filling the new values.
df2 = df[df['Cycle'].diff().apply(lambda x: x!=0)].reset_index()
df2['Target Count'] = df[df['Cycle'].diff().apply(lambda x: x!=0)].reset_index().reset_index().apply(lambda x: df.iloc[0,0] + x['level_0'], axis = 1)
df = df.merge(df2.drop('Cycle', axis = 1), right_on = 'index', left_index = True, how = 'left').ffill().set_index('index', drop = True)
df.index.name = None
df
Cycle Target Count
0 100 100.0
1 100 100.0
2 100 100.0
3 100 100.0
4 101 101.0
5 101 101.0
6 101 101.0
7 102 102.0
8 102 102.0
9 102 102.0
10 102 102.0
11 102 102.0
12 102 102.0
13 103 103.0
14 103 103.0
15 103 103.0
16 100 104.0
17 100 104.0
18 100 104.0
19 100 104.0
20 101 105.0
21 101 105.0
22 101 105.0
23 101 105.0
We can use shift and ne (same as !=) to check where the Cycle column changes.
Then we use cumsum to make a counter which changes each time Cycle changes.
We add the first value of Cycle, minus 1, to the counter so that it starts at 100:
groups = df['Cycle'].ne(df['Cycle'].shift()).cumsum()
df['counter'] = groups + df['Cycle'].iat[0] - 1
Cycle counter
0 100 100
1 100 100
2 100 100
3 100 100
4 101 101
5 101 101
6 101 101
7 102 102
8 102 102
9 102 102
10 102 102
11 102 102
12 102 102
13 103 103
14 103 103
15 103 103
16 100 104
17 100 104
18 100 104
19 100 104
20 101 105
21 101 105
22 101 105
23 101 105
Details: groups gives us a counter starting at 1:
print(groups)
0 1
1 1
2 1
3 1
4 2
5 2
6 2
7 3
8 3
9 3
10 3
11 3
12 3
13 4
14 4
15 4
16 5
17 5
18 5
19 5
20 6
21 6
22 6
23 6
Name: Cycle, dtype: int64
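The two lines above can be checked end-to-end with a self-contained version (same data as the question; the variable names are arbitrary):

```python
import pandas as pd

cycle = pd.Series([100, 100, 100, 100, 101, 101, 101, 102, 102, 102, 102, 102, 102,
                   103, 103, 103, 100, 100, 100, 100, 101, 101, 101, 101])

# 1-based group id that increments whenever the value changes
groups = cycle.ne(cycle.shift()).cumsum()
# shift the id so the counter starts at the first cycle value
counter = groups + cycle.iat[0] - 1
```

After the restart at row 16 the counter keeps increasing (104, then 105 for the final block), matching the desired output.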
I have the following MultiIndex dataframe.
Close ATR
Date Symbol
1990-01-01 A 24 2
1990-01-01 B 72 7
1990-01-01 C 40 3.4
1990-01-02 A 21 1.5
1990-01-02 B 65 6
1990-01-02 C 45 4.2
1990-01-03 A 19 2.5
1990-01-03 B 70 6.3
1990-01-03 C 51 5
I want to calculate three columns:
Shares = previous day's Equity * 0.02 / ATR, rounded down to whole number
Profit = Shares * Close
Equity = previous day's Equity + sum of Profit for each Symbol
Equity has an initial value of 10,000.
The expected output is:
Close ATR Shares Profit Equity
Date Symbol
1990-01-01 A 24 2 0 0 10000
1990-01-01 B 72 7 0 0 10000
1990-01-01 C 40 3.4 0 0 10000
1990-01-02 A 21 1.5 133 2793 17053
1990-01-02 B 65 6 33 2145 17053
1990-01-02 C 45 4.2 47 2115 17053
1990-01-03 A 19 2.5 136 2584 26885
1990-01-03 B 70 6.3 54 3780 26885
1990-01-03 C 51 5 68 3468 26885
I suppose I need a for loop or a function to be applied to each row. With these I have two issues. One is that I'm not sure how I can create a for loop for this logic in case of a MultiIndex dataframe. The second is that my dataframe is pretty large (something like 10 million rows) so I'm not sure if a for loop would be a good idea. But then how can I create these columns?
This solution can surely be cleaned up, but will produce your desired output. I've included your initial conditions in the construction of your sample dataframe:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Date': ['1990-01-01','1990-01-01','1990-01-01','1990-01-02','1990-01-02','1990-01-02','1990-01-03','1990-01-03','1990-01-03'],
'Symbol': ['A','B','C','A','B','C','A','B','C'],
'Close': [24, 72, 40, 21, 65, 45, 19, 70, 51],
'ATR': [2, 7, 3.4, 1.5, 6, 4.2, 2.5, 6.3, 5],
'Shares': [0, 0, 0, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
'Profit': [0, 0, 0, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan]})
Gives:
Date Symbol Close ATR Shares Profit
0 1990-01-01 A 24 2.0 0.0 0.0
1 1990-01-01 B 72 7.0 0.0 0.0
2 1990-01-01 C 40 3.4 0.0 0.0
3 1990-01-02 A 21 1.5 NaN NaN
4 1990-01-02 B 65 6.0 NaN NaN
5 1990-01-02 C 45 4.2 NaN NaN
6 1990-01-03 A 19 2.5 NaN NaN
7 1990-01-03 B 70 6.3 NaN NaN
8 1990-01-03 C 51 5.0 NaN NaN
Then use groupby() with apply() and track your Equity globally. Took me a second to realize that the nature of this problem requires you to group on two separate columns individually (Symbol and Date):
start = 10000
Equity = 10000

def calcs(x):
    global Equity
    if x.index[0] == 0:
        return x  # skip the first date group
    x['Shares'] = np.floor(Equity * 0.02 / x['ATR'])
    x['Profit'] = x['Shares'] * x['Close']
    Equity += x['Profit'].sum()
    return x
df = df.groupby('Date').apply(calcs)
df['Equity'] = df.groupby('Date')['Profit'].transform('sum')
df['Equity'] = df.groupby('Symbol')['Equity'].cumsum()+start
This yields:
Date Symbol Close ATR Shares Profit Equity
0 1990-01-01 A 24 2.0 0.0 0.0 10000.0
1 1990-01-01 B 72 7.0 0.0 0.0 10000.0
2 1990-01-01 C 40 3.4 0.0 0.0 10000.0
3 1990-01-02 A 21 1.5 133.0 2793.0 17053.0
4 1990-01-02 B 65 6.0 33.0 2145.0 17053.0
5 1990-01-02 C 45 4.2 47.0 2115.0 17053.0
6 1990-01-03 A 19 2.5 136.0 2584.0 26885.0
7 1990-01-03 B 70 6.3 54.0 3780.0 26885.0
8 1990-01-03 C 51 5.0 68.0 3468.0 26885.0
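If the global is undesirable, an alternative sketch (not the code above) is a plain loop over date groups that carries the equity forward explicitly; since the dependency is only between consecutive dates, one pass over the date groups is enough:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Date': ['1990-01-01'] * 3 + ['1990-01-02'] * 3 + ['1990-01-03'] * 3,
                   'Symbol': ['A', 'B', 'C'] * 3,
                   'Close': [24, 72, 40, 21, 65, 45, 19, 70, 51],
                   'ATR': [2, 7, 3.4, 1.5, 6, 4.2, 2.5, 6.3, 5]})

equity = 10000.0
shares, profit, equity_col = [], [], []
first_date = df['Date'].iloc[0]
for date, g in df.groupby('Date', sort=True):
    if date == first_date:
        s = np.zeros(len(g))  # no position on the first day
        p = np.zeros(len(g))
    else:
        s = np.floor(equity * 0.02 / g['ATR'].to_numpy())  # previous day's equity
        p = s * g['Close'].to_numpy()
    equity += p.sum()  # roll the day's profit into equity
    shares.extend(s)
    profit.extend(p)
    equity_col.extend([equity] * len(g))

df['Shares'], df['Profit'], df['Equity'] = shares, profit, equity_col
```

This assumes the frame is sorted by date, with each date's rows contiguous, as in the sample.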
You can try using shift and groupby. Once you have the previous row's value, the remaining column operations are straightforward.
table2['previous'] = table2['close'].groupby('symbol').shift(1)
table2
date symbol close atr previous
1990-01-01 A 24 2 NaN
B 72 7 NaN
C 40 3.4 NaN
1990-01-02 A 21 1.5 24
B 65 6 72
C 45 4.2 40
1990-01-03 A 19 2.5 21
B 70 6.3 65
C 51 5 45
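A runnable sketch of that idea, assuming the MultiIndex levels are named date and symbol (grouping by the level name explicitly, which is what the Series groupby above resolves to):

```python
import pandas as pd

idx = pd.MultiIndex.from_product(
    [['1990-01-01', '1990-01-02', '1990-01-03'], ['A', 'B', 'C']],
    names=['date', 'symbol'])
table2 = pd.DataFrame({'close': [24, 72, 40, 21, 65, 45, 19, 70, 51],
                       'atr': [2, 7, 3.4, 1.5, 6, 4.2, 2.5, 6.3, 5]},
                      index=idx)

# previous day's close for each symbol
table2['previous'] = table2['close'].groupby(level='symbol').shift(1)
```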