I have this dataframe:
OP1 OP2 OP3 OP4 OP5 OP6 OP7 OP8 OP9 OP10 Total
Simulation1 NaN 0.0 471294.2 692828.5 1107766.9 1580052.7 2452123.5 4374088.4 4545222.2 4764249.9 19987626.3
OP1 OP2 OP3 OP4 OP5 OP6 OP7 OP8 OP9 OP10 Total
Simulation1 NaN 0.0 333833.4 573533.5 948961.2 1343783.2 2354595.9 4061858.2 4348907.9 4769410.1 18734883.4
OP1 OP2 OP3 OP4 OP5 OP6 OP7 OP8 OP9 OP10 Total
Simulation1 NaN 0.0 441838.4 660710.6 976074.4 1391775.4 2002799.8 3921497.8 3708159.7 3852268.8 16955124.9
I need the row names to look like this:
OP1 OP2 OP3 OP4 OP5 OP6 OP7 OP8 OP9 OP10 Total
Simulation1 NaN 0.0 471294.2 692828.5 1107766.9 1580052.7 2452123.5 4374088.4 4545222.2 4764249.9 19987626.3
Simulation2 NaN 0.0 333833.4 573533.5 948961.2 1343783.2 2354595.9 4061858.2 4348907.9 4769410.1 18734883.4
Simulation3 NaN 0.0 441838.4 660710.6 976074.4 1391775.4 2002799.8 3921497.8 3708159.7 3852268.8 16955124.9
...
and so on. Here I have to increment the row name: Simulation1, Simulation2, and so on.
I have this code:
simulationDf = pd.DataFrame(columns=['OP1','OP2','OP3','OP4','OP5','OP6','OP7','OP8','OP9','OP10','Total'])
simulationDf.loc['Simulation1'] = ultiCalc['Reserves']
It looks like those labels are your index (in the dataframe you posted in the second code block), so you can rebuild it with the .index attribute and a list comprehension:
df.index = ['Simulation' + str(x) for x in range(1, len(df) + 1)]
Now if you print df you will get your desired output:
OP1 OP2 OP3 OP4 OP5 OP6 OP7 OP8 OP9 OP10 Total
Simulation1 NaN 0.0 471294.2 692828.5 1107766.9 1580052.7 2452123.5 4374088.4 4545222.2 4764249.9 19987626.3
Simulation2 NaN 0.0 333833.4 573533.5 948961.2 1343783.2 2354595.9 4061858.2 4348907.9 4769410.1 18734883.4
Simulation3 NaN 0.0 441838.4 660710.6 976074.4 1391775.4 2002799.8 3921497.8 3708159.7 3852268.8 16955124.9
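If you prefer f-strings, an equivalent spelling (my variant, same result):
df.index = [f'Simulation{i}' for i in range(1, len(df) + 1)]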
Related
I have the output below. The code that produces it runs inside a while loop; if I enter 3, it creates 3 different dataframes with different values.
Enter the Number to iter:3
Paid CDF Ultimate Reserves
OP1 3901463.0 NaN NaN NaN
OP2 5339085.0 1.000000 5339085.000 0.0
OP3 4909315.0 1.085767 5331516.090 422201.1
OP4 4588268.0 1.136096 5212272.448 624004.4
OP5 3873311.0 1.238680 4799032.329 925721.3
OP6 3691712.0 1.350145 4983811.200 1292099.2
OP7 3483130.0 1.602168 5579974.260 2096844.3
OP8 2864498.0 2.334476 6685738.332 3821240.3
OP9 1363294.0 4.237940 5777639.972 4414346.0
OP10 344014.0 15.204053 5230388.856 4886374.9
Total NaN NaN NaN 18482831.5
Paid CDF Ultimate Reserves
OP1 3901463.0 NaN NaN NaN
OP2 5339085.0 1.000000 5339085.000 0.0
OP3 4909315.0 1.090000 5351153.350 441838.4
OP4 4588268.0 1.137559 5221448.984 633181.0
OP5 3873311.0 1.231368 4768045.841 894734.8
OP6 3691712.0 1.331933 4917360.384 1225648.4
OP7 3483130.0 1.563703 5447615.320 1964485.3
OP8 2864498.0 2.318600 6642770.862 3778272.9
OP9 1363294.0 4.234960 5773550.090 4410256.1
OP10 344014.0 16.958969 5834133.426 5490119.4
Total NaN NaN NaN 18838536.3
Paid CDF Ultimate Reserves
OP1 3901463.0 NaN NaN NaN
OP2 5339085.0 1.000000 5339085.000 0.0
OP3 4909315.0 1.072698 5267694.995 358380.0
OP4 4588268.0 1.130229 5184742.840 596474.8
OP5 3873311.0 1.208164 4678959.688 805648.7
OP6 3691712.0 1.267187 4677399.104 985687.1
OP7 3483130.0 1.497767 5217728.740 1734598.7
OP8 2864498.0 2.229342 6384966.042 3520468.0
OP9 1363294.0 4.219405 5751737.386 4388443.4
OP10 344014.0 16.036065 5516608.504 5172594.5
Total NaN NaN NaN 17562295.2
Using the Reserves column above, I have to generate the dataframe below, with row names Simulation1, Simulation2, and so on, up to the number of Reserves columns produced from the user input.
OP1 OP2 OP3 OP4 OP5 OP6 OP7 OP8 OP9 OP10 Total
Simulation1 NaN 0.0 422201.1 624004.4 925721.3 1292099.2 2096844.3 3821240.3 4414346.0 4886374.9 18482831.5
Simulation2 NaN 0.0 441838.4 633181.0 894734.8 1225648.4 1964485.3 3778272.9 4410256.1 5490119.4 18838536.3
Simulation3 NaN 0.0 358380.0 596474.8 805648.7 985687.1 1734598.7 3520468.0 4388443.4 5172594.5 17562295.2
I have the code below:
itno = int(input("Enter the Number to iter:"))
iter_count = 0
while (iter_count < itno):
    randomdf = scaledDF.copy()
    choices = randomdf.values[~pd.isnull(randomdf.values)]
    randomdf = randomdf.applymap(lambda x: np.random.choice(choices) if not pd.isnull(x) else x)
    #print(randomdf,"\n\n")
    cumdf = CumulativePaidTriangledf.iloc[:, :-1][:-2].copy()
    ldfdf = LDFTriangledf.iloc[:, :-1].copy()
    ResampledDF = randomdf.copy()
    for colname4, col4 in ResampledDF.iteritems():
        ResampledDF[colname4] = (ResampledDF[colname4] * (Variencedf[colname4][-1] / (cumdf[colname4] ** 0.5))) + ldfdf[colname4][-1]
    #print(ResampledDF,"\n\n")
    #SUMPRODUCT:
    sumPro = ResampledDF.copy()
    #cumdf = cumdf.iloc[:, :-1]
    for colname5, col5 in sumPro.iteritems():
        sumPro[colname5] = (sumPro[colname5].round(2)) * cumdf[colname5]
    sumPro = sumPro.append(pd.Series(sumPro.sum(), name='SUMPRODUCT'))
    #print(sumPro)
    #SUM(OFFSET):
    sumOff = cumdf.apply(lambda x: x.iloc[:cumdf.index.get_loc(x.last_valid_index())].sum())
    #print(sumOff)
    #Weighted avg:
    Weighted_avg = sumPro.loc['SUMPRODUCT'] / sumOff
    #print(Weighted_avg)
    ResampledDF = ResampledDF.append(pd.Series(Weighted_avg, name='Weighted Avg'))
    #print(ResampledDF,"\n\n")
    '''for colname6, col6 in ResampledDF.iteritems():
        ResampledDF[colname6] = ResampledDF[colname6].replace({'0': np.nan, 0: np.nan})
        print(ResampledDF)'''
    ResampledDF.loc['Weighted Avg'] = ResampledDF.loc['Weighted Avg'].replace(0, 1)
    c = ResampledDF.iloc[-1][::-1].cumprod()[::-1]
    ResampledDF = ResampledDF.append(pd.Series(c, name='CDF'))
    #print("\n\n",ResampledDF)
    #Getting Calculation of ultimates:
    s = CumulativePaidTriangledf.iloc[:, :][:-2].copy()
    ultiCalc = pd.DataFrame()
    ultiCalc['Paid'] = s['Total']
    ultiCalc['CDF'] = np.flip(ResampledDF.loc['CDF'].values)
    ultiCalc['Ultimate'] = ultiCalc['Paid'] * (ultiCalc['CDF']).round(3)
    ultiCalc['Reserves'] = (ultiCalc['Ultimate'].round(1)) - ultiCalc['Paid']
    ultiCalc.loc['Total'] = pd.Series(ultiCalc['Reserves'].sum(), index=['Reserves']).round(2)
    print("\n\n", ultiCalc)
    iter_count += 1

#Getting Simulations:
simulationDf = pd.DataFrame(columns=['OP1','OP2','OP3','OP4','OP5','OP6','OP7','OP8','OP9','OP10','Total'])
simulationDf.loc['Simulation'] = ultiCalc['Reserves']
print("\n\n", simulationDf)
Current Output:
Simulation1 NaN
Simulation2 0.0
Simulation3 353470.7
Simulation4 559768.7
Simulation5 859875.0
Simulation6 1162889.3
Simulation7 1828643.2
Simulation8 3958736.2
Simulation9 4464787.9
Simulation10 5224196.6
Simulation11 18412367.6
Simulation12 NaN
Simulation13 0.0
Simulation14 402563.8
Simulation15 669887.1
Simulation16 883114.9
Simulation17 1185039.6
Simulation18 1859991.4
Simulation19 3511874.5
Simulation20 3875844.8
Simulation21 4481126.4
Simulation22 16869442.5
Use a list comprehension to loop over the list of DataFrames and select the Reserves column from each, join them together with the DataFrame constructor, and finally set the index if necessary:
dfs = [df1, df2, df3]
df = pd.DataFrame([x['Reserves'] for x in dfs]).reset_index(drop=True)
df.index = 'Simulation' + (df.index + 1).astype(str)
If the DataFrames are generated in some loop, like this pseudocode:
# create list outside loop
dfs = []
iter_count = 0
while (iter_count < itno):
    randomdf = scaledDF.copy()
    choices = randomdf.values[~pd.isnull(randomdf.values)]
    randomdf = randomdf.applymap(lambda x: np.random.choice(choices) if not pd.isnull(x) else x)
    #print(randomdf,"\n\n")
    ....
    ....
    #Getting Calculation of ultimates:
    s = CumulativePaidTriangledf.iloc[:, :][:-2].copy()
    ultiCalc = pd.DataFrame()
    ultiCalc['Paid'] = s['Total']
    ultiCalc['CDF'] = np.flip(ResampledDF.loc['CDF'].values)
    ultiCalc['Ultimate'] = ultiCalc['Paid'] * (ultiCalc['CDF']).round(3)
    ultiCalc['Reserves'] = (ultiCalc['Ultimate'].round(1)) - ultiCalc['Paid']
    ultiCalc.loc['Total'] = pd.Series(ultiCalc['Reserves'].sum(), index=['Reserves']).round(2)
    print("\n\n", ultiCalc)
    iter_count += 1
    # append in loop
    dfs.append(ultiCalc['Reserves'])
And then, outside the loop, join them together:
df = pd.DataFrame([x for x in dfs]).reset_index(drop=True)
df.index = 'Simulation' + (df.index + 1).astype(str)
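As a minimal runnable sketch of the whole pattern (the two Reserves series here are dummy stand-ins built from the numbers above, not the real ultiCalc output):
import numpy as np
import pandas as pd

idx = ['OP1', 'OP2', 'OP3', 'Total']
# dummy stand-ins for ultiCalc['Reserves'] collected over two loop iterations
dfs = [pd.Series([np.nan, 0.0, 471294.2, 19987626.3], index=idx),
       pd.Series([np.nan, 0.0, 333833.4, 18734883.4], index=idx)]

# each Series becomes one row; its index (OP1..Total) becomes the columns
df = pd.DataFrame([x for x in dfs]).reset_index(drop=True)
df.index = 'Simulation' + (df.index + 1).astype(str)
print(df)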
I want to go through all the columns of the dataframe, so that I can pull a particular piece of data from each column and use it to calculate values for another dataframe.
Here I have:
DP1 DP2 DP3 DP4 DP5 DP6 DP7 DP8 DP9 DP10 Total
OP1 357848.0 1124788.0 1735330.0 2218270.0 2745596.0 3319994.0 3466336.0 3606286.0 3833515.0 3901463.0 3901463.0
OP2 352118.0 1236139.0 2170033.0 3353322.0 3799067.0 4120063.0 4647867.0 4914039.0 5339085.0 NaN 5339085.0
OP3 290507.0 1292306.0 2218525.0 3235179.0 3985995.0 4132918.0 4628910.0 4909315.0 NaN NaN 4909315.0
OP4 310608.0 1418858.0 2195047.0 3757447.0 4029929.0 4381982.0 4588268.0 NaN NaN NaN 4588268.0
OP5 443160.0 1136350.0 2128333.0 2897821.0 3402672.0 3873311.0 NaN NaN NaN NaN 3873311.0
OP6 396132.0 1333217.0 2180715.0 2985752.0 3691712.0 NaN NaN NaN NaN NaN 3691712.0
OP7 440832.0 1288463.0 2419861.0 3483130.0 NaN NaN NaN NaN NaN NaN 3483130.0
OP8 359480.0 1421128.0 2864498.0 NaN NaN NaN NaN NaN NaN NaN 2864498.0
OP9 376686.0 1363294.0 NaN NaN NaN NaN NaN NaN NaN NaN 1363294.0
OP10 344014.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN 344014.0
Total 3671385.0 11614543.0 17912342.0 21930921.0 21654971.0 19828268.0 17331381.0 13429640.0 9172600.0 3901463.0 34358090.0
Latest Observation 344014.0 1363294.0 2864498.0 3483130.0 3691712.0 3873311.0 4588268.0 4909315.0 5339085.0 3901463.0 NaN
From this table I would like to calculate this formula: for column DP1, take DP2's Total and divide it by (DP1's Total minus DP1's Latest Observation). We have to calculate this for all the columns and save the result in another dataframe.
We need a row like this:
Weighted Average 3.491 1.747 1.457 1.174 1.104 1.086 1.054 1.077 1.018
This is the code we tried:
LDFTriangledf['Weighted Average'] =CumulativePaidTriangledf.loc['Total','DP2']/(CumulativePaidTriangledf.loc['Total','DP1'] - CumulativePaidTriangledf.loc['Latest Observation','DP1'])
You can remove the column names from .loc and just shift(-1, axis=1) to get the next column's Total. This lets you apply the formula to all columns in a single operation:
CumulativePaidTriangledf.shift(-1, axis=1).loc['Total'] / (CumulativePaidTriangledf.loc['Total'] - CumulativePaidTriangledf.loc['Latest Observation'])
# DP1 3.490607
# DP2 1.747333
# DP3 1.457413
# DP4 1.173852
# DP5 1.103824
# DP6 1.086269
# DP7 1.053874
# DP8 1.076555
# DP9 1.017725
# DP10 inf
# Total NaN
# dtype: float64
Here is a breakdown of what the three components are doing:
A: .shift(-1, axis=1).loc['Total'] -- We are shifting the whole Total row to the left, so every column now holds the next column's Total value.
B: .loc['Total'] -- This is the normal Total row.
C: .loc['Latest Observation'] -- This is the normal Latest Observation row.
A / (B-C) -- This is what the code above does. It takes the shifted Total row (A) and divides it by the difference of the current Total row (B) and the current Latest Observation row (C).

       A (shifted Total)   B (Total)        C (Latest Obs)   A / (B-C)
DP1    1.161454e+07        3.671385e+06     3.440140e+05     3.490607
DP2    1.791234e+07        1.161454e+07     1.363294e+06     1.747333
DP3    2.193092e+07        1.791234e+07     2.864498e+06     1.457413
DP4    2.165497e+07        2.193092e+07     3.483130e+06     1.173852
DP5    1.982827e+07        2.165497e+07     3.691712e+06     1.103824
DP6    1.733138e+07        1.982827e+07     3.873311e+06     1.086269
DP7    1.342964e+07        1.733138e+07     4.588268e+06     1.053874
DP8    9.172600e+06        1.342964e+07     4.909315e+06     1.076555
DP9    3.901463e+06        9.172600e+06     5.339085e+06     1.017725
DP10   34358090.0          3901463.0        3901463.0        inf
Total  NaN                 34358090.0       NaN              NaN
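If the goal is to store this as the Weighted Average row of LDFTriangledf, a sketch along these lines should work (where the result goes is my assumption; the last two entries are sliced off because DP10 divides by zero and Total has no next column):
ratios = (CumulativePaidTriangledf.shift(-1, axis=1).loc['Total']
          / (CumulativePaidTriangledf.loc['Total']
             - CumulativePaidTriangledf.loc['Latest Observation']))
# drop the inf (DP10) and NaN (Total) tail, round to match the expected row
LDFTriangledf.loc['Weighted Average'] = ratios.iloc[:-2].round(3)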
I have a DataFrame of votes and I would like to create one of preferences.
For example, here is the number of votes for each party P1, P2, P3 in each city comm1, comm2, ...
Comm Votes P1 P2 P3
0 comm1 1315.0 2.0 424.0 572.0
1 comm2 4682.0 117.0 2053.0 1584.0
2 comm3 2397.0 2.0 40.0 192.0
3 comm4 931.0 2.0 12.0 345.0
4 comm5 842.0 47.0 209.0 76.0
... ... ... ... ... ...
1524 comm1525 10477.0 13.0 673.0 333.0
1525 comm1526 2674.0 1.0 55.0 194.0
1526 comm1527 1691.0 331.0 29.0 78.0
These electoral results would suffice for a first-past-the-post ballot system, but I would like to test an alternative election model, so for each political party I need to get the preferences.
As I don't know the preferences, I want to generate them with random numbers, supposing that voters are honest. For example, for the "P1" party in town "comm1", we know that 2 people voted for it and that there are 1315 voters. I need to create preferences to see whether people would put it as their first, second or third option. That is to say, for each party:
Comm Votes P1_1 P1_2 P1_3 P2_1 P2_2 P2_3 P3_1 P3_2 P3_3
0 comm1 1315.0 2.0 1011.0 303.0 424.0 881.0 10.0 570.0 1.0 1.0
... ... ... ... ... ...
1526 comm1527 1691.0 331.0 1300.0 60.0 299.0 22.0 10.0 ...
So I have to do:
# for each column in parties I create (parties - 1) other columns
# I rename them all Party_i. The former 1 becomes Party_1.
# In the other columns I put a random number.
# For a given line, the sum of all Party_i for i in [1, parties] must be equal to Votes
I tried this so far:
parties = [item for item in df.columns if item not in ['Comm', 'Votes']]
for index, row in df_test.iterrows():
    # In the other columns I put a random number.
    for party in parties:
        # for each column in parties I create (parties - 1) other columns
        for i in range(0, len(parties) - 1):
            print(random.randrange(0, row['Votes']))
            # I rename them all Party_i. The former 1 becomes Party_1.
            row["{party}_{preference}".format(party=party, preference=i)] = random.randrange(0, row['Votes']) if (row[party] < row['Votes']) else 0  # false because the sum of the votes isn't = to df['Votes']
The results are:
Comm Votes ... P1_1 P1_2 P1_3 P2_1 P2_2 P2_3 P3_1 P3_2 P3_3
0 comm1 1315.0 ... 1003 460 1588 1284 1482 1613 1429 345
1 comm2 1691.0 ... 1003 460 1588 1284 1482 1613 ...
...
But:
the numbers are the same for each row
the value in the row of Pi_1 isn't equal to the one in the row of Pi (Pi being a given party)
the sum of Pi_j for all j in [1, parties] isn't equal to the number in the column Votes
Update
I tried Antihead's answer with his own data and it worked well. But when applying it to my own data it doesn't; it leaves me with an empty dataframe:
import collections

def fill_cells(cell):
    v_max = cell['Votes']
    all_dict = {}
    # iterate over parties
    for p in parties:
        tmp_l = parties.copy()
        tmp_l.remove(p)
        # sample new data with equal choices
        sampled = np.random.choice(tmp_l, int(v_max - cell[p]))
        # transform into dictionary
        c_sampled = dict(collections.Counter(sampled))
        c_sampled.update({p: cell[p]})
        # batch update of the dictionary keys
        all_dict.update(
            dict(zip([p + '_%s' % k[1] for k in c_sampled.keys()], c_sampled.values()))
        )
    return pd.Series(all_dict)
Indeed, with the following dataframe:
Comm Votes LPC CPC BQ
0 comm1 1315.0 2.0 424.0 572.0
1 comm2 4682.0 117.0 2053.0 1584.0
2 comm3 2397.0 2.0 40.0 192.0
3 comm4 931.0 2.0 12.0 345.0
4 comm5 842.0 47.0 209.0 76.0
... ... ... ... ... ...
1522 comm1523 23808.0 1588.0 4458.0 13147.0
1523 comm1524 639.0 40.0 126.0 40.0
1524 comm1525 10477.0 13.0 673.0 333.0
1525 comm1526 2674.0 1.0 55.0 194.0
1526 comm1527 1691.0 331.0 29.0 78.0
I have an empty dataframe:
0
1
2
3
4
...
1522
1523
1524
1525
1526
Does this work?
import collections
import numpy as np
import pandas as pd

# data
columns = ['Comm', 'Votes', 'P1', 'P2', 'P3']
data = [['comm1', 1315.0, 2.0, 424.0, 572.0],
        ['comm2', 4682.0, 117.0, 2053.0, 1584.0],
        ['comm3', 2397.0, 2.0, 40.0, 192.0],
        ['comm4', 931.0, 2.0, 12.0, 345.0],
        ['comm5', 842.0, 47.0, 209.0, 76.0],
        ['comm1525', 10477.0, 13.0, 673.0, 333.0],
        ['comm1526', 2674.0, 1.0, 55.0, 194.0],
        ['comm1527', 1691.0, 331.0, 29.0, 78.0]]
df = pd.DataFrame(data=data, columns=columns)

def fill_cells(cell):
    v_max = cell['Votes']
    all_dict = {}
    # iterate over parties
    for p in ['P1', 'P2', 'P3']:
        tmp_l = ['P1', 'P2', 'P3']
        tmp_l.remove(p)
        # sample new data with equal choices
        sampled = np.random.choice(tmp_l, int(v_max - cell[p]))
        # transform into dictionary
        c_sampled = dict(collections.Counter(sampled))
        c_sampled.update({p: cell[p]})
        # batch update of the dictionary keys (k[1] takes the digit after 'P' as the preference label)
        all_dict.update(
            dict(zip([p + '_%s' % k[1] for k in c_sampled.keys()], c_sampled.values()))
        )
    return pd.Series(all_dict)

# get back a data frame
df.apply(fill_cells, axis=1)
If you need to merge the data frame back, do something like:
new_df = df.apply(fill_cells, axis=1)
pd.concat([df, new_df], axis=1)
Based on Antihead's answer and for the following dataset:
Comm Votes LPC CPC BQ
0 comm1 1315.0 2.0 424.0 572.0
1 comm2 4682.0 117.0 2053.0 1584.0
2 comm3 2397.0 2.0 40.0 192.0
3 comm4 931.0 2.0 12.0 345.0
4 comm5 842.0 47.0 209.0 76.0
... ... ... ... ... ...
1522 comm1523 23808.0 1588.0 4458.0 13147.0
1523 comm1524 639.0 40.0 126.0 40.0
1524 comm1525 10477.0 13.0 673.0 333.0
1525 comm1526 2674.0 1.0 55.0 194.0
1526 comm1527 1691.0 331.0 29.0 78.0
I tried:
def fill_cells(cell):
    votes_max = cell['Votes']
    all_dict = {}
    # iterate over parties
    parties_temp = parties.copy()
    for p in parties_temp:
        preferences = ['1', '2', '3']
        for preference in preferences:
            preferences.remove(preference)
            # sample new data with equal choices
            sampled = np.random.choice(preferences, int(votes_max - cell[p]))
            # transform into dictionary
            c_sampled = dict(collections.Counter(sampled))
            c_sampled.update({p: cell[p]})
            c_sampled['1'] = c_sampled.pop(p)
            # batch update of the dictionary keys
            all_dict.update(
                dict(zip([p + '_%s' % k for k in c_sampled.keys()], c_sampled.values()))
            )
    return pd.Series(all_dict)
It returns:
LPC_2 LPC_3 LPC_1 CPC_2 CPC_3 CPC_1 BQ_2 BQ_3 BQ_1
0 891.0 487.0 424.0 743.0 373.0 572.0 1313.0 683.0 2.0
1 2629.0 1342.0 2053.0 3098.0 1603.0 1584.0 4565.0 2301.0 117.0
2 2357.0 1186.0 40.0 2205.0 1047.0 192.0 2395.0 1171.0 2.0
3 919.0 451.0 12.0 586.0 288.0 345.0 929.0 455.0 2.0
4 633.0 309.0 209.0 766.0 399.0 76.0 795.0 396.0 47.0
... ... ... ... ... ... ... ... ... ...
1520 1088.0 536.0 42.0 970.0 462.0 160.0 1117.0 540.0 13.0
1521 4742.0 2341.0 219.0 3655.0 1865.0 1306.0 4705.0 2375.0 256.0
1522 19350.0 9733.0 4458.0 10661.0 5352.0 13147.0 22220.0 11100.0 1588.0
1523 513.0 264.0 126.0 599.0 267.0 40.0 599.0 306.0 40.0
1524 9804.0 4885.0 673.0 10144.0 5012.0 333.0 10464.0 5162.0 13.0
It's almost good. I would have preferred the preferences to be encoded dynamically rather than hard-coding ['1','2','3']. A sketch of one way to do that follows.
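A possible way to derive the labels from the party list (this is my own variation, not from the original answer; fill_cells_dynamic is a hypothetical name):
import collections
import numpy as np
import pandas as pd

def fill_cells_dynamic(cell, parties):
    # one preference slot per party: '1', '2', ..., str(len(parties))
    labels = [str(i) for i in range(1, len(parties) + 1)]
    all_dict = {}
    for p in parties:
        # voters who did not pick p first are spread over the remaining slots
        other = labels[1:]
        sampled = np.random.choice(other, int(cell['Votes'] - cell[p]))
        c_sampled = dict(collections.Counter(sampled))
        c_sampled['1'] = cell[p]  # first-choice count is the recorded vote
        all_dict.update({'%s_%s' % (p, k): v for k, v in c_sampled.items()})
    return pd.Series(all_dict)

# usage (parties inferred from the frame's columns):
# parties = [c for c in df.columns if c not in ['Comm', 'Votes']]
# new_df = df.apply(fill_cells_dynamic, axis=1, parties=parties)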
I am running the following code to calculate for every dataframe row the number of positive days in the previous rows and the number of days in which the stock has beaten the S&P 500 index:
for offset in [1, 5, 15, 30, 45, 60, 75, 90, 120, 150,
               200, 250, 500, 750, 1000, 1250, 1500]:
    asset['return_stock'] = (asset.Close - asset.Close.shift(1)) / (asset.Close.shift(1))
    merged_data = pd.merge(asset, sp_500, on='Date')
    total_positive_days = 0
    total_beating_sp_days = 0
    for index, row in merged_data.iterrows():
        print(offset, index)
        for i in range(0, offset):
            if index - i - 1 > 0:
                if merged_data.loc[index - i, 'Close_x'] > merged_data.loc[index - i - 1, 'Close_x']:
                    total_positive_days += 1
                if merged_data.loc[index - i, 'return_stock'] > merged_data.loc[index - i - 1, 'return_sp']:
                    total_beating_sp_days += 1
but it is quite slow. Is there a way to speed it up (possibly by somehow getting rid of the for loop)?
My dataset looks like this (merged_data follows):
Date Open_x High_x Low_x Close_x Adj Close_x Volume_x return_stock Pct_positive_1 Pct_beating_1 Pct_change_1 Pct_change_plus_1 Pct_positive_5 Pct_beating_5 Pct_change_5 Pct_change_plus_5 Pct_positive_15 Pct_beating_15 Pct_change_15 Pct_change_plus_15 Pct_positive_30 Pct_beating_30 Pct_change_30 Pct_change_plus_30 Open_y High_y Low_y Close_y Adj Close_y Volume_y return_sp
0 2010-01-04 30.490000 30.642857 30.340000 30.572857 26.601469 123432400 NaN 1311.0 1261.0 NaN -0.001726 1310.4 1260.8 NaN 0.018562 1307.2 1257.6 NaN 0.039186 1302.066667 1252.633333 NaN 0.056579 1116.560059 1133.869995 1116.560059 1132.989990 1132.989990 3991400000 0.016043
1 2010-01-05 30.657143 30.798571 30.464285 30.625713 26.647457 150476200 0.001729 1311.0 1261.0 0.001729 0.016163 1310.4 1260.8 NaN 0.032062 1307.2 1257.6 NaN 0.031268 1302.066667 1252.633333 NaN 0.056423 1132.660034 1136.630005 1129.660034 1136.520020 1136.520020 2491020000 0.003116
2 2010-01-06 30.625713 30.747143 30.107143 30.138571 26.223597 138040000 -0.015906 1311.0 1261.0 -0.015906 0.001852 1310.4 1260.8 NaN 0.001519 1307.2 1257.6 NaN 0.058608 1302.066667 1252.633333 NaN 0.046115 1135.709961 1139.189941 1133.949951 1137.140015 1137.140015 4972660000 0.000546
3 2010-01-07 30.250000 30.285715 29.864286 30.082857 26.175119 119282800 -0.001849 1311.0 1261.0 -0.001849 -0.006604 1310.4 1260.8 NaN 0.005491 1307.2 1257.6 NaN 0.096428 1302.066667 1252.633333 NaN 0.050694 1136.270020 1142.459961 1131.319946 1141.689941 1141.689941 5270680000 0.004001
4 2010-01-08 30.042856 30.285715 29.865715 30.282858 26.349140 111902700 0.006648 1311.0 1261.0 0.006648 0.008900 1310.4 1260.8 NaN 0.029379 1307.2 1257.6 NaN 0.088584 1302.066667 1252.633333 NaN 0.075713 1140.520020 1145.390015 1136.219971 1144.979980 1144.979980 4389590000 0.002882
asset follows:
Date Open High Low Close Adj Close Volume return_stock Pct_positive_1 Pct_beating_1 Pct_change_1 Pct_change_plus_1 Pct_positive_5 Pct_beating_5 Pct_change_5 Pct_change_plus_5
0 2010-01-04 30.490000 30.642857 30.340000 30.572857 26.601469 123432400 NaN 1311.0 1261.0 NaN -0.001726 1310.4 1260.8 NaN 0.018562
1 2010-01-05 30.657143 30.798571 30.464285 30.625713 26.647457 150476200 0.001729 1311.0 1261.0 0.001729 0.016163 1310.4 1260.8 NaN 0.032062
2 2010-01-06 30.625713 30.747143 30.107143 30.138571 26.223597 138040000 -0.015906 1311.0 1261.0 -0.015906 0.001852 1310.4 1260.8 NaN 0.001519
3 2010-01-07 30.250000 30.285715 29.864286 30.082857 26.175119 119282800 -0.001849 1311.0 1261.0 -0.001849 -0.006604 1310.4 1260.8 NaN 0.005491
4 2010-01-08 30.042856 30.285715 29.865715 30.282858 26.349140 111902700 0.006648 1311.0 1261.0 0.006648 0.008900 1310.4 1260.8 NaN 0.029379
sp_500 follows:
Date Open High Low Close Adj Close Volume return_sp
0 1999-12-31 1464.469971 1472.420044 1458.189941 1469.250000 1469.250000 374050000 NaN
1 2000-01-03 1469.250000 1478.000000 1438.359985 1455.219971 1455.219971 931800000 -0.009549
2 2000-01-04 1455.219971 1455.219971 1397.430054 1399.420044 1399.420044 1009000000 -0.038345
3 2000-01-05 1399.420044 1413.270020 1377.680054 1402.109985 1402.109985 1085500000 0.001922
4 2000-01-06 1402.109985 1411.900024 1392.099976 1403.449951 1403.449951 1092300000 0.000956
This is a partial answer.
I think the way you do
asset.Close - asset.Close.shift(1)
at the top is key to how you might do this. Instead of
if merged_data.loc[index-i,'Close_x'] > merged_data.loc[index-i-1,'Close_x']
create a column with the change in Close_x:
merged_data['Delta_Close_x'] = merged_data.Close_x - merged_data.Close_x.shift(1)
Similarly,
if merged_data.loc[index-i,'return_stock'] > merged_data.loc[index-i-1,'return_sp']
becomes
merged_data['vs_sp'] = merged_data.return_stock - merged_data.return_sp.shift(1)
Then you can iterate i and use boolean-indexed subsets like (note the & operator; plain `and` raises an error on Series):
merged_data[(merged_data['Delta_Close_x'] > 0) & (merged_data['vs_sp'] > 0)]
There are a lot of additional details to work out, but I hope this gets you started.
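Building on that idea, here is a sketch of how the nested loops could become rolling sums (my own extension of this partial answer; the positive_*/beating_* column names are made up):
# 1 where the stock rose / beat the previous day's S&P return, else 0
merged_data['up_day'] = (merged_data['Close_x'].diff() > 0).astype(int)
merged_data['beat_sp'] = (merged_data['return_stock'] > merged_data['return_sp'].shift(1)).astype(int)

for offset in [1, 5, 15, 30, 45, 60]:
    # count of such days in the trailing `offset` rows, computed for every row at once
    merged_data['positive_%d' % offset] = merged_data['up_day'].rolling(offset, min_periods=1).sum()
    merged_data['beating_%d' % offset] = merged_data['beat_sp'].rolling(offset, min_periods=1).sum()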
I have an issue calculating the rolling mean for a column I added in the code. For some reason, it doesn't work on the column I added but works on a column from the original csv.
Original dataframe from the csv as follow:
Open High Low Last Change Volume Open Int
Time
09/20/19 98.50 99.00 98.35 98.95 0.60 3305.0 0.0
09/19/19 100.35 100.75 98.10 98.35 -2.00 17599.0 0.0
09/18/19 100.65 101.90 100.10 100.35 0.00 18258.0 121267.0
09/17/19 103.75 104.00 100.00 100.35 -3.95 34025.0 122453.0
09/16/19 102.30 104.95 101.60 104.30 1.55 21403.0 127447.0
Ticker = pd.read_csv('\\......\Historical data\kcz19 daily.csv',
index_col=0, parse_dates=True)
Ticker['Return'] = np.log(Ticker['Last'] / Ticker['Last'].shift(1)).fillna('')
Ticker['ret20'] = Ticker['Return'].rolling(window=20, win_type='triang').mean()
print(Ticker.head())
Open High Low ... Open Int Return ret20
Time ...
09/20/19 98.50 99.00 98.35 ... 0.0
09/19/19 100.35 100.75 98.10 ... 0.0 -0.00608213 -0.00608213
09/18/19 100.65 101.90 100.10 ... 121267.0 0.0201315 0.0201315
09/17/19 103.75 104.00 100.00 ... 122453.0 0 0
09/16/19 102.30 104.95 101.60 ... 127447.0 0.0386073 0.0386073
The ret20 column should hold the rolling mean of the Return column, so it should only show data starting from row 21, whereas here it is just a copy of the Return column.
If I replace Return with the Last column, it works.
Below is the result using column Last:
Open High Low ... Open Int Return ret20
Time ...
09/20/19 98.50 99.00 98.35 ... 0.0 NaN
09/19/19 100.35 100.75 98.10 ... 0.0 -0.00608213 NaN
09/18/19 100.65 101.90 100.10 ... 121267.0 0.0201315 NaN
09/17/19 103.75 104.00 100.00 ... 122453.0 0 NaN
09/16/19 102.30 104.95 101.60 ... 127447.0 0.0386073 NaN
09/13/19 103.25 103.60 102.05 ... 128707.0 -0.0149725 NaN
09/12/19 102.80 103.85 101.15 ... 128904.0 0.00823848 NaN
09/11/19 102.00 104.70 101.40 ... 132067.0 -0.00193237 NaN
09/10/19 98.50 102.25 98.00 ... 135349.0 -0.0175614 NaN
09/09/19 97.00 99.25 95.30 ... 137347.0 -0.0335283 NaN
09/06/19 95.35 97.30 95.00 ... 135399.0 -0.0122889 NaN
09/05/19 96.80 97.45 95.05 ... 136142.0 -0.0171477 NaN
09/04/19 95.65 96.95 95.50 ... 134864.0 0.0125002 NaN
09/03/19 96.00 96.60 94.20 ... 134685.0 -0.0109291 NaN
08/30/19 95.40 97.20 95.10 ... 134061.0 0.0135137 NaN
08/29/19 97.05 97.50 94.75 ... 132639.0 -0.0166584 NaN
08/28/19 97.40 98.15 95.95 ... 130573.0 0.0238601 NaN
08/27/19 97.35 98.00 96.40 ... 129921.0 -0.00410889 NaN
08/26/19 95.55 98.50 95.25 ... 129003.0 0.0035962 NaN
08/23/19 96.90 97.40 95.05 ... 130268.0 -0.0149835 98.97775
Appreciate any help.
The .fillna('') is putting a string in the first row, which makes the column object dtype and breaks the rolling calculation for Ticker['ret20'].
Delete it and the code will run fine:
Ticker['Return'] = np.log(Ticker['Last'] / Ticker['Last'].shift(1))
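If the blank cells were only wanted for display, one option (my suggestion, not part of the original answer) is to keep the columns numeric and fill only at print time:
Ticker['Return'] = np.log(Ticker['Last'] / Ticker['Last'].shift(1))
Ticker['ret20'] = Ticker['Return'].rolling(window=20, win_type='triang').mean()

# blank out NaNs only in the printed view; the underlying columns stay float
print(Ticker.head(25).fillna(''))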