I have a function that reruns calculations defined in other parts of my code. It repeats them dynamically, once per position in the binary input a user supplies in an earlier section. The calculations work, but I cannot get the function to create multiple columns in the dataframe. How can I get it to create a new column for each pass of the "while" loop? Is that even possible? I tried creating 9 columns up front to see if that would work, but I get a "mismatched columns" error.
def multipleldlc():
    outputldlc = pd.DataFrame(columns=["ldldcisfhprob1", "ldldcisfhprob2", "ldldcisfhprob3", "ldldcisfhprob4", "ldldcisfhprob5", "ldldcisfhprob6", "ldldcisfhprob7", "ldldcisfhprob8", "ldldcisfhprob9"])
    y = 0
    numcols = len(binary[0])
    print(numcols)
    while y < numcols:
        for item in binary:
            for i, j in pinput.iterrows():
                agej = j.values[0]
                ldlcj = j.values[1]
                totalcj = j.values[2]
                ldlcisfh = ldlcfh(agej, ldlcj, totalcj)
                ldlcisnotfh = ldlcfh(agej, ldlcj, totalcj)
                if item[y] == 0:
                    ldlcprob = ldlcisnotfh
                    outputldlc.loc[i] = [ldlcprob]
                else:
                    ldlcprob = ldlcisfh
                    outputldlc.loc[i] = [ldlcprob]
        y += 1
    print(outputldlc)
    return outputldlc
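The "mismatched columns" error most likely comes from `outputldlc.loc[i] = [ldlcprob]`: it assigns a one-element list to a row that has nine columns. One way around it is to collect the values for each pass in a list and build the dataframe at the end, one column per pass. The following is only a sketch under assumptions: `pinput`, `binary`, `prob_fh` and `prob_not_fh` are made-up stand-ins (the real `ldlcfh` is not shown in the question), and `binary` is assumed to hold one 0/1 list per row of `pinput`.

```python
import pandas as pd

# Hypothetical stand-ins for objects defined elsewhere in the question's code.
pinput = pd.DataFrame({"age": [40, 55], "ldlc": [4.9, 6.1], "totalc": [7.0, 8.2]})
binary = [[0, 1, 1], [1, 0, 1]]  # assumed: one 0/1 list per row of pinput


def prob_fh(age, ldlc, totalc):
    # Placeholder formula, NOT the real ldlcfh calculation.
    return round(ldlc / totalc, 3)


def prob_not_fh(age, ldlc, totalc):
    # Placeholder complement of the value above.
    return round(1 - prob_fh(age, ldlc, totalc), 3)


def multiple_ldlc(pinput, binary):
    numcols = len(binary[0])
    columns = {}
    for y in range(numcols):
        values = []
        # Pair each row of pinput with its list of 0/1 flags.
        for flags, (_, row) in zip(binary, pinput.iterrows()):
            age, ldlc, totalc = row.iloc[0], row.iloc[1], row.iloc[2]
            if flags[y] == 0:
                values.append(prob_not_fh(age, ldlc, totalc))
            else:
                values.append(prob_fh(age, ldlc, totalc))
        # One new column per pass; the name is generated, not hard-coded.
        columns[f"ldlcisfhprob{y + 1}"] = values
    return pd.DataFrame(columns, index=pinput.index)


outputldlc = multiple_ldlc(pinput, binary)
print(outputldlc)
```

Because the column names are generated inside the loop, the number of columns follows `len(binary[0])` and nothing has to be declared in advance.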
I am new to pandas, and I'm learning it through its web documentation. I am facing issues in converting the following SAS code to pandas.
My SAS code:
data tmp2;
set tmp1;
retain group 0;
if _n_=1 and group_v1 = -1 then group = group_v1;
else if _n_=1 and group_v1 ne -1 then group=0;
else group=group+1;
run;
Note: In the above code group_v1 is a column from tmp1
There may be a more succinct and efficient way to do this in pandas, but this approach quite closely matches what SAS does internally when your code is run:
import pandas as pd

tmp1 = pd.DataFrame({"group_v1": [-1, 0, 1]})

def build_tmp2(tmp1):
    # Contains the new rows for tmp2
    _tmp2 = []
    # Loop over the rows of tmp1 - like a data step does
    for i, row in tmp1.iterrows():
        # equivalent to the data statement - copy the current row to memory
        tmp2 = row.copy()
        # _N_ is equivalent to i, except i starts at zero in Pandas/Python
        # (this assumes tmp1 has the default 0, 1, 2, ... index)
        if i == 0:
            # Create a new variable called pdv to contain values across loops
            # This is equivalent to the Program Data Vector in SAS
            pdv = {}
            if row['group_v1'] == -1:
                pdv['group'] = row['group_v1']
            else:
                pdv['group'] = 0
        else:
            # Equivalent to both retain group and also group=group+1
            # (only on rows after the first, matching the final SAS else branch)
            pdv['group'] += 1
        # Copy the accumulating group variable to the target row
        tmp2['group'] = pdv['group']
        # Append the updated row to the list
        _tmp2.append(tmp2.copy())
    # After the loop has finished build the new DataFrame from the list
    return pd.DataFrame(_tmp2)

build_tmp2(tmp1)
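For the more succinct route: in the SAS step only the first row decides the starting value of group, and every later row adds 1 to it. So the whole column is a start value plus a running row counter. A minimal sketch, assuming tmp1 is non-empty and using the same sample data as above:

```python
import numpy as np
import pandas as pd

tmp1 = pd.DataFrame({"group_v1": [-1, 0, 1]})

# The first row decides the starting value: -1 if group_v1 is -1, otherwise 0.
start = -1 if tmp1["group_v1"].iloc[0] == -1 else 0

# Each following row adds 1, so group is start + 0, start + 1, start + 2, ...
tmp2 = tmp1.assign(group=start + np.arange(len(tmp1)))
print(tmp2)
```

This avoids iterrows entirely, which tends to matter once tmp1 has more than a few thousand rows.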
I am writing Python code with a loop: as long as a condition is true, I want the calculations to run and update the dataframe columns. However, I am noticing that the dataframe is not getting updated, and all the values are those of the 1st iteration only. Can someone point out where I am going wrong? Below is my sample code:
'''
mbd_out_ub2 = mbd_out_ub1
mbd_out_ub2_len = len(mbd_out_ub2)
plt_mbd_c1_all = pd.DataFrame()
brd2c2_all = pd.DataFrame()
iterc=1
### plt_mbd_c >> this is the data frame with data before the loop starts
plt_mbd_c0 = plt_mbd_c.copy()
plt_mbd_c0 = plt_mbd_c0[plt_mbd_c0['UB_OUT']==1]
while (iterc < 10):
    plt_mbd_c1 = plt_mbd_c0.copy()
    brd2c2 = plt_mbd_c1.groupby('KEY1')['NEST_VAL_PER'].agg([('KEY1_CNT','count'),('PER1c', lambda x: x.quantile(0.75))]).reset_index()
    brd2c2_all = brd2c2_all.append(brd2c2).reset_index(drop=True)
    plt_mbd_c1 = pd.merge(plt_mbd_c1,brd2c2[['KEY1','PER1c']],on='KEY1', how='left')
    del brd2c2, plt_mbd_c0
    plt_mbd_c1['NEST_VAL_PER1'] = plt_mbd_c1['PER1c'] * (plt_mbd_c1['EVAL_LP_%'] / 100)
    plt_mbd_c1['NEST_VAL_PER1'] = np.where((plt_mbd_c1['BRD_OUT_FLAG'] == 0),plt_mbd_c1['NEST_VAL'],plt_mbd_c1['NEST_VAL_PER1'] )
    plt_mbd_c1['SALESC'] = plt_mbd_c1['NEST_VAL_PER1']/plt_mbd_c1['PROJR']/plt_mbd_c1['NEWPRICE']
    plt_mbd_c1['C_SALES_C'] = np.where(plt_mbd_c1['OUT_FLAG'] == 1,plt_mbd_c1['SALESC'],plt_mbd_c1['SALESUNIT'])
    plt_mbd_c1['NEST_VAL_PER'] = plt_mbd_c1['C_SALES_C'] * plt_mbd_c1['PROJR'] * plt_mbd_c1['NEWPRICE']
    plt_mbd_c1['ITER'] = iterc
    plt_mbd_c1_all = plt_mbd_c1_all.append(plt_mbd_c1).reset_index(drop=True)
    plt_mbd_c1.drop(['PER1c'],axis=1,inplace=True)
    plt_mbd_c0 = plt_mbd_c1.copy()
    del plt_mbd_c1
    print("iter = ",iterc)
    iterc = iterc + 1
'''
So above I want to take the 75th percentile of a column by KEY1 and do a few calculations. The idea is that after every iteration my 75th percentile will keep reducing, because I am updating the same column with a calculated value that is lower than the current value (since it is based on the 75th percentile). However, when I check, the values for all iterations are the same as in the 1st iteration. I have tried deleting the dataframes, saving to a temp dataframe, and copying the dataframe, but none of these seems to work.
Please help !!
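The carry-over pattern itself (compute on the current frame, store a snapshot, feed the result into the next pass) does work, as the stripped-down sketch below suggests. It is only an illustration: the data is made up, the column names are borrowed from the code above, and the calculation is reduced to "75th percentile times EVAL_LP_%". It uses pd.concat because DataFrame.append was removed in pandas 2.0. One thing that may be worth checking in the full loop: for rows where BRD_OUT_FLAG is 0 or OUT_FLAG is not 1, the np.where calls pick NEST_VAL or SALESUNIT, which the loop never updates, so NEST_VAL_PER for those rows would come out the same on every pass.

```python
import pandas as pd

# Made-up data; only the column names come from the question.
current = pd.DataFrame({
    "KEY1": ["a", "a", "a", "b", "b"],
    "NEST_VAL_PER": [10.0, 20.0, 30.0, 5.0, 15.0],
    "EVAL_LP_%": [90.0, 90.0, 90.0, 80.0, 80.0],
})

snapshots = []
for iterc in range(1, 4):
    # 75th percentile per KEY1, broadcast back onto every row of the group.
    per1c = current.groupby("KEY1")["NEST_VAL_PER"].transform(
        lambda x: x.quantile(0.75)
    )
    # assign returns a new frame, so earlier snapshots are left untouched.
    current = current.assign(
        NEST_VAL_PER=per1c * current["EVAL_LP_%"] / 100,
        ITER=iterc,
    )
    snapshots.append(current)

# Stack all iterations into one frame.
all_iters = pd.concat(snapshots, ignore_index=True)
print(all_iters.groupby(["ITER", "KEY1"])["NEST_VAL_PER"].first())
```

With this toy data the value for each KEY1 shrinks on every pass, which is the behaviour described in the question.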
Probably easy, but I am still learning.
I am creating a new column in a dask dataframe. Its value should be derived from the last four characters of a date column stored as a string in ddmmyyyy format.
What I did:
I have a list, inv_years.
I extract the last four characters of the date string.
I tried to define a function that returns 1 in a new column if the extracted year is in the inv_years list, else 0.
Issue: How do I write a working function, or better, a lambda function in fewer lines?
def valid_yr(x):
    inv_years = ['1921','1969','2026','2030','2041','2060','2062']
    validity_year = ddf['string_ddmmyyyy'].str[-4:] #extract the last four to get the year
    if validity_year.isin(inv_years):
        x = 1
    else:
        x = 0
    return x
#create a new column and apply function
ddf['validity_year']= ??? # what to write here?
A rather clunky way I could come up with is
inv_years = ['1921','1969','2026','2030','2041','2060','2062']
ddf['validity_year'] = ddf.apply(lambda row: 1 if row.string_ddmmyyyy[-4:] in inv_years else 0, axis=1)
Or, to get your approach working, we first modify your function a bit so that its argument is a single row.
def valid_yr(row):
    inv_years = ['1921','1969','2026','2030','2041','2060','2062']
    validity_year = row.string_ddmmyyyy[-4:]
    if validity_year in inv_years:
        x = 1
    else:
        x = 0
    return x
Now we can apply this function to all rows.
ddf['validity_year'] = ddf.apply(valid_yr, axis=1)
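A row-wise apply can usually be avoided here, because the slice and the membership test both exist as column-level operations. The sketch below uses plain pandas with made-up dates so it runs on its own. The same expression is expected to work on a dask dataframe, since dask mirrors the pandas .str and isin methods, but that is an assumption and not verified here.

```python
import pandas as pd

inv_years = ['1921', '1969', '2026', '2030', '2041', '2060', '2062']

# Made-up sample data in ddmmyyyy format.
df = pd.DataFrame({"string_ddmmyyyy": ["01011921", "15062020", "31122060"]})

# Slice the last four characters of every string, test membership,
# then turn True/False into 1/0.
df["validity_year"] = df["string_ddmmyyyy"].str[-4:].isin(inv_years).astype(int)
print(df)
```

Working on the whole column at once also means there is no per-row Python function call, which is usually the slow part of apply with axis=1.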
I'm trying to create multiple columns (a couple of hundred) using values within the same df. Is there a more efficient way to create multiple columns in batches? Below is an example where I have to manually type the new column names jwrl2_rank.r1, jwrl2_rank.1r1, jwrl2_rank.2r1, etc. next to the formula.
i0, i1, i2, ... are the original column names,
and rn is the value to match within the column.
i0='jwrl2_rank'
i1='jwrl2_rank.1'
i2='jwrl2_rank.2'
i3='jwrl2_rank.3'
i4='jwrl2_rank.4'
i5='jwrl2_rank.5'
i6='jwrl2_rank.6'
i7='jwrl2_rank.7'
rn=1
df['jwrl2_rank.r1']=((df.loc[(df[i0]==rn)&(df['result']==1),'timing'].sum())/(df.loc[(df[i0]==rn),i0].count()))-1
df['jwrl2_rank.1r1']=((df.loc[(df[i1]==rn)&(df['result']==1),'timing'].sum())/(df.loc[(df[i1]==rn),i1].count()))-1
df['jwrl2_rank.2r1']=((df.loc[(df[i2]==rn)&(df['result']==1),'timing'].sum())/(df.loc[(df[i2]==rn),i2].count()))-1
df['jwrl2_rank.3r1']=((df.loc[(df[i3]==rn)&(df['result']==1),'timing'].sum())/(df.loc[(df[i3]==rn),i3].count()))-1
df['jwrl2_rank.4r1']=((df.loc[(df[i4]==rn)&(df['result']==1),'timing'].sum())/(df.loc[(df[i4]==rn),i4].count()))-1
df['jwrl2_rank.5r1']=((df.loc[(df[i5]==rn)&(df['result']==1),'timing'].sum())/(df.loc[(df[i5]==rn),i5].count()))-1
df['jwrl2_rank.6r1']=((df.loc[(df[i6]==rn)&(df['result']==1),'timing'].sum())/(df.loc[(df[i6]==rn),i6].count()))-1
df['jwrl2_rank.7r1']=((df.loc[(df[i7]==rn)&(df['result']==1),'timing'].sum())/(df.loc[(df[i7]==rn),i7].count()))-1
many thanks. regards
Using a for loop should work.
Incrementing string value
By using string interpolation you could solve your problem. See here for a quick introduction. I am using f-strings in the example below.
base_name='jwrl2_rank'
MAX_NUMBER = 3
for i in range(1, MAX_NUMBER + 1):
    new_name = f"{base_name}.{i}"
    print(new_name)
>>>
jwrl2_rank.1
jwrl2_rank.2
jwrl2_rank.3
Example of for loop
base_name='jwrl2_rank'
MAX_NUMBER = 3
for i in range(MAX_NUMBER + 1):
    current_iN = f"{base_name}.{i}"
    new_col_name = f"{base_name}.{i}r1"
    if i == 0: # compensate for missing zero in column name
        current_iN = base_name
        new_col_name = f"{base_name}.r1"
    df[new_col_name]=((df.loc[(df[current_iN]==rn)&(df['result']==1),'timing'].sum())/(df.loc[(df[current_iN]==rn),current_iN].count()))-1
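With a couple of hundred columns it may also help to pull the formula into a small function, collect the results in a dict, and attach them in one step instead of one assignment per column. The following is a sketch on made-up data; `rank_return`, `source_cols` and the sample values are hypothetical names introduced for illustration.

```python
import pandas as pd

# Made-up sample data; column names follow the question.
df = pd.DataFrame({
    "jwrl2_rank":   [1, 2, 1, 3],
    "jwrl2_rank.1": [2, 1, 2, 1],
    "result":       [1, 0, 0, 1],
    "timing":       [3.0, 3.0, 4.0, 5.0],
})
rn = 1


def rank_return(frame, col, rn):
    """Sum of timing over winning rows with rank rn, divided by the number of rows with rank rn, minus 1."""
    hit = frame[col] == rn
    return frame.loc[hit & (frame["result"] == 1), "timing"].sum() / hit.sum() - 1


source_cols = ["jwrl2_rank", "jwrl2_rank.1"]
new_values = {}
for col in source_cols:
    # "jwrl2_rank" -> "jwrl2_rank.r1", "jwrl2_rank.1" -> "jwrl2_rank.1r1"
    base, _, suffix = col.partition(".")
    new_values[f"{base}.{suffix}r{rn}"] = rank_return(df, col, rn)

# Attach all new columns at once; each scalar is broadcast to every row.
df = df.assign(**new_values)
print(df)
```

Wrapping the loop in a second loop over several rn values would extend the same idea to the r2, r3, ... columns.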
I am calculating a net balance from a condition and I want to store it in a new variable. The new variable should hold this calculated value as an integer or a float, not as an array.
I have tried the following code:
#variable = something if condition else something_else
mar_final_bal = x_start_bal+df2['credit_line']+df2['Net_Balance'] if
df2['month' == 'March-2016']
apr_final_bal = mar_final_bal+df2['credit_line']+df2['Net_Balance'] if
df2['month' == 'Apr-2016']
mar_final_bal and apr_final_bal are my two variables that I want to create using the conditions on the right side
It is evident that you are new to using pandas. The syntax looks more like pseudocode than pandas code. If I understand correctly, this is what you meant:
mar_final_bal = x_start_bal+df2.loc[df2['month'] == 'March-2016', 'credit_line'].sum() + df2.loc[df2['month'] == 'March-2016', 'Net_Balance'].sum()
apr_final_bal = mar_final_bal+df2.loc[df2['month'] == 'Apr-2016', 'credit_line'].sum() + df2.loc[df2['month'] == 'Apr-2016', 'Net_Balance'].sum()
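To see that the result really is a single number and not an array, here is a small self-contained sketch. The data and `x_start_bal` are made up, and `month_total` is a hypothetical helper that just wraps the boolean-mask-plus-sum pattern used above.

```python
import pandas as pd

# Made-up sample data; column names follow the question.
df2 = pd.DataFrame({
    "month": ["March-2016", "March-2016", "Apr-2016"],
    "credit_line": [100.0, 50.0, 25.0],
    "Net_Balance": [10.0, -5.0, 5.0],
})
x_start_bal = 1000.0


def month_total(frame, month):
    """Sum credit_line and Net_Balance over the rows of one month, as a plain float."""
    mask = frame["month"] == month
    return float(frame.loc[mask, ["credit_line", "Net_Balance"]].to_numpy().sum())


mar_final_bal = x_start_bal + month_total(df2, "March-2016")
apr_final_bal = mar_final_bal + month_total(df2, "Apr-2016")
print(mar_final_bal, apr_final_bal)  # 1155.0 1185.0
```

The key point is the reduction step: filtering with a mask still returns a Series or DataFrame, and it is the call to sum that collapses it to one value.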