Python pandas column operations - python

I'm new to pandas and I'm stuck on a column operation on a dataframe.
Wherever "Prevous_Line_Has_Br" is "Yes", the buffer should be added to the "OldTop" value. Wherever it is "No", the value should stop incrementing and carry over the previous row's value, and incrementing should resume at the next "Yes".
I have tried something like this:
temp_df["CheckBr"] = temp_df["Prevous_Line_Has_Br"].shift(1)
temp_df["CheckBr"] = temp_df["CheckBr"].fillna("dummy")
temp_df.insert(0, 'New_ID', range(0, 0 + len(temp_df)))
temp_df["NewTop"] = "NoIncr"
temp_df["MyTop"] = 0
temp_df.loc[(temp_df["Prevous_Line_Has_Br"] == "Yes") & (temp_df["CheckBr"] == "Yes"), "NewTop"] = "Incr"
temp_df.loc[(temp_df["Prevous_Line_Has_Br"] == "Yes") & (temp_df["CheckBr"] == "No"), "NewTop"] = "Incr"
temp_df.loc[(temp_df["Prevous_Line_Has_Br"] == "Yes") & (temp_df["CheckBr"] == "dummy"), "NewTop"] = "Incr"
temp_df.loc[(temp_df["NewTop"]=="Incr"),"MyTop" ] = new_top + (temp_df.New_ID * temp_df.buffer)
temp_df.loc[(temp_df["CheckBr"] == "Yes") & (temp_df["MyTop"] == 0), "MyTop"] = temp_df["MyTop"].shift(1)
This gives me the following output; I want to achieve the same result without a for loop:
Can someone please help me compute these values in the original dataframe using pandas?
This is what I want to achieve in the end:

This would be fairly easy to do if you moved away from pandas and treated the columns as plain lists. If you still want to use the apply method, you can use a decorator to keep track of the previous row.
def apply_func_decorator(func):
    prev_row = {}  # holds the previous row plus its computed NewTop

    def wrapper(curr_row, **kwargs):
        val = func(curr_row, prev_row)
        prev_row.update(curr_row)
        prev_row['NewTop'] = val
        return val

    return wrapper

@apply_func_decorator
def add_buffer_and_top(curr_row, prev_row):
    if curr_row.Prevous_Line_Has_Br == 'Yes':
        if prev_row:
            return curr_row.buffer + prev_row['NewTop']
        # first row: there is no previous row yet, so start from its own OldTop
        return curr_row.buffer + curr_row['OldTop']
    # "No": carry the previous value forward (or OldTop on the first row)
    return prev_row['NewTop'] if prev_row else curr_row['OldTop']

temp_df['NewTop'] = 0
temp_df['NewTop'] = temp_df.apply(add_buffer_and_top, axis=1)
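The plain-list alternative mentioned at the start of this answer might look roughly like the sketch below. The function name `running_top` and the sample values are made up for illustration; the logic assumes the same rule as above (add the buffer on "Yes", carry the previous value on "No").

```python
def running_top(flags, old_tops, buffers):
    """Hypothetical helper: compute the running top from three plain lists."""
    new_tops = []
    for flag, old_top, buffer in zip(flags, old_tops, buffers):
        # Start from the previous result, or from OldTop on the very first row.
        base = new_tops[-1] if new_tops else old_top
        if flag == "Yes":
            new_tops.append(base + buffer)
        else:
            new_tops.append(base)
    return new_tops


# Made-up sample data, not the asker's dataframe.
result = running_top(["Yes", "Yes", "No", "Yes"], [100, 100, 100, 100], [10, 10, 10, 10])
print(result)
```

With a dataframe, the three lists could be taken from the columns with `.tolist()` and the result assigned back as a new column.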

This is how I achieved the output I wanted:
import numpy as np

m = temp_df['Prevous_Line_Has_Br'].eq('Yes')
temp_df['New_ID'] = m.cumsum().where(m, np.nan)
temp_df["New_ID"] = temp_df["New_ID"].ffill()
temp_df["Top"] = temp_df['Old_Top'] + (temp_df['New_ID'] * temp_df['buffer'])
Column New_ID is incremented only where column Prevous_Line_Has_Br is 'Yes'; on 'No' rows it keeps the previous value.
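To make the intermediate steps easier to follow, here is a small self-contained run of the same idea. The sample values are invented for illustration and are not the original data.

```python
import pandas as pd

# Invented sample data using the column names from the answer above.
temp_df = pd.DataFrame({
    "Prevous_Line_Has_Br": ["Yes", "Yes", "No", "Yes", "No"],
    "Old_Top": [100, 100, 100, 100, 100],
    "buffer": [10, 10, 10, 10, 10],
})

m = temp_df["Prevous_Line_Has_Br"].eq("Yes")   # True on "Yes" rows
counter = m.cumsum()                           # running count of "Yes" rows
temp_df["New_ID"] = counter.where(m).ffill()   # blank out "No" rows, then carry forward
temp_df["Top"] = temp_df["Old_Top"] + temp_df["New_ID"] * temp_df["buffer"]
print(temp_df)
```

Note that if the very first row is "No", New_ID (and therefore Top) stays NaN for that row, because there is nothing to forward-fill from.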

Related

Increase Loop Speed (Pandas dataframe)

Good afternoon,
I have created a function that assigns a value depending on the previous row in a dataframe:
#Function to calculate cycles
def new_cycle (dfTickets_CC, cicle, id, prev_id,prev_status):
    global new_cicle
    if cicle is not None:
        new_cicle = cicle
    elif id != prev_id:
        if not dfTickets_CC.loc[dfTickets_CC['Ticket_ID'].isin([id])].empty:
            new_cicle = dfTickets_CC[dfTickets_CC['Ticket_ID'] == id]['Cicle_lastNr'].values[0] + 1
        else:
            new_cicle = 1
    elif id == prev_id:
        if prev_status == "Completed":
            new_cicle = int(new_cicle)
            new_cicle += 1
        else:
            new_cicle = new_cicle
    return str(new_cicle).split(".")[0]
I call the function while iterating over the dataframe:
#Step 4, Calculating new cicle
ncicle = []
for i in range(len(dfCompilate.index)):
    if i == 0:
        ncicle.append(new_cycle(dfTickets_CC,dfCompilate['Cicle'].values[i],dfCompilate['Ticket_ID'].values[i],None,None))
    else:
        ncicle.append(new_cycle(dfTickets_CC,dfCompilate['Cicle'].values[i],dfCompilate['Ticket_ID'].values[i],dfCompilate['Ticket_ID'].values[i-1],dfCompilate['Status'].values[i-1]))
dfCompilate['New_cicle'] = ncicle
The problem is that, even though it works correctly, it takes a lot of time. For instance, it takes 2 hours to process a dataframe with 500,000 rows.
Does anybody know how to make it faster?
Thanks in advance.
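One thing that may help, offered only as a sketch: the function above filters dfTickets_CC for every row, and the loop re-reads `.values` on every iteration. Building a dictionary once and looping over plain lists avoids both. The helper name `compute_cycles` and the sample frames below are hypothetical; the branching is meant to mirror the question's logic, including keeping the first match per Ticket_ID.

```python
import pandas as pd


def compute_cycles(dfCompilate, dfTickets_CC):
    """Hypothetical rewrite: same branching as new_cycle, but one dict lookup per row."""
    # Built once: Ticket_ID -> Cicle_lastNr (first match, like .values[0]).
    last_nr = (dfTickets_CC.drop_duplicates("Ticket_ID")
               .set_index("Ticket_ID")["Cicle_lastNr"].to_dict())
    out = []
    current = None
    prev_id = prev_status = None
    rows = zip(dfCompilate["Cicle"].tolist(),
               dfCompilate["Ticket_ID"].tolist(),
               dfCompilate["Status"].tolist())
    for cicle, ticket_id, status in rows:
        if cicle is not None:
            current = cicle
        elif ticket_id != prev_id:
            current = last_nr.get(ticket_id, 0) + 1
        elif prev_status == "Completed":
            current = int(current) + 1
        out.append(str(current).split(".")[0])
        prev_id, prev_status = ticket_id, status
    return out


# Invented sample data for illustration only.
tickets = pd.DataFrame({"Ticket_ID": [1], "Cicle_lastNr": [3]})
compilate = pd.DataFrame({
    "Cicle": pd.Series([None, None, None, None], dtype=object),
    "Ticket_ID": [1, 1, 2, 2],
    "Status": ["Completed", "Open", "Completed", "Completed"],
})
compilate["New_cicle"] = compute_cycles(compilate, tickets)
print(compilate)
```

How much this helps depends on the size of dfTickets_CC; it has not been measured against the original data.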

How to create a function based on another dataframe column being True?

I have a dataframe shown below:
  Name      X      Y
0    A  False   True
1    B   True   True
2    C   True  False
I want to create a function for example:
example_function("A") = "A is in Y"
example_function("B") = "B is in X and Y"
example_function("C") = "C is in X"
This is my code currently (incorrect and doesn't look very efficient):
def example_function(name):
    for name in df['Name']:
        if df['X'][name] == True and df['Y'][name] == False:
            print(str(name) + "is in X")
        elif df['X'][name] == False and df['Y'][name] == True:
            print(str(name) + "is in Y")
        else:
            print(str(name) + "is in X and Y")
I eventually want to add more Boolean columns so it needs to be scalable. How can I do this? Would it be better to create a dictionary, rather than a dataframe?
Thanks!
If you really want a function you could do:
def example_function(label):
    s = df.set_index('Name').loc[label]
    l = s[s].index.to_list()
    return f'{label} is in {" and ".join(l)}'
example_function('A')
'A is in Y'
example_function('B')
'B is in X and Y'
You can also compute all the solutions as a dictionary:
s = (df.set_index('Name').replace({False: pd.NA}).stack()
.reset_index(level=0)['Name']
)
out = s.index.groupby(s)
output:
{'A': ['Y'], 'B': ['X', 'Y'], 'C': ['X']}
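As a more explicit alternative, the same mapping can be built with a dictionary comprehension over the rows. This is only a sketch on the question's sample data; the names `flags` and `membership` are made up here, and it scales to more Boolean columns without changes.

```python
import pandas as pd

# The sample dataframe from the question.
df = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "X": [False, True, True],
    "Y": [True, True, False],
})

flags = df.set_index("Name")
# For each name, keep the columns whose flag is True.
membership = {
    name: [col for col in flags.columns if row[col]]
    for name, row in flags.iterrows()
}
print(membership)
```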
I think you can stay with a DataFrame; the same output can be obtained with a function like this:
def func(name, df):
    # some checks to verify that the name is actually in the df
    occurrences_name = np.sum(df['Name'] == name)
    if occurrences_name == 0:
        raise ValueError('Name not found')
    elif occurrences_name > 1:
        raise ValueError('More than one name found')
    # get the index label corresponding to the name you're looking for
    # and select the corresponding row
    index = df[df['Name'] == name].index[0]
    row = df.drop(['Name'], axis=1).loc[index]
    # collect the columns that are True, so separators only go between names
    columns = [col for col, value in row.items() if value]
    return '{} is in {}'.format(name, ', '.join(columns))
Of course you can adapt this to the specific shape of your df; I'm assuming that the column containing names is actually called 'Name'.

How can I create function like that?

def f(s):
    if s['col1'] == 2:
        return s['new_column'] = s['col1']
    elif s['col2'] == 3:
        return s['new_column'] = s['col2']
    else:
        return s['new_column'] = s['col3']
This did not work. I know about np.select, but I have several nested ifs and I must create a column from many conditions. How can I do it?
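One reason the function above fails is that `return s['new_column'] = ...` is a syntax error: an assignment is a statement, so it cannot be returned. A possible approach, sketched here with invented sample values, is to return the value itself and assign the result of `apply` to the new column. The `np.select` version is shown next to it for comparison; nested ifs can stay as ordinary Python inside `f`.

```python
import numpy as np
import pandas as pd

# Invented sample data; only the column names come from the question.
df = pd.DataFrame({"col1": [2, 1, 5], "col2": [7, 3, 9], "col3": [10, 20, 30]})


def f(s):
    # Return the value; the column assignment happens outside the function.
    if s["col1"] == 2:
        return s["col1"]
    elif s["col2"] == 3:
        return s["col2"]
    return s["col3"]


df["new_column"] = df.apply(f, axis=1)

# Equivalent vectorized form: the first matching condition wins.
df["new_column_select"] = np.select(
    [df["col1"].eq(2), df["col2"].eq(3)],
    [df["col1"], df["col2"]],
    default=df["col3"],
)
print(df)
```

The `apply` version is usually slower on large frames but keeps arbitrary nested logic readable.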

Replacing values in specific columns of a dataframe using Python

This is the code to calculate the weight of evidence:
#good is zero bad is one
# Weight of Evidence function for discrete unordered variables
def Weight_of_evidance(df, the_categroical_name, My_target):
    df = pd.concat([df[the_categroical_name], My_target], axis = 1)
    df = pd.concat([df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].count(),
                    df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].mean()], axis = 1)
    df = df.iloc[:, [0, 1, 3]]
    df.columns = [df.columns.values[0], 'Number_of_observation', 'Probation_good_taxPayer']
    df['prop_Number_of_observation'] = df['Number_of_observation'] / df['Number_of_observation'].sum()
    df['N_good'] = df['Probation_good_taxPayer'] * df['Number_of_observation']
    df['n_bad'] = (1 - df['Probation_good_taxPayer']) * df['Number_of_observation']
    df['prop_n_good'] = df['N_good'] / df['N_good'].sum()
    df['prop_of_bad'] = df['n_bad'] / df['n_bad'].sum()
    df['WoE'] = np.log(df['prop_n_good'] / df['prop_of_bad'])
    df['PD']= ((df['N_good'])/(df['n_bad'] + df['N_good']))
    df = df.sort_values(['WoE'])
    df = df.reset_index(drop = True)
    #df['diff_Probation_good_taxPayer'] = df['Probation_good_taxPayer'].diff().abs()
    #df['diff_WoE'] = df['WoE'].diff().abs()
    df['IV'] = (df['prop_n_good'] - df['prop_of_bad']) * df['WoE']
    df['IV'] = df['IV'].sum()
    return df
df_BUSINESS_CATEGORY = Weight_of_evidance(df_input, 'BUSINESS_CATEGORY', df_Label)
# We execute the function we defined with the necessary arguments: a dataframe, a string, and a dataframe.
# We store the result in a dataframe.
df_BUSINESS_CATEGORY
Now I want to replace each value in BUSINESS_CATEGORY with its value from the WoE column (for instance, A becomes -0.978021, and so on). I am currently doing it with the code below:
def flag_df_ISIC_4_ARAB(df_input):
    if (df_input['BUSINESS_CATEGORY'] == 'A'):
        return '-0.978021'
    elif (df_input['BUSINESS_CATEGORY'] == 'اB'):
        return '-0.977854'
    elif (df_input['BUSINESS_CATEGORY'] == 'C'):
        return '0.082918'
    elif (df_input['BUSINESS_CATEGORY'] == 'D'):
        return '0.772306'
    elif (df_input['BUSINESS_CATEGORY'] == 'H'):
        return '-0.176700'
    elif (df_input['BUSINESS_CATEGORY'] == 'أخرى'):
        return '0.955446'
    else:
        return '0'
df_input['BUSINESS_CATEGORY'] = df_input.apply(flag_df_ISIC_4_ARAB, axis = 1).astype(str)
Is there another way to replace the values with their WoE without using a for loop?
Create a dictionary first, pass it to Series.map, and replace unmatched values with '0':
d = {'A':'-0.978021','اB':'-0.977854', 'C':'0.082918',
'D':'0.772306', 'H': '-0.176700', 'أخرى': '0.955446'}
df_input['BUSINESS_CATEGORY'] = df_input['BUSINESS_CATEGORY'].map(d).fillna('0')
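If the WoE table is already available as a dataframe, the lookup can probably be built from it instead of being typed by hand. The sketch below assumes a table with one row per category and a 'WoE' column, like the one returned by the function in the question; the frames `woe_table` and `df_input` here are small made-up stand-ins, and the values are kept as floats rather than strings.

```python
import pandas as pd

# Made-up stand-in for the WoE table (one row per category).
woe_table = pd.DataFrame({
    "BUSINESS_CATEGORY": ["A", "C", "D"],
    "WoE": [-0.978021, 0.082918, 0.772306],
})
# Made-up stand-in for the input data; "Z" has no WoE entry.
df_input = pd.DataFrame({"BUSINESS_CATEGORY": ["A", "D", "Z", "C"]})

# A Series indexed by category works as the mapping for Series.map.
woe_by_category = woe_table.set_index("BUSINESS_CATEGORY")["WoE"]
df_input["BUSINESS_CATEGORY_WoE"] = (
    df_input["BUSINESS_CATEGORY"].map(woe_by_category).fillna(0.0)
)
print(df_input)
```

Writing to a separate column, as done here, keeps the original categories available; overwriting BUSINESS_CATEGORY works the same way.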

Undefined dictionaries in my main function

def monday_availability(openhours_M): #openhours_M = number hours pool is open
    hourone = int(input('Input the first hour in the range of hours the guard can work'))
    hourlast = int(input('Input the last hour in the range of hours the guard'))
    hour = 1
    availability_M = []
    while hour <= openhours_M:
        if hour >= hourone & hour <= hourlast:
            availability_M.append(1)
        else:
            availability_M.append(0)
    return availability_M
Above is a function gathering the availability of a lifeguard and storing the hours a guard can work as a 1 in availability list or a 0 if they cannot. I return this list with the intent of adding it to a dictionary in the function below.
def guard_availability(guards, openhours_M, openhours_T, openhours_W,
                       openhours_R, openhours_F, openhours_S, openhours_Su):
    continueon = 1
    while continueon == 1:
        name = input('Input guards name of lifeguard to update availability' )
        availability = {}
        days = {}
        if openhours_M != 0:
            monday_availability(openhours_M)
        if openhours_T != 0:
            tuesday_availability(openhours_T)
        if openhours_W != 0:
            wednesday_availability(openhours_W)
        if openhours_R != 0:
            thursday_availability(openhours_R)
        if openhours_F != 0:
            friday_availability(openhours_F)
        if openhours_S != 0:
            saturday_availability(openhours_S)
        if openhours_Su != 0:
            sunday_availability(openhours_Su)
        days['Monday'] = availability_M
        days['Tuesday'] = availability_T
        days['Wednesday'] = availability_W
        days['Thursday'] = availability_R
        days['Friday'] = availability_F
        days['Saturday'] = availability_S
        days['Sunday'] = availability_Su
        availability[name]= days
        continueon = input('Enter 1 to add availability for another guard, 0 to stop: ')
    return days
When I run this code, I get an error saying my availability lists are undefined, even though I returned them in the functions above. Where is the error in my understanding of returning from functions, and how can I remedy this problem?
monday_availability(openhours_M) returns a value.
Returning a variable does not assign it to anything outside the scope of that function.
If you renamed return availability_M to return foo, and updated the other uses only within that function accordingly, would the error make more sense?
Now, actually capture the result:
availability_M = monday_availability(openhours_M)
Or even just
days['Monday'] = monday_availability(openhours_M)
Also, I don't see what that function has to do with Mondays specifically. Try to write DRY code.
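As a rough illustration of the DRY suggestion, a single function can serve every day. This is a sketch, not the asker's program: the names `availability_for_day` and `guard_week` are hypothetical, and the hours are passed in as arguments instead of being read with input() so the example can run on its own. It also sidesteps two problems in the original loop: `&` binds more tightly than the comparisons, so a chained comparison is used instead, and `hour` was never incremented, so a range is used instead of the while loop.

```python
def availability_for_day(open_hours, first_hour, last_hour):
    """Return one entry per open hour: 1 if the guard can work that hour, else 0."""
    return [1 if first_hour <= hour <= last_hour else 0
            for hour in range(1, open_hours + 1)]


def guard_week(open_hours_by_day, hours_by_day):
    """Build the per-day dictionary, capturing each returned list."""
    days = {}
    for day, open_hours in open_hours_by_day.items():
        if open_hours != 0:
            first_hour, last_hour = hours_by_day[day]
            days[day] = availability_for_day(open_hours, first_hour, last_hour)
    return days


# Made-up example: pool open 5 hours on Monday, closed on Tuesday.
week = guard_week({"Monday": 5, "Tuesday": 0}, {"Monday": (2, 4)})
print(week)
```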
You return the list from your function but don't assign it to any variable. You should do it like this:
if openhours_M != 0:
    availability_M=monday_availability(openhours_M)
if openhours_T != 0:
    availability_T=tuesday_availability(openhours_T)
if openhours_W != 0:
    availability_W=wednesday_availability(openhours_W)
if openhours_R != 0:
    availability_R=thursday_availability(openhours_R)
if openhours_F != 0:
    availability_F=friday_availability(openhours_F)
if openhours_S != 0:
    availability_S=saturday_availability(openhours_S)
if openhours_Su != 0:
    availability_Su=sunday_availability(openhours_Su)
