Why are my variables not accessible after a function? - python

I can't figure out why my function isn't applying its changes to the variables after I execute it, or why the variables aren't accessible after the function runs. I pass in a dataframe and tell the function which column to compare. I want the function to add the matching values to the original dataframe and to create a separate dataframe where I can see just the matches. When I run the code I can see the dataframe and the matching dataframe in the function's output, but when I try to use the matching dataframe afterwards, Python doesn't recognize the variable as defined, and the original dataframe isn't modified when I look at it again. I've tried declaring them both as global variables at the beginning of the function, but that didn't work either.
def scorer_tester_function(dataframe, score_type, source, compare, limit_num):
    match = []
    match_index = []
    similarity = []
    org_index = []
    match_df = pd.DataFrame()
    for i in zip(source.index, source):
        position = list(source.index)
        print(str(position.index(i[0])) + " of " + str(len(position)))
        if pd.isnull(i[1]):
            org_index.append(i[0])
            match.append(np.nan)
            similarity.append(np.nan)
            match_index.append(np.nan)
        else:
            ratio = process.extract(i[1], compare, limit=limit_num,
                                    scorer=scorer_dict[score_type])
            org_index.append(i[0])
            match.append(ratio[0][0])
            similarity.append(ratio[0][1])
            match_index.append(ratio[0][2])
    match_df['org_index'] = pd.Series(org_index)
    match_df['match'] = pd.Series(match)
    match_df['match_index'] = pd.Series(match_index)
    match_df['match_score'] = pd.Series(similarity)
    match_df.set_index('org_index', inplace=True)
    dataframe = pd.concat([dataframe, match_df], axis=1)
    return match_df, dataframe
I'm calling the function like this:
scorer_tester_function(df_ven, 'WR', df_ven['Name 1'].sample(2), df_emp['Name 2'], 1)
My expectation is that I can access match_df and df_ven afterwards and further manipulate them, but after the call the original dataframe df_ven is unchanged and referencing match_df raises a variable-not-defined error.

return doesn't inject local variables into the caller's scope; it makes the function call evaluate to their values.
If you write
a, b = scorer_tester_function(df_ven, 'WR', df_ven['Name 1'].sample(2), df_emp['Name 2'], 1)
then a will have the value of match_df from inside the function and b will have the value of dataframe, but the names match_df and dataframe go out of scope after the function returns; they do not exist outside of it.
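A minimal sketch of the same point in plain Python, without pandas. The function make_pair and the names first, second, doubled and tripled are made up for illustration; they are not part of the question's code.

```python
def make_pair(value):
    # Both names below are local to make_pair.
    doubled = value * 2
    tripled = value * 3
    return doubled, tripled


# The call evaluates to a tuple, but nothing keeps it, so it is discarded.
make_pair(5)

# Binding the returned tuple to names in the caller's scope keeps the values.
first, second = make_pair(5)

try:
    doubled  # the function's local name does not exist out here
    local_name_visible = True
except NameError:
    local_name_visible = False

print(first, second, local_name_visible)
```

The caller is free to choose any names on the left-hand side; they do not have to match the names used inside the function.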

Related

What argument is being passed to this function?

Considering a function (apply_this_function) that will be applied to a dataframe:
# our dataset
data = {"addresses": ['Newport Beach, California', 'New York City', 'London, England', 10001, 'Sydney, Au']}
# create a throw-away dataframe
df_throwaway = df.copy()
def apply_this_function(passed_row):
    passed_row['new_col'] = True
    passed_row['added'] = datetime.datetime.now()
    return passed_row
df_throwaway.apply(apply_this_function, axis=1) # axis=1 is important to use the row itself
In df_throwaway.apply(...), where does the function get the "passed_row" parameter from? In other words, what value is this function receiving? My assumption is that, given the structure of apply(), the function takes values from row i starting at 1?
I am referring to the information obtained here
When you apply a function to a DataFrame with axis=1, then
this function is called for each row from the source DataFrame
and by convention its parameter is called row.
In your case this function returns (from each call) the original
row (actually a Series object), with 2 new elements added.
Then apply method collects these rows, concatenates them
and the result is a DataFrame with 2 new columns.
You wrote "takes values from row i starting at 1". I would change it to
"takes values from each row".
Writing "starting at 1" can lead to misunderstandings, since when your
DataFrame has a default index, its values start from 0 (not from 1).
In addition, I would like to propose 2 corrections to your code:
1. Create your DataFrame passing data (your code sample does not
contain creation of df):
df_throwaway = pd.DataFrame(data)
2. Define your function as:
def apply_this_function(row):
    row['new_col'] = True
    row['added'] = pd.Timestamp.now()
    return row
i.e.:
- name the parameter as just row (everybody knows that this row
has been passed by apply method),
- instead of datetime.datetime.now() use pd.Timestamp.now(),
i.e. a native pandasonic type and its method.
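To see what apply actually hands to the function, here is a small sketch that records the type and index label of each argument it receives. The helper inspect_row and the list seen are hypothetical names introduced for this illustration, and the data is a shortened version of the question's addresses.

```python
import pandas as pd

df = pd.DataFrame(
    {"addresses": ["Newport Beach, California", "New York City", "London, England"]}
)

seen = []  # collects (type name, index label) for every call


def inspect_row(row):
    # With axis=1, `row` is a Series holding one row of df;
    # row.name is that row's index label.
    seen.append((type(row).__name__, row.name))
    return len(row["addresses"])


lengths = df.apply(inspect_row, axis=1)

print(seen)
print(lengths.tolist())
```

With a default index the recorded labels start at 0, which is the point made above about "starting at 1".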

Passing function name as string + Pandas DataFrame + Azure Databricks

I have a function called 'changeUpper', and I want to call this function on a given column of a Pandas DataFrame based on metadata definitions.
For example, in the metadata I record that the function changeUpper should be called on PrimaryColumn (this holds the name of the column).
I wanted to do something dynamic, like this:
for index, row in rulesPandas.iterrows():
    sourcePandas[row['PrimaryColumn']] = sourcePandas[row['PrimaryColumn']].apply(row['FunctionName'])
This throws the error: *changeUpper is an unknown string function*
Alternatively, what I am doing right now is shown below. It is not very flexible: whenever I add a new function, I have to add another if condition.
for index, row in rulesPandas.iterrows():
    if row['FunctionName'] == 'changeUpper':
        sourcePandas[row['PrimaryColumn']] = sourcePandas[row['PrimaryColumn']].apply(changeUpper)
I got this working as:
for index, row in rulesPandas.iterrows():
    func = eval(row['FunctionName'])
    newcolumn = row['NewColumnName']
    if newcolumn is not None:
        sourcePandas = sourcePandas.assign(**{f'{newcolumn}': sourcePandas[row['PrimaryColumn']].apply(func)})
    else:
        sourcePandas[row['PrimaryColumn']] = sourcePandas[row['PrimaryColumn']].apply(func)
What I am trying to achieve is to execute the assigned function against a given column, as defined in the metadata.
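One possible alternative to eval is an explicit dictionary that maps the names stored in the metadata to the functions themselves, so only registered functions can be called. The sketch below is an illustration under assumptions: the body of changeUpper, the second function changeLower, the FUNCTIONS registry and the sample data are all invented here, since the original post does not show them.

```python
import pandas as pd


def changeUpper(value):
    # Assumed behaviour: the original post does not show this function's body.
    return str(value).upper()


def changeLower(value):
    # Hypothetical second function, added only to show the lookup scaling.
    return str(value).lower()


# Explicit registry: metadata name -> callable.
FUNCTIONS = {"changeUpper": changeUpper, "changeLower": changeLower}

sourcePandas = pd.DataFrame({"city": ["London", "Sydney"], "code": ["Ab", "Cd"]})
rulesPandas = pd.DataFrame(
    {
        "PrimaryColumn": ["city", "code"],
        "FunctionName": ["changeUpper", "changeLower"],
    }
)

for _, rule in rulesPandas.iterrows():
    func = FUNCTIONS[rule["FunctionName"]]  # raises KeyError for unknown names
    column = rule["PrimaryColumn"]
    sourcePandas[column] = sourcePandas[column].apply(func)

print(sourcePandas)
```

Adding a new function then means adding one entry to the dictionary rather than another if branch, and an unknown name fails with a KeyError instead of evaluating arbitrary text.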

How to define a variable amount of columns in python pandas apply

I am trying to add columns to a python pandas df using the apply function.
However the number of columns to be added depend on the output of the function
used in the apply function.
example code:
number_of_columns_to_be_added = 2
def add_columns(number_of_columns_to_be_added):
    df['n1'],df['n2'] = zip(*df['input'].apply(lambda x : do_something(x, number_of_columns_to_be_added)))
Any idea on how to define the ugly column part (df['n1'], ..., df['n696969']) before the = zip( ... part programmatically?
I'm guessing that the output of zip is a tuple, therefore you could try this:
temp = zip(*df['input'].apply(lambda x : do_something(x, number_of_columns_to_be_added)))
for i, value in enumerate(temp, 1):
    key = 'n'+str(i)
    df[key] = value
temp will hold all the entries, and then you iterate over temp to assign each group of values to the DataFrame under its own key. Hope this matches your original idea.
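A different way to get the same result, sketched here as an assumption rather than as the answer's method, is to generate the column names in a list and build all new columns in one step. The do_something below is a hypothetical stand-in that returns as many values as requested; the real function from the question is not shown.

```python
import pandas as pd


def do_something(x, n):
    # Hypothetical stand-in: returns a tuple of n values for one input.
    return tuple(x * k for k in range(1, n + 1))


df = pd.DataFrame({"input": [1, 2, 3]})
number_of_columns_to_be_added = 3

# One tuple per row, each with number_of_columns_to_be_added entries.
results = df["input"].apply(
    lambda x: do_something(x, number_of_columns_to_be_added)
)

# Build the names n1..nN, then turn the tuples into columns in one go.
names = [f"n{i}" for i in range(1, number_of_columns_to_be_added + 1)]
new_columns = pd.DataFrame(results.tolist(), index=df.index, columns=names)
df = df.join(new_columns)

print(df)
```

Passing index=df.index keeps the new columns aligned with the original rows even when the index is not the default 0..n-1.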

Not deleting rows as part of function - python

Please keep in mind I am coming from an R background (and am quite a novice as well).
I am trying to create a UDF to format a data.frame df in Python, according to some defined rules. The first part deletes the first 4 rows of the data.frame and the second adds my desired column names. My function looks like this:
def dfFormatF(x):
    #Remove 4 first lines
    x = x.iloc[4:]
    #Assign column headers
    x.columns = ['Name1', 'Name2', 'Name3']
dfFormatF(df)
When I run it like this, it's not working (it neither drops the first rows nor renames the columns). When I remove the x=x.iloc[4:], the second part x.columns = ['Name1', 'Name2', 'Name3'] works properly and the columns are renamed. Additionally, if I run the removal outside the function, such as:
def dfFormatF(x):
    #Assign column headers
    x.columns = ['Name1', 'Name2', 'Name3']
df=df.iloc[4:]
dfFormatF(df)
before I call my function, I get the full expected result (first the removal of the first rows and then the desired column naming).
Any ideas as to why it is not working as part of the function, but it does outside of it?
Any help is much appreciated.
Thanks in advance.
The issue here is that the changes only exist inside the scope of dfFormatF(): x = x.iloc[4:] binds the local name x to a new DataFrame, so everything after that line, including the renaming, applies to that new object rather than to df. Once you exit that function, those changes are lost because you do not return the result and you do not assign the result to something in the module-level scope. It's worth taking a step back to understand this in a general sense (this is not a Pandas-specific thing).
Instead, pass your DF to the function, make the transformations you want to that DF, return the result, and then assign that result back to the name you passed to the function.
Note: This is a big thing in Pandas. What we emulate here is the inplace=True functionality. There are lots of things you can do to DataFrames, and many of those methods return a new object by default, so if you don't use inplace=True the changes will be lost unless you keep the result. If you stick with the default inplace=False then you must assign the result back to a variable (with the same or a different name, up to you).
import pandas as pd
starting_df = pd.DataFrame(range(10), columns=['test'])
def dfFormatF(x):
    #Remove 4 first lines
    x = x.iloc[4:]
    #Assign column headers
    x.columns = ['Name1']
    print('Inside the function')
    print(x.head())
    return x
dfFormatF(starting_df)
print('Outside the function')
print(starting_df) # Note, unchanged
# Take 2
starting_df = dfFormatF(starting_df)
print('Reassigning changes back')
print(starting_df.head())
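Since this is not specific to Pandas, the same behaviour can be sketched with a plain list. The function names drop_by_rebinding and drop_in_place are made up for this illustration. Rebinding the parameter inside the function leaves the caller's object alone, while modifying the object in place is visible to the caller.

```python
def drop_by_rebinding(items):
    # Slicing builds a new list and binds it to the local name only.
    items = items[4:]
    return items


def drop_in_place(items):
    # This changes the very list object the caller passed in.
    del items[:4]


numbers = list(range(10))

drop_by_rebinding(numbers)            # return value ignored
length_after_rebinding = len(numbers)  # the caller's list is untouched

kept = drop_by_rebinding(numbers)     # return value assigned: changes kept

drop_in_place(numbers)                # the caller's list is now shorter

print(length_after_rebinding, kept, numbers)
```

This matches what the question observed: assigning to x.columns modifies the object that was passed in, whereas x = x.iloc[4:] only changes what the local name x refers to.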

How to pass a local variable to a user-defined function in pandas?

I am processing a pandas dataframe df. Depending on certain conditions within df, I want to pass df (1) either to the user defined function filter() or (2) to calculate_Result().
The code for the condition is:
if df.Col1.str.contains("yes").sum() > 0:
    df = filter_df
    filter() # Loop
else:
    df = calculate_df
    calculate_result() # End
However, I am getting an UnboundLocalError: local variable referenced before assignment.
I keep getting this error even when I put the df = calculate_df and df = filter_df assignments in the function definitions themselves:
def filter():
    df = filter_df
    ...
and
def calculate_results():
    df = calculate_df()
    ...
No matter where I put the assignments, I get UnboundLocalError: local variable referenced before assignment.
How to correctly pass a local variable to a user defined function in pandas?
This isn't to do with passing a variable to a function.
You have used the name df to refer both to a module-level (global) variable, in df.Col1..., and to a local variable that is assigned the result of filter_df. That causes the error you see.
Use a different name for the local variable.
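The mechanism can be reproduced without pandas. Python decides at compile time that any name assigned anywhere in a function body is local to that function, so reading it before the assignment fails even if a module-level variable with the same name exists. In this sketch, records, broken and fixed are hypothetical names chosen for the illustration.

```python
records = [1, 2, 3]  # module-level name


def broken():
    # `records` is assigned below, so it is local for the whole function;
    # reading it here happens before that local name has a value.
    total = sum(records)
    records = [total]
    return records


def fixed(values):
    # The data comes in as a parameter and the result gets its own name.
    total = sum(values)
    result = [total]
    return result


try:
    broken()
    error_name = None
except UnboundLocalError as exc:
    error_name = type(exc).__name__

print(error_name, fixed(records), records)
```

Passing the data in as an argument and returning the result, as in fixed, avoids the name clash and also leaves the module-level variable unchanged.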
