Passing a function name as a string + Pandas DataFrame + Azure Databricks - python

I have a function called changeUpper, and I want to call it on a given column of a Pandas DataFrame based on metadata definitions. For example, the metadata records that changeUpper should be called on the column whose name is stored in PrimaryColumn.
I wanted to do something dynamic like this:
for index, row in rulesPandas.iterrows():
    sourcePandas[row['PrimaryColumn']] = sourcePandas[row['PrimaryColumn']].apply(row['FunctionName'])
This throws the error: *changeUpper is an unknown string function*. That happens because apply resolves a string argument by looking it up among pandas'/NumPy's own named functions (like 'mean'), not in my namespace.
What I am doing right now instead is the below, which is not so flexible: whenever I add a new function, I have to add another if condition.
for index, row in rulesPandas.iterrows():
    if row['FunctionName'] == 'changeUpper':
        sourcePandas[row['PrimaryColumn']] = sourcePandas[row['PrimaryColumn']].apply(changeUpper)

I got this working as follows:
for index, row in rulesPandas.iterrows():
    func = eval(row['FunctionName'])
    newcolumn = row['NewColumnName']
    if newcolumn is not None:
        sourcePandas = sourcePandas.assign(**{f'{newcolumn}': sourcePandas[row['PrimaryColumn']].apply(func)})
    else:
        sourcePandas[row['PrimaryColumn']] = sourcePandas[row['PrimaryColumn']].apply(func)
What I am trying to achieve is to execute the assigned function against the given column, as defined in the metadata.
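For reference, a common alternative that avoids eval is an explicit registry mapping the names stored in the metadata to the function objects. This is a minimal sketch of that approach, not the original solution; the registry contents are an assumption, so register whatever functions your metadata refers to:

# assumed registry: metadata name -> function object
FUNCTIONS = {
    'changeUpper': changeUpper,
}

for index, row in rulesPandas.iterrows():
    func = FUNCTIONS[row['FunctionName']]  # a KeyError here flags an unknown name early
    newcolumn = row['NewColumnName']
    if newcolumn is not None:
        sourcePandas[newcolumn] = sourcePandas[row['PrimaryColumn']].apply(func)
    else:
        sourcePandas[row['PrimaryColumn']] = sourcePandas[row['PrimaryColumn']].apply(func)

Adding a new function then means adding one registry entry rather than a new if branch, and nothing outside the registry can be executed by name.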

Assigning the same value to all Pyspark column elements using WithColumn

Here is my code:
for s, sub_direct in enumerate(os.listdir(path_csv1)):
    for i, file in enumerate(glob.glob(path_csv1 + "/" + sub_direct + "/*.csv")):
        df_spa = spark.read.csv(file, header=True, sep=",")
        df_spa = df_spa.withColumn("Batt_id", sub_direct)
        # df = df.append(df_spa)
        df = df.union(df_spa)
Based on the value of sub_direct I want to set the column df_spa['Batt_id'].
I get an error and I could not understand how to solve it.
I know withColumn expects a column, but here I need to assign the same string to all the values of the column for a given folder.
Is it possible? lit did not work for me.
Use lit() when passing a variable:
from pyspark.sql import functions as F
df_spa = df_spa.withColumn("Batt_id", F.lit(sub_direct))
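Putting it together, the corrected loop would look roughly like this (a sketch; path_csv1, spark, and the initial df come from the question's setup):

import os
import glob
from pyspark.sql import functions as F

for s, sub_direct in enumerate(os.listdir(path_csv1)):
    for i, file in enumerate(glob.glob(path_csv1 + "/" + sub_direct + "/*.csv")):
        df_spa = spark.read.csv(file, header=True, sep=",")
        # F.lit() wraps the plain Python string in a Column expression,
        # which is what withColumn expects as its second argument
        df_spa = df_spa.withColumn("Batt_id", F.lit(sub_direct))
        df = df.union(df_spa)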

Beginner Python: how to update function to run multiple arguments through it

I have created a function that builds a pandas DataFrame with a new column combining the first/middle/last name of an employee. I then call the function with a value from the index (EmployeeID). I am able to run this function successfully for one employee, but I am having trouble updating it to run multiple EmployeeIDs at once. Let's say I wanted to run 3 employee IDs through the function. How would I update it to allow for that?
def getFullName(EmpID):
    df = pd.read_excel('Employees.xls', 'Sheet0', index_col='EmployeeID', usecols=['EmployeeID','FirstName','MiddleName','LastName'], na_values=[""])
    X = df[["FirstName","MiddleName","LastName"]]
    df['EmployeeName'] = X.fillna('').apply(lambda x: x.LastName+", "+x.FirstName+" "+str(x.MiddleName), axis=1)
    if EmpID in df.index:
        rec = df.loc[EmpID,'EmployeeName']
        print(rec)
    else:
        print("UNKNOWN")
In general, if you want an argument to be able to consist of one or more records, you can use a list or tuple to represent it.
In practice for this example, because Python is dynamically typed and because the .loc indexer of pandas DataFrames can also take a list of values, you don't have to change anything. Just pass a list of employee IDs as EmpID.
Without knowing what the EmpIDs look like, it is hard to give an example.
But you can try it out by calling your function with
getFullName(EmpID)
and with
getFullName([EmpID, EmpID])
The first call should print the record once and the second should print it twice. You can replace EmpID with any working ID (see df.index).
The documentation I linked above has some minimal examples to play around with.
PS: There is a bit of danger in passing a list to .loc. If you pass an EmpID that does not exist, pandas will currently only give a warning (in a future version it will raise a KeyError). For any unknown EmpID it will create a new row in the result with NaNs as values. From the documentation example:
df = pd.DataFrame([[1, 2], [4, 5], [7, 8]],
                  index=['cobra', 'viper', 'sidewinder'],
                  columns=['max_speed', 'shield'])
df.loc[['viper', 'sidewinder']]
Will return
            max_speed  shield
viper               4       5
sidewinder          7       8
Calling it with missing indices:
print(df.loc[['viper', 'does not exist']])
Will produce
                max_speed  shield
viper                 4.0     5.0
does not exist        NaN     NaN
You could add in an array of EmpIDs.
empID_list = [empID01, empID02, empID03]
Then you would need to use a for loop:
for empID in empID_list:
    doStuff()
Or you just use your function inside the for loop:
for empID in empID_list:
    getFullName(empID)
Let's say you have this list of employee IDs:
empIDs = [empID1, empID2, empID3]
You need to then pass this list as an argument instead of a single employee ID.
def getFullName(empIDs):
    df = pd.read_excel('Employees.xls', 'Sheet0', index_col='EmployeeID', usecols=['EmployeeID','FirstName','MiddleName','LastName'], na_values=[""])
    X = df[["FirstName","MiddleName","LastName"]]
    df['EmployeeName'] = X.fillna('').apply(lambda x: x.LastName+", "+x.FirstName+" "+str(x.MiddleName), axis=1)
    for EmpID in empIDs:
        if EmpID in df.index:
            rec = df.loc[EmpID,'EmployeeName']
            print(rec)
        else:
            print("UNKNOWN")
One way or another, the if EmpID in df.index: will need to be rewritten. I suggest you pass a list called employee_ids as the input, then do the following (the first two lines wrap a single ID in a list; they are only needed if you still want to be able to pass a single ID):
if not isinstance(employee_ids, list):
    employee_ids = [employee_ids]  # this ensures you can still pass single IDs
rec = df.reindex(employee_ids).EmployeeName.dropna()
In the old days, df.loc would accept missing labels and just not return anything, but in recent versions it raises an error. reindex will give you a row for every ID in employee_ids, with NaN as the value if the ID wasn't in the index. We therefore select the column EmployeeName and then drop the missing values with dropna.
Now, the only thing left to do is handle the output. The resulting Series has a (boolean) attribute called empty, which can be used to check whether any IDs were found. Otherwise we'll want to print the values of rec, which is a Series.
Thus:
def getFullName(employee_ids):
    df = pd.read_excel('Employees.xls', 'Sheet0', index_col='EmployeeID', usecols=['EmployeeID','FirstName','MiddleName','LastName'], na_values=[""])
    X = df[["FirstName","MiddleName","LastName"]]
    df['EmployeeName'] = X.fillna('').apply(lambda x: x.LastName+", "+x.FirstName+" "+str(x.MiddleName), axis=1)
    if not isinstance(employee_ids, list):
        employee_ids = [employee_ids]  # this ensures you can still pass single IDs
    rec = df.reindex(employee_ids).EmployeeName.dropna()
    if rec.empty:
        print("UNKNOWN")
    else:
        print(rec.values)
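To illustrate, you would call it with a hypothetical single ID or list of IDs (the actual values depend on your EmployeeID column):

getFullName(1001)           # a single ID still works thanks to the isinstance check
getFullName([1001, 1002])   # prints the names found; unknown IDs are dropped by dropna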
(As an aside, you may like to know that a Python convention is to use snake_case for function and variable names and CamelCase for class names.)

Why are my variables not accessible after a function?

I can't figure out why my function isn't applying the changes to the variables after I execute it, or why the variables aren't accessible after the function. I pass a dataframe to the function and tell it which column to compare. I want the function to add the matching values to the original dataframe and create a separate dataframe so I can see just the matches. When I run the code I can see the dataframe and the matching dataframe inside the function, but when I try to use the matching dataframe afterwards, Python doesn't recognize the variable as defined, and the original dataframe isn't modified when I look at it again. I've tried declaring them both as global variables at the beginning of the function, but that didn't work either.
def scorer_tester_function(dataframe, score_type, source, compare, limit_num):
    match = []
    match_index = []
    similarity = []
    org_index = []
    match_df = pd.DataFrame()
    for i in zip(source.index, source):
        position = list(source.index)
        print(str(position.index(i[0])) + " of " + str(len(position)))
        if pd.isnull(i[1]):
            org_index.append(i[0])
            match.append(np.nan)
            similarity.append(np.nan)
            match_index.append(np.nan)
        else:
            ratio = process.extract(i[1], compare, limit=limit_num,
                                    scorer=scorer_dict[score_type])
            org_index.append(i[0])
            match.append(ratio[0][0])
            similarity.append(ratio[0][1])
            match_index.append(ratio[0][2])
    match_df['org_index'] = pd.Series(org_index)
    match_df['match'] = pd.Series(match)
    match_df['match_index'] = pd.Series(match_index)
    match_df['match_score'] = pd.Series(similarity)
    match_df.set_index('org_index', inplace=True)
    dataframe = pd.concat([dataframe, match_df], axis=1)
    return match_df, dataframe
I'm calling the function like this:
scorer_tester_function(df_ven, 'WR', df_ven['Name 1'].sample(2), df_emp['Name 2'], 1)
My expectation is that I can access match_df and df_ven afterwards and further manipulate them, but after the call the original dataframe df_ven is unchanged and match_df raises a "variable not defined" error.
return doesn't inject local variables into the caller's scope; it makes the function call evaluate to their values.
If you write
a, b = scorer_tester_function(df_ven, 'WR', df_ven['Name 1'].sample(2), df_emp['Name 2'], 1)
then a will have the value of match_df from inside the function and b will have the value of dataframe, but the names match_df and dataframe go out of scope after the function returns; they do not exist outside of it.

How to define a variable amount of columns in python pandas apply

I am trying to add columns to a pandas DataFrame using the apply function. However, the number of columns to be added depends on the output of the function used inside apply.
example code:
number_of_columns_to_be_added = 2

def add_columns(number_of_columns_to_be_added):
    df['n1'], df['n2'] = zip(*df['input'].apply(lambda x: do_something(x, number_of_columns_to_be_added)))
Any idea how to define the ugly column part (df['n1'], ..., df['n696969']) before the = zip(... part programmatically?
I'm guessing that zip yields one tuple per new column, so you could try this:
temp = zip(*df['input'].apply(lambda x: do_something(x, number_of_columns_to_be_added)))
for i, value in enumerate(temp, 1):
    key = 'n' + str(i)
    df[key] = value
temp will hold all the entries; you then iterate over temp to assign the values to your DataFrame under the generated keys. Hope this matches your original idea.
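To make that concrete, here is a minimal self-contained sketch with a dummy do_something standing in for the real function from the question:

import pandas as pd

df = pd.DataFrame({'input': [1, 2, 3]})

def do_something(x, n):
    # dummy stand-in: returns n derived values per input row
    return tuple(x * 10 + k for k in range(n))

number_of_columns_to_be_added = 2
temp = zip(*df['input'].apply(lambda x: do_something(x, number_of_columns_to_be_added)))
for i, value in enumerate(temp, 1):
    df['n' + str(i)] = value  # creates columns n1, n2, ... as needed
print(df)

This prints the original input column plus the generated columns n1 and n2.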

For Looping error in pyspark

I am facing the following problem:
I have a list which I need to compare with the elements of a column in a dataframe (acc_name). I am using the following loop, but it only returns 1 record when it should give me 30.
Using PySpark.
bs_list = ['AC_E11','AC_E12','AC_E13','AC_E135','AC_E14','AC_E15','AC_E155','AC_E157',
           'AC_E16','AC_E163','AC_E165','AC_E17','AC_E175','AC_E180','AC_E185','AC_E215',
           'AC_E22','AC_E225','AC_E23','AC_E23112','AC_E235','AC_E245','AC_E258','AC_E25',
           'AC_E26','AC_E265','AC_E27','AC_E275','AC_E31','AC_E39','AC_E29']
for i in bs_list:
    bs_acc1 = (acc
               .filter(i == acc.acc_name)
               .select(acc.acc_name, acc.acc_description)
               )
The elements of bs_list are a subset of the acc_name column. I am trying to create a new DF with the following 2 columns, acc_name and acc_description, containing only the rows whose acc_name appears in bs_list.
Please let me know where I am going wrong.
That's because every time you filter on i in the loop, you create a brand-new dataframe and assign it to bs_acc1, overwriting the previous result. So you are left with only the rows for the last value in bs_list, i.e. 'AC_E29'.
One way to fix it is to repeatedly union the dataframe with itself, so previous results also remain in the dataframe, like this:
# create an empty dataframe; supply a schema appropriate to your data
bs_acc1 = sqlContext.createDataFrame(sc.emptyRDD(), schema)

for i in bs_list:
    bs_acc1 = bs_acc1.union(
        acc
        .filter(i == acc.acc_name)
        .select(acc.acc_name, acc.acc_description)
    )
A better way is to not loop at all:
from pyspark.sql.functions import *
bs_acc1 = acc.where(acc.acc_name.isin(bs_list))
You can also transform bs_list into a dataframe with a column acc_name and then just join it to the acc dataframe:
from pyspark.sql import Row

bs_rdd = spark.sparkContext.parallelize(bs_list)
bs_df = bs_rdd.map(lambda x: Row(**{'acc_name': x})).toDF()
bs_join_df = bs_df.join(acc, on='acc_name')
bs_join_df.show()
