How to pass list to UserDefinedFunction (UDF) in pyspark

How to pass list to UserDefinedFunction (UDF) in pyspark - python

I need to pass a list as arguments for a certain UDF I have in pyspark. Example:
def cat(mine,mine2):
if mine is not None and mine2 is not None:
return "2_"+mine+"_"+mine2
udf_cat = UserDefinedFunction(cat, "string")
l = ["COLUMN1","COLUMN2"]
df = df.withColumn("NEW_COLUMN", udf_cat(l))
But I always get an error.

After a while, I figured out that all I need is to pass the list using the character '*' before it. Example:
df = df.withColumn("NEW_COLUMN", udf_cat(*l))
That way, it will work.

Related

Python Functions: Pass parameter as string and reference?

I have a knot in my head. I wasn't even sure what to google for. (Or how to formulate my title)
I want to do the following: I want to write a function that takes a term that occurs in the name of a .csv, but at the same time I want a df to be named after it.
Like so:
def read_data_into_df(name):
df_{name} = pd.read_csv(f"file_{name}.csv")
Of course the df_{name} part is not working. But I hope you get the idea.
Is this possible without hard coding?
Thanks!

IIUC, you can use globals :
def read_data_into_df(name):
globals()[f"df_{name}"] = pd.read_csv(f"file_{name}.csv")

If I were you I would create a dictionary and create keys with
dictionary = f"df_{name}: {whatever_you_want}"

If there are only a couple of dataframes, just accept the minimal code repetition:
def read_data_into_df(name):
return pd.read_csv(f"file_{name}.csv")
...
df_ham = read_data_into_df('ham')
df_spam = read_data_into_df('spam')
df_bacon = read_data_into_df('bacon')
...
# Use df_ham, df_spam and df_bacon
If there's a lot of them, or the exact data frames are generated, I would use a dictionary to keep track of the dataframes:
dataframes = {}
def read_data_into_df(name):
return pd.read_csv(f"file_{name}.csv")
...
for name in ['ham', 'spam', 'bacon']:
dataframes[name] = read_data_into_df('name')
...
# Use dataframes['ham'], dataframes['spam'] and dataframes['bacon']
# Or iterate over dataframes.values() or dataframes.items()!

Path inside the dictionary from the variable

I have a code:
def replaceJSONFilesList(JSONFilePath, JSONsDataPath, newJSONData):
JSONFileHandleOpen = open(JSONFilePath, 'r')
ReadedJSONObjects = json.load(JSONFileHandleOpen)
JSONFileHandleOpen.close()
ReadedJSONObjectsModifyingSector = ReadedJSONObjects[JSONsDataPath]
for newData in newJSONData:
ReadedJSONObjectsModifyingSector.append(newData)
JSONFileHandleWrite = open(JSONFilePath, 'w')
json.dump(ReadedJSONObjects, JSONFileHandleWrite)
JSONFileHandleWrite.close()
def modifyJSONFile(Path):
JSONFilePath = '/path/file'
JSONsDataPath = "['first']['second']"
newJSONData = 'somedata'
replaceJSONFilesList(JSONFilePath, JSONsDataPath, newJSONData)
Now I have an error:
KeyError: "['first']['second']"
But if I try:
ReadedJSONObjectsModifyingSector = ReadedJSONObjects['first']['second']
Everything is okay.
How I should send the path to the list from the JSON's dictionary — from one function to other?

You cannot pass language syntax elements as if they were data strings. Similarly, you could not pass the string "2 > 1 and False", and expect the function to be able to insert that into an if condition.
Instead, extract the data items and pass them as separate strings (which matches their syntax in the calling routine), or as a tuple of strings. For instance:
JSONsDataPath = ('first', 'second')
...
Then, inside the function ...
ReadedJSONObjects[JSONsDataPath[0]][JSONsDataPath[1]]
If you have a variable sequence of indices, then you need to write code to handle that case; research that on Stack Overflow.
The iterative way to handle an unknown quantity of indices is like this:
obj = ReadedJSONObjects
for index in JSONsDataPath:
obj = obj[index]

repeat function to extract similar information

Here is how I use pandas to open and read json file. I really appreciate the power of pandas :)
import pandas as pd
df = pd.read_json("https://datameetgeobk.s3.amazonaws.com/cftemplates/EyeOfCustomer.json")
def mytype(mydict):
try:
if mydict["Type"]:
return mydict["Type"]
except:
pass
df["myParametersType"] = df.Parameters.apply(lambda x: mytype(x))
The problem is that I need "Description" and "Default" values also along with "Type" strings. I have already written a function to extract types as mentioned above. Do I really need to write 2 more functions as shown below?
def mydescription(mydict):
try:
if mydict["Description"]:
return mydict["Description"]
except:
pass
def mydefault(mydict):
try:
if mydict["Default"]:
return mydict["Default"]
except:
pass
df["myParametersDescription"] = df.Parameters.apply(lambda x: mydescription(x))
df["myParametersDefault"] = df.Parameters.apply(lambda x: mydefault(x))
And how will I handle it if the dictionary contains more than 3 keys?
The final table should look something like this...
df.iloc[:, -3:].dropna(how="all")
myParametersType myParametersDescription myParametersDefault
pInstanceKeyName AWS::EC2::KeyPair::KeyName The name of the private key to use for SSH acc... None
pTwitterTermList String List of terms for twitter to listen to 'your', 'search', 'terms', 'here'
pTwitterLanguages String List of languages to use for the twitter strea... 'en'
pTwitterAuthConsumerKey String Consumer key for access twitter None
pTwitterAuthConsumerSecret String Consumer Secret for access twitter None
pTwitterAuthToken String Access Token Secret for calling twitter None
pTwitterAuthTokenSecret String Access Token Secret for calling twitter None
pApplicationName String Name of the application deploying for the EyeO... EyeOfCustomer
pVpcCIDR String Please enter the IP range (CIDR notation) for ... 10.193.0.0/16
pPublicSubnet1CIDR String Please enter the IP range (CIDR notation) for ... 10.193.10.0/24

You can pass new parameter to function:
def func(mydict, val):
try:
if mydict[val]:
return mydict[val]
except:
pass
df["myParametersType"] = df.Parameters.apply(lambda x: func(x, 'Type'))
df["myParametersDescription"] = df.Parameters.apply(lambda x: func(x, 'Description'))
df["myParametersDefault"] = df.Parameters.apply(lambda x: func(x, 'Default'))
df = df.iloc[:, -3:].dropna(how="all")

By making each row a pd.Series, you can create dataframe for every key and value in each dictionary.
like this :
get_df = df['Parameters'].apply(lambda x : pd.Series(x)).drop(0, axis=1) # NaN is colnam 0
get_df.columns = ['P_' + col for col in get_df.columns] # you already have 'Description' column
get_df.head()
So, just paste it(by using concat or something).
df[get_df.columns] = get_df
df.head()

Python: Running function to append values to an empty list returns no values

This is probably a very basic question but I haven't been able to figure this out.
I'm currently using the following to append values to an empty list
shoes = {'groups':['running','walking']}
df_shoes_group_names = pd.DataFrame(shoes)
shoes_group_name=[]
for type in df_shoes_group_names['groups']:
shoes_group_name.append(type)
shoes_group_name
['running', 'walking']
I'm trying to accomplish the same using a for loop, however, when I execute the loop the list comes back as blank
shoes_group_name=[]
def list_builder(dataframe_name):
if 'shoes' in dataframe_name:
for type in df_shoes_group_names['groups']:
shoes_group_name.append(type)
list_builder(df_shoes_group_names)
shoes_group_name
[]
Reason for the function is that eventually I'll have multiple DF's with different product's so i'd like to just have if statements within the function to handle the creation of each list
so for example future examples could look like this:
df_shoes_group_names
df_boots_group_names
df_sandals_group_names
shoes_group_name=[]
boots_group_name=[]
sandals_group_name=[]
def list_builder(dataframe_name):
if 'shoes' in dataframe_name:
for type in df_shoes_group_names['groups']:
shoes_group_name.append(type)
elif 'boots' in dataframe_name:
for type in df_boots_group_names['groups']:
boots_group_name.append(type)
elif 'sandals' in dataframe_name:
for type in df_sandals_group_names['groups']:
sandals_group_name.append(type)
list_builder(df_shoes_group_names)
list_builder(df_boots_group_names)
list_builder(df_sandals_group_names)
Not sure if I'm approaching this the right way so any advice would be appreciated.
Best,

You should never call or search a variable name as if it were a string.
Instead, use a dictionary to store a variable number of variables.
Bad practice
# dataframes
df_shoes_group_names = pd.DataFrame(...)
df_boots_group_names = pd.DataFrame(...)
df_sandals_group_names = pd.DataFrame(...)
def foo(x):
if shoes in df_shoes_group_names: # <-- THIS WILL NOT WORK
# do something with x
Good practice
# dataframes
df_shoes_group_names = pd.DataFrame(...)
df_boots_group_names = pd.DataFrame(...)
df_sandals_group_names = pd.DataFrame(...)
dfs = {'shoes': df_shoes_group_names,
'boots': df_boots_group_names,
'sandals': df_sandals_group_names}
def foo(key):
if 'shoes' in key: # <-- THIS WILL WORK
# do something with dfs[key]

Parameterized for loop in Python

I need to write a parameterized for loop.
# This works but...
df["ID"]=np_get_defined(df["methodA"+"ID"], df["methodB"+"ID"],df["methodC"+"ID"])
# I need a for loop as follows
df["ID"]=np_get_defined(df[sm+"ID"] for sm in strmethods)
and I get the following error:
ValueError: Length of values does not match length of index
Remaining definitions:
import numpy as np
df is a Pandas.DataFrame
strmethods=['methodA','methodB','methodC']
def get_defined(*args):
strs = [str(arg) for arg in args if not pd.isnull(arg) and 'N/A' not in str(arg) and arg!='0']
return ''.join(strs) if strs else None
np_get_defined = np.vectorize(get_defined)

df["ID"]=np_get_defined(df[sm+"ID"] for sm in strmethods) means you're passing a generator as single argument to the called method.
If you want to expand the generated sequence to a list of arguments use the * operator:
df["ID"] = np_get_defined(*(df[sm + "ID"] for sm in strmethods))
# or:
df["ID"] = np_get_defined(*[df[sm + "ID"] for sm in strmethods])
The first uses a generator and unpacks its elements, the second uses a list comprehension instead, the result will be the same in either case.

I think the reason why it doesn't work is that your DataFrame consists of columns with different lengths.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to pass list to UserDefinedFunction (UDF) in pyspark - python

After a while, I figured out that all I need is to pass the list using the character '' before it. Example: df = df.withColumn("NEW_COLUMN", udf_cat(l)) That way, it will work.

Related

Python Functions: Pass parameter as string and reference?

Path inside the dictionary from the variable

repeat function to extract similar information

Python: Running function to append values to an empty list returns no values

Parameterized for loop in Python

Categories

Resources

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to pass list to UserDefinedFunction (UDF) in pyspark - python

After a while, I figured out that all I need is to pass the list using the character '*' before it. Example: df = df.withColumn("NEW_COLUMN", udf_cat(*l)) That way, it will work.

Related

Python Functions: Pass parameter as string and reference?

Path inside the dictionary from the variable

repeat function to extract similar information

Python: Running function to append values to an empty list returns no values

Parameterized for loop in Python

Categories

Resources

After a while, I figured out that all I need is to pass the list using the character '' before it. Example: df = df.withColumn("NEW_COLUMN", udf_cat(l)) That way, it will work.