Curried UDF - PySpark - Python

I am trying to implement a UDF in Spark that can take both a literal and a column as an argument. To achieve this, I believe I can use a curried UDF.
The function is used to match a string literal to each value in the column of a DataFrame. I have summarized the code below:
def matching(match_string_1):
    def matching_inner(match_string_2):
        return difflib.SequenceMatcher(None, match_string_1, match_string_2).ratio()
    return matching

hc.udf.register("matching", matching)
matching_udf = F.udf(matching, StringType())

df_matched = df.withColumn("matching_score", matching_udf(lit("match_string"))(df.column))
"match_string" is actually a value assigned to a list which I am iterating over.
Unfortunately this is not working as I had hoped; and I am receiving
"TypeError: 'Column' object is not callable".
I believe I am not calling this function correctly.

It should be something like this:
def matching(match_string_1):
    def matching_inner(match_string_2):
        return difflib.SequenceMatcher(
            a=match_string_1, b=match_string_2).ratio()
    # Here create the udf. ratio() returns a float,
    # so DoubleType is the appropriate return type.
    return F.udf(matching_inner, DoubleType())

df.withColumn("matching_score", matching("match_string")(df.column))
If you want to support a Column argument for match_string_1, you'll have to rewrite it like this:
def matching(match_string_1):
    def matching_inner(match_string_2):
        return F.udf(
            lambda a, b: difflib.SequenceMatcher(a=a, b=b).ratio(),
            DoubleType())(match_string_1, match_string_2)
    return matching_inner

df.withColumn("matching_score", matching(F.lit("match_string"))(df.column))
Your current code doesn't work because matching_udf is a UDF, and matching_udf(lit("match_string")) creates a Column expression instead of calling the inner function.
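For completeness, here is a minimal runnable sketch of the first variant (the SparkSession, the toy DataFrame, and the column name are assumptions for illustration):

import difflib

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("match_string",), ("mtch_strng",)], ["column"])

def matching(match_string_1):
    def matching_inner(match_string_2):
        return difflib.SequenceMatcher(a=match_string_1, b=match_string_2).ratio()
    # matching(...) returns a UDF object, which is then applied to a column
    return F.udf(matching_inner, DoubleType())

df.withColumn("matching_score", matching("match_string")(df["column"])).show()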

Related

Vaex - NameError: Column or variable 'example_string' does not exist

I've recently started to use vaex for its great potential on large sets of data.
I'm trying to apply the following function:
from typing import List

def get_columns(v: str, table_columns: List, pref: str = '', suff: str = '') -> List:
    return [table_columns.index(i) for i in table_columns if (pref + v + suff) in i][0]
to a df as follows:
df["column_day"] = df.apply(get_columns, arguments=[df.part_day, table.columns.tolist(), "total_4wk_"])
but I get the error when I run df["column_day"]:
NameError: Column or variable 'total_4wk_' does not exist.
I do not understand what I am doing wrong, since other functions (with only one argument) I used with apply worked fine.
Thanks.
I believe vaex expects the arguments passed to apply to actually be expressions.
In your case table.columns.tolist() and "total_4wk_" are not expressions, so it complains. I would rewrite your get_columns function so that it only takes expressions as arguments, and I believe that will work.
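One way to do that is to close over the non-expression values so that apply only receives expressions. A rough sketch under that assumption (the -1 fallback value is invented for illustration):

table_cols = table.columns.tolist()  # captured once, outside the expression

def get_column_index(v):
    # only `v` comes from the dataframe; the column list and prefix are closed over
    matches = [table_cols.index(i) for i in table_cols if ("total_4wk_" + str(v)) in i]
    return matches[0] if matches else -1  # -1 fallback is an assumption

df["column_day"] = df.apply(get_column_index, arguments=[df.part_day])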

Replace values in expression and perform operations on the replaced values in python format()

I have an expression like 'a{year}-b{year-1}'. The value of 'year' can be any integer, and I want to replace 'year' in the expression with whatever value of 'year' I have.
So when I tried to do
expression = 'a{year}-b{year-1}'
new_exp = expression.format(year=2001)
It's giving me KeyError: 'year-1'
Is it possible to get 'a2001-b2000'?
I wrote a small class with a format function that can do this (note that it shadows the built-in format rather than extending it):
class expression:
    @staticmethod
    def format(year):
        return "a{}-b{}".format(year, year - 1)

print(expression.format(year=2001))
Or you can use this, passing only one argument:
expression = lambda year: "a{}-b{}".format(year, year - 1)
print(expression(year=2001))
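As an aside, if the template is known when you write the code, an f-string (Python 3.6+) can evaluate the expression directly:

year = 2001
print(f"a{year}-b{year - 1}")  # a2001-b2000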

AttributeError: 'Series' object has no attribute 'upper'

I'm new to python and pandas but I have a problem I cannot wrap my head around.
I'm trying to add a new column to my DataFrame. To achieve that I use the assign() function.
Most of the examples on the internet are painfully trivial and I cannot find a solution for my problem.
What works:
my_dataset.assign(new_col=lambda x: my_custom_long_function(x['long_column']))

def my_custom_long_function(input):
    return input * 2
What doesn't work:
my_dataset.assign(new_col=lambda x: my_custom_string_function(x['string_column']))

def my_custom_string_function(input):
    return input.upper()
What confuses me is that in the debug I can see that even for my_custom_long_function the parameter is a Series, not a long.
I just want to use the lambda function and pass a value of the column to do my already written complicated functions. How do I do this?
Edit: The example here is just for demonstration purposes; the real code is an existing complex function that does not care about pandas types and needs a str as a parameter.
Because the column doesn't have an upper method, in order to use it you need str.upper:
my_dataset.assign(new_col=lambda x: my_custom_string_function(x['string_column']))

def my_custom_string_function(input):
    return input.str.upper()
That said, for efficiency I would use:
my_dataset['new column'] = my_dataset['string_column'].str.upper()
Edit: to pass each value to the existing function as a plain str, use apply:
my_dataset['new column'] = my_dataset['string_column'].apply(lambda x: my_custom_string_function(x))

def my_custom_string_function(input):
    return input.upper()
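A minimal runnable sketch of that approach (the toy data is an assumption):

import pandas as pd

def my_custom_string_function(value):
    # stands in for the existing complex function that expects a plain str
    return value.upper()

my_dataset = pd.DataFrame({"string_column": ["foo", "bar"]})
my_dataset["new_col"] = my_dataset["string_column"].apply(my_custom_string_function)
print(my_dataset)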

Using length of string in each row of column of df as argument in a function

I am having some serious trouble with this! Suppose I have a pandas dataframe like the following:
Name   LeftString  RightString
nameA  AATCGCTGCG  TGCTGCTGCTT
nameB  GTCGTGBAGB  BTGHTAGCGTB
nameC  ABCTHJKLAA  BFTCHHFCTSH
....
I have a function that takes the following as arguments:
def localAlign(minAlignment, names, string1, string2):
    # do something great
In my function, minAlignment is an integer; names, string1, and string2 are dataframe columns used as list objects by the function.
I then call the function at a later point:
left1_2_compare = localAlign(12, df['Name'], df['LeftString'], df['RightString'])
My function runs with no issues, but the 12 is passed in as a hard-coded value (or as a sys argument). What I would rather pass is a value that is 60% of the length of each string in df['LeftString'].
So what I have tried in regards to this is to pass in a calculation that would return an int to the function argument:
left1_2_compare = localAlign((int(len(df['LeftString']) * 0.6)),
                             df['Name'], df['LeftString'],
                             df['RightString'])
The interesting part is that the code doesn't fail or return errors; it just doesn't output anything for that value (the output file is blank for this part). The rest of the output is produced correctly.
Since the df has been defined before the function is called, is there a way to use the length of the string in rows 1...n as the input integer for the function without defining it inside the function?
You need a Series created by str.len, multiplied by mul, and cast to integers by astype. (Note that len(df['LeftString']) in your attempt is the number of rows in the column, not the per-row string lengths, which is likely why that call produced no output.)
left1_2_compare = localAlign(df['LeftString'].str.len().mul(.6).astype(int),
                             df['Name'],
                             df['LeftString'],
                             df['RightString'])
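A quick illustration of what that expression produces, using two rows copied from the question:

import pandas as pd

df = pd.DataFrame({"LeftString": ["AATCGCTGCG", "GTCGTGBAGB"]})
print(df["LeftString"].str.len().mul(0.6).astype(int).tolist())  # [6, 6]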

Getting PicklingError: Can't pickle <type 'function'>: attribute lookup __builtin__.function failed in pyspark when calling UDF

I am using PySpark 2.0. I am getting a pickling error for the code below:
from pyspark.sql.types import *
from datetime import datetime
from pyspark.sql.functions import udf

def getTime():
    timevalue = datetime.now()
    return timevalue

spark.udf.register('Getday', getTime, TimestampType())

def datetostring_conv(datevalue):
    stringvalue = datevalue.strftime('%Y-%m-%d')
    print stringvalue
    intstring = stringvalue[0:4] + stringvalue[5:7] + stringvalue[8:10]
    return intstring

spark.udf.register('IntString', lambda(x): datetostring_conv, StringType())
Up to this point, when I call
spark.sql("select date_add(Getday(),-1) as stringtime").show()
I get the previous day's value as a date type. But when I try to convert it into a string without the '-' (which is the IntString function's job), I get the pickling error:
spark.sql("select IntString(date_add(GetDay(),1)) as stringvalue").show()
How can I solve this error?
Thanks in advance.
Either call the function:
spark.udf.register('IntString', lambda x: datetostring_conv(x), StringType())
or pass the function:
spark.udf.register('IntString', datetostring_conv, StringType())
When you use:
lambda x: datetostring_conv
you pass a unary function which returns a function:
type((lambda x: datetostring_conv)(datetime.now()))
# function
hence the exception.
Of course there is no need for a UDF here (note the lowercase 'yyyy'; uppercase 'YYYY' is the week-based year and can give wrong results around year boundaries):
spark.sql("SELECT date_format(date_add(current_date(), -1), 'yyyyMMdd')")
Notes:
You shouldn't use parentheses around the argument list of a lambda expression. They have no effect with a single argument; with more than one argument they have a special meaning in Python 2 (tuple argument unpacking) and are not supported at all in Python 3.
I got this error because I didn't include the argument when registering the function:
def find_thresh(wav):
    [some code returning int]

convertUDF = udf(lambda z: find_thresh, IntegerType())
when it should have been:
convertUDF = udf(lambda z: find_thresh(z), IntegerType())
Just in case it helps.
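For reference, a minimal sketch of the corrected registration and call from the original question (Python 3 syntax; the SparkSession `spark` is assumed):

from pyspark.sql.types import StringType

def datetostring_conv(datevalue):
    # date_add produces a date, which arrives here as a datetime.date
    return datevalue.strftime('%Y%m%d')

spark.udf.register('IntString', datetostring_conv, StringType())
spark.sql("SELECT IntString(date_add(current_date(), -1)) AS stringvalue").show()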
