I have created a UDF, but I need to call a function within that UDF. It currently returns nulls. Could someone please explain why I am getting this result?
a = spark.createDataFrame([("A", 20), ("B", 30), ("D", 80)], ["Letter", "distances"])

def get_number(num):
    return range(num)

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, FloatType

def cate(label):
    if label == 20:
        counting_list = get_number(4)
        return counting_list
    else:
        return [0]

udf_score = udf(cate, ArrayType(FloatType()))
a.withColumn("category_list", udf_score(a["distances"])).show(10)
out:
+------+---------+--------------------+
|Letter|distances| category_list|
+------+---------+--------------------+
| A| 20|[null, null, null...|
| B| 30| [null]|
| D| 80| [null]|
+------+---------+--------------------+
The return datatype declared for your udf is not correct, since cate returns an array of integers, not floats. Please change:
udf_score=udf(cate, ArrayType(FloatType()))
to:
udf_score=udf(cate, ArrayType(IntegerType()))
Hope this helps!
Edit: this assumes Python 2.x regarding range since, as @Shane Halloran mentions in the comments, range behaves differently in Python 3.x.
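For completeness, here is a sketch of the corrected snippet that also works on Python 3, using the same sample DataFrame a as in the question:

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, IntegerType

def get_number(num):
    # wrap in list() so the UDF returns an actual list on Python 3 too,
    # where range() is lazy
    return list(range(num))

def cate(label):
    if label == 20:
        return get_number(4)
    else:
        return [0]

udf_score = udf(cate, ArrayType(IntegerType()))
a.withColumn("category_list", udf_score(a["distances"])).show(10)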
I have a udf function which takes a key and returns the corresponding value from name_dict.
from pyspark.sql import *
from pyspark.sql.functions import udf, when, col
name_dict = {'James': "manager", 'Robert': 'director'}
func = udf(lambda name: name_dict[name])
The original dataframe: James and Robert are in the dict, but Michael is not.
data = [("James","M"),("Michael","M"),("Robert",None)]
test = spark.createDataFrame(data = data, schema = ['name', 'gender'])
test.show()
+-------+------+
| name|gender|
+-------+------+
| James| M|
|Michael| M|
| Robert| null|
+-------+------+
To prevent KeyError, I use the when condition to filter the rows before any operation, but it does not work.
test.withColumn('senior', when(col('name').isin(['James', 'Robert']), func(col('name'))).otherwise(col('gender'))).show()
PythonException: An exception was thrown from a UDF: 'KeyError:
'Michael'', from , line 8. Full traceback
below...
What is the cause of this and are there any feasible ways to solve this problem? Assume that not all the names are keys of the dictionary and for those that are not included, I would like to copy the value from another column, say gender here.
This is actually the expected behavior of user-defined functions in Spark. You can read it in the docs:
The user-defined functions do not support conditional expressions or
short circuiting in boolean expressions and it ends up with being
executed all internally. If the functions can fail on special rows,
the workaround is to incorporate the condition into the functions.
So in your case you need to rewrite your UDF as:
func = udf(lambda name: name_dict.get(name, "NA"))
Then calling it using:
test.withColumn('senior', func(col('name'))).show()
#+-------+------+--------+
#| name|gender| senior|
#+-------+------+--------+
#| James| M| manager|
#|Michael| M| NA|
#| Robert| null|director|
#+-------+------+--------+
However, in your case you can actually do this without a udf at all, by using a map column:
from itertools import chain
from pyspark.sql.functions import col, create_map, lit
map_col = create_map(*[lit(x) for x in chain(*name_dict.items())])
test.withColumn('senior', map_col[col('name')]).show()
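The question also asks to fall back to the gender column for names that are not dictionary keys; a small extension of the map-column approach (a sketch using coalesce) could look like this:

from itertools import chain
from pyspark.sql.functions import coalesce, col, create_map, lit

map_col = create_map(*[lit(x) for x in chain(*name_dict.items())])
# names missing from the map produce null, so fall back to the gender column
test.withColumn('senior', coalesce(map_col[col('name')], col('gender'))).show()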
I do not completely understand when I need to use a lambda function in the definition of a UDF.
My prior understanding was that I needed lambda for the DataFrame to iterate over each row, but I have seen many UDFs applied without a lambda expression.
For example:
I have a silly function that works well like this without using lambda:
#udf("string")
def unknown_city(s, city):
if s == 'KS' and 'MI':
return 'Unknown'
else:
return city
display(df2.
withColumn("new_city", unknown_city(col('geo.state'), col('geo.city')))
)
How can I make it work with lambda? Is it necessary?
A Python lambda is just another way to write a function. See the example code below and you will see they're pretty much the same, except that a lambda can only contain a single expression (one line of code).
With lambda function
from pyspark.sql import functions as F
from pyspark.sql import types as T
df.withColumn('num+1', F.udf(lambda num: num + 1, T.IntegerType())('num')).show()
# +---+-----+
# |num|num+1|
# +---+-----+
# | 10| 11|
# | 20| 21|
# +---+-----+
With normal function
from pyspark.sql import functions as F
from pyspark.sql import types as T
def numplus2(num):
return num + 2
df.withColumn('num+2', F.udf(numplus2, T.IntegerType())('num')).show()
# +---+-----+
# |num|num+2|
# +---+-----+
# | 10| 12|
# | 20| 22|
# +---+-----+
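For reference, pyspark.sql.functions.udf can also be used as a decorator, which is why the unknown_city example in the question needs no lambda at all. A minimal sketch, assuming the same df with an integer column num:

from pyspark.sql import functions as F
from pyspark.sql import types as T

@F.udf(T.IntegerType())
def numplus3(num):
    return num + 3

df.withColumn('num+3', numplus3('num')).show()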
I need to append a NumPy array into a PySpark Dataframe.
The result needs to be like this, adding the var38mc variable:
+----+------+-------------+-------+
| ID|TARGET| var38|var38mc|
+----+------+-------------+-------+
| 1.0| 0.0| 117310.9790| True|
| 3.0| 0.0| 39205.17000| False|
| 4.0| 0.0| 117310.9790| True|
+----+------+-------------+-------+
First, I computed a boolean array flagging the values that are approximately equal to 117310.979016494:
array_var38mc = np.isclose(train3.select("var38").rdd.flatMap(lambda x: x).collect(), 117310.979016494)
The output is a numpy.ndarray, like [True, False, True].
Next, I tried to append this NumPy array, computed from the data of this same PySpark DataFrame:
train4 = train3.withColumn('var38mc',col(df_var38mc))
But I got this error:
AttributeError: 'DataFrame' object has no attribute '_get_object_id'
P.S.: I tried converting the numpy array to a list, and also wrapping it in another PySpark DataFrame, without success.
Use a UDF instead:
import pyspark.sql.functions as F
from pyspark.sql.types import BooleanType
import numpy as np
func = F.udf(lambda x: bool(np.isclose(x, 117310.979016494)), BooleanType())
train4 = train3.withColumn('var38mc', func('var38'))
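If you prefer to avoid the Python UDF entirely, the np.isclose check can be approximated with built-in column expressions (a sketch mirroring np.isclose's default tolerances, rtol=1e-05 and atol=1e-08):

import pyspark.sql.functions as F

target = 117310.979016494
# |var38 - target| <= atol + rtol * |target|, as np.isclose does by default
train4 = train3.withColumn(
    'var38mc',
    F.abs(F.col('var38') - F.lit(target)) <= F.lit(1e-08 + 1e-05 * abs(target))
)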
I have a Spark dataframe which has a column 'X'. The column contains elements which are in the form:
u'[23,4,77,890,455,................]'
How can I convert this unicode string to a list? That is, my output should be
[23,4,77,890,455...................]
I have to apply this to each element of the 'X' column.
I have tried df.withColumn("X_new", ast.literal_eval(x)) and got the error
"Malformed String"
I also tried
df.withColumn("X_new", json.loads(x)) and got the error "Expected
String or Buffer"
and
df.withColumn("X_new", json.dumps(x)) which says JSON not
serialisable.
and also
df_2 = df.rdd.map(lambda x: x.encode('utf-8')) which says rdd has no
attribute encode.
I don't want to use collect() or toPandas() because they are memory-consuming (but if that's the only way, please do tell). I am using PySpark.
Update: cph_sto gave an answer using a UDF. Though it works well, I find that it is slow. Can somebody suggest any other method?
import ast
from pyspark.sql.functions import col, udf
values = [(u'[23,4,77,890.455]',10),(u'[11,2,50,1.11]',20),(u'[10.05,1,22.04]',30)]
df = sqlContext.createDataFrame(values,['list','A'])
df.show()
+-----------------+---+
| list| A|
+-----------------+---+
|[23,4,77,890.455]| 10|
| [11,2,50,1.11]| 20|
| [10.05,1,22.04]| 30|
+-----------------+---+
# Creating a UDF to convert the string list to proper list
string_list_to_list = udf(lambda row: ast.literal_eval(row))
df = df.withColumn('list',string_list_to_list(col('list')))
df.show()
+--------------------+---+
| list| A|
+--------------------+---+
|[23, 4, 77, 890.455]| 10|
| [11, 2, 50, 1.11]| 20|
| [10.05, 1, 22.04]| 30|
+--------------------+---+
Extension of the Q, as asked by OP -
# Creating a UDF to find length of resulting list.
length_list = udf(lambda row: len(row))
df = df.withColumn('length_list',length_list(col('list')))
df.show()
+--------------------+---+-----------+
| list| A|length_list|
+--------------------+---+-----------+
|[23, 4, 77, 890.455]| 10| 4|
| [11, 2, 50, 1.11]| 20| 4|
| [10.05, 1, 22.04]| 30| 3|
+--------------------+---+-----------+
Since it's a string, you could remove the first and last characters:
From '[23,4,77,890,455]' to '23,4,77,890,455'
Then apply the split() function to generate an array, taking , as the delimiter.
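A sketch of that idea with built-in functions only (assuming the column name 'list' from the example above; the final cast to array<double> is optional):

from pyspark.sql import functions as F

df = df.withColumn(
    'list',
    F.split(F.regexp_replace(F.col('list'), r'[\[\]]', ''), ',').cast('array<double>')
)
df.show()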
Please use the below code to ignore unicode
df.rdd.map(lambda x: x.encode("ascii","ignore"))
This question already has answers here: Spark SQL: apply aggregate functions to a list of columns (4 answers). Closed 4 years ago.
I have edited this question to provide an example -
I have a list of columns names :
colnames = ['col1','col2','col3']
I need to pass these to a DataFrame function one after another to return values for each. I am not using groupBy, so this is not a duplicate of the linked question. I just need the max, min, and sum of all values of each column in my DataFrame.
Code example -
from pyspark import SparkContext
from pyspark.sql import SQLContext
# max aliased to sparkMax to avoid shadowing Python's built-in max
from pyspark.sql.functions import max as sparkMax
sc = SparkContext("local[2]", "Count App")
sqlContext = SQLContext(sc)
df = sqlContext.createDataFrame(
[(1, 100, 200), (100, 200, 100), (100, 200, 100), (-100, 50, 200)],
("col1", "col2", "col3"))
df.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 1| 100| 200|
| 100| 200| 100|
| 100| 200| 100|
|-100| 50| 200|
+----+----+----+
colnames = ['col1','col2','col3']
maxval = map(lambda x: df.agg(sparkMax(df[x]).alias('max_of_{}'.format(x))), colnames)
## This gives me a list of Dataframes, NOT a single Dataframe as required
for x in maxval:
print (x.show())
+-----------+
|max_of_col1|
+-----------+
| 100|
+-----------+
None
+-----------+
|max_of_col2|
+-----------+
| 200|
+-----------+
None
+-----------+
|max_of_col3|
+-----------+
| 200|
+-----------+
How do I get a single DataFrame back instead of a list of DataFrames? Something like this:
+-----------+----+
|Column_name| Max|
+-----------+----+
|max_of_col1| 100|
|max_of_col2| 200|
|max_of_col3| 200|
+-----------+----+
I'm guessing something like a flatMap?
Appreciated.
The map function in Python takes 2 arguments: the first is a function and the second is an iterable.
newdf = map(lambda x: len(x), colnames)
This might be helpful - http://book.pythontips.com/en/latest/map_filter.html
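To get the single long-format DataFrame the question asks for, one sketch (assuming Spark 2.3+ for unionByName) builds one tiny aggregate per column and unions them:

from functools import reduce
from pyspark.sql import DataFrame
from pyspark.sql import functions as F

# one single-row DataFrame per column, then union them into one result
per_col = [
    df.agg(F.max(c).alias('Max')).withColumn('Column_name', F.lit('max_of_{}'.format(c)))
    for c in colnames
]
result = reduce(DataFrame.unionByName, per_col).select('Column_name', 'Max')
result.show()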
df.x will not work: df is an object, and df.x looks up an attribute of that object literally named x, not the column whose name is stored in the variable x.
Have a look at the following example.
obj = type("MyObj", (object,), {'name':1})
a = obj()
print a.name
Above example will print the value of the attribute name as 1.
However, if I try:
obj = type("MyObj", (object,), {'name': 1})
a = obj()
var = 'name'
print(a.var)
this is going to give me an AttributeError, as the object a does not have an attribute called var.
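When the attribute (or column) name is stored in a variable, a sketch of the dynamic lookup that does work:

obj = type("MyObj", (object,), {'name': 1})
a = obj()
var = 'name'
print(getattr(a, var))  # prints 1

# Likewise, for a Spark DataFrame, use df[x] or col(x) when the column name
# is held in a variable x, rather than df.x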