I have a UDF which takes a key and returns the corresponding value from name_dict.
from pyspark.sql import *
from pyspark.sql.functions import udf, when, col
name_dict = {'James': "manager", 'Robert': 'director'}
func = udf(lambda name: name_dict[name])
The original dataframe: James and Robert are in the dict, but Michael is not.
data = [("James","M"),("Michael","M"),("Robert",None)]
test = spark.createDataFrame(data = data, schema = ['name', 'gender'])
test.show()
+-------+------+
| name|gender|
+-------+------+
| James| M|
|Michael| M|
| Robert| null|
+-------+------+
To prevent a KeyError, I use a when condition to filter the rows before applying the UDF, but it does not work.
test.withColumn('senior', when(col('name').isin(['James', 'Robert']), func(col('name'))).otherwise(col('gender'))).show()
PythonException: An exception was thrown from a UDF: 'KeyError:
'Michael'', from , line 8. Full traceback
below...
What is the cause of this and are there any feasible ways to solve this problem? Assume that not all the names are keys of the dictionary and for those that are not included, I would like to copy the value from another column, say gender here.
This is actually the documented behavior of user-defined functions in Spark. From the docs:
The user-defined functions do not support conditional expressions or
short circuiting in boolean expressions and it ends up with being
executed all internally. If the functions can fail on special rows,
the workaround is to incorporate the condition into the functions.
So in your case you need to rewrite your UDF as:
func = udf(lambda name: name_dict.get(name, "NA"))
Then calling it using:
test.withColumn('senior', func(col('name'))).show()
#+-------+------+--------+
#| name|gender| senior|
#+-------+------+--------+
#| James| M| manager|
#|Michael| M| NA|
#| Robert| null|director|
#+-------+------+--------+
However, in your case you can actually do this without a UDF at all, by using a map column:
from itertools import chain
from pyspark.sql.functions import col, create_map, lit
map_col = create_map(*[lit(x) for x in chain(*name_dict.items())])
test.withColumn('senior', map_col[col('name')]).show()
I have a Spark dataframe with a column emailID, e.g. ram.shyam.78uy#testing.com. I would like to extract the string between "." and "#", i.e. 78uy, and store it in a new column.
I tried:
split_for_alias = split(rs_csv['emailID'],'[.]')
rs_csv_alias= rs_csv.withColumn('alias',split_for_alias.getItem(size(split_for_alias) -2))
It adds 78uy#testing as the alias. I could add another column and chop off the extra values, but is it possible to do this in a single statement?
Extract the alphanumeric token immediately preceded by the special character . and immediately followed by the special character #.
DataFrame
data= [
(1,"am.shyam.78uy#testing.com"),
(2, "j.k.kilo#jom.com")
]
df=spark.createDataFrame(data, ("id",'emailID'))
df.show()
+---+--------------------+
| id| emailID|
+---+--------------------+
| 1|am.shyam.78uy#tes...|
| 2| j.k.kilo#jom.com|
+---+--------------------+
Code
from pyspark.sql.functions import regexp_extract

df.withColumn('name', regexp_extract('emailID', r'(?<=\.)(\w+)(?=#)', 1)).show()
Outcome
+---+--------------------+----+
| id| emailID|name|
+---+--------------------+----+
| 1|am.shyam.78uy#tes...|78uy|
| 2| j.k.kilo#jom.com|kilo|
+---+--------------------+----+
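The lookbehind/lookahead regex used above can be sanity-checked in plain Python with the standard re module (same pattern, no Spark needed):

```python
import re

# (?<=\.)  token must be immediately preceded by a literal '.'
# (\w+)    the alphanumeric token we want to capture
# (?=#)    token must be immediately followed by a literal '#'
pattern = r'(?<=\.)(\w+)(?=#)'

print(re.search(pattern, "am.shyam.78uy#testing.com").group(1))  # 78uy
print(re.search(pattern, "j.k.kilo#jom.com").group(1))           # kilo
```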
We made the Fugue project to port native Python or Pandas code to Spark or Dask. This lets you keep the logic very readable by expressing it in native Python. Fugue can then port it to Spark for you with one function call.
First, we set up a Pandas DataFrame to test:
import pandas as pd
df = pd.DataFrame({"id":[1,2],"email": ["am.shyam.78uy#testing.com", "j.k.kilo#jom.com"]})
Next, we make a native Python function. The logic is clear this way.
from typing import List, Dict, Any
def extract(df:List[Dict[str,Any]]) -> List[Dict[str,Any]]:
    for row in df:
        email = row["email"].split("#")[0].split(".")[-1]
        row["new_col"] = email
    return df
Then we can test on the Pandas engine:
from fugue import transform
transform(df, extract, schema="*, new_col:str")
Because it works, we can bring it to Spark by supplying an engine:
import fugue_spark
transform(df, extract, schema="*, new_col:str", engine="spark").show()
+---+--------------------+-------+
| id| email|new_col|
+---+--------------------+-------+
| 1|am.shyam.78uy#tes...| 78uy|
| 2| j.k.kilo#jom.com| kilo|
+---+--------------------+-------+
Note .show() is needed because Spark evaluates lazily. This transform can take in both Pandas and Spark DataFrames and will output a Spark DataFrame if using the Spark engine.
I am trying to get the last string after '/'.
The column can look like this: "lala/mae.da/rg1/zzzzz" (not necessary only 3 /), and I'd like to return: zzzzz
In SQL and Python it's very easy, but I would like to know if there is a way to do it in PySpark.
Solving it in Python:
original_string = "lala/mae.da/rg1/zzzzz"
last_char_index = original_string.rfind("/")
new_string = original_string[last_char_index+1:]
or directly:
new_string = original_string.rsplit('/', 1)[1]
And in SQL:
RIGHT(MyColumn, CHARINDEX('/', REVERSE(MyColumn))-1)
For PySpark I was thinking something like this:
df = df.select(col("MyColumn").rsplit('/', 1)[1])
but I get the following error: TypeError: 'Column' object is not callable and I am not even sure Spark allows me to do rsplit at all.
Do you have any suggestion on how can I solve this?
Adding another solution, even though Pav3k's answer is great: element_at, which gets the item at a specific position in an array:
from pyspark.sql import functions as F
df = df.withColumn('my_col_split', F.split(df['MyColumn'], '/'))\
.select('MyColumn',F.element_at(F.col('my_col_split'), -1).alias('rsplit')
)
>>> df.show(truncate=False)
+---------------------+------+
|MyColumn |rsplit|
+---------------------+------+
|lala/mae.da/rg1/zzzzz|zzzzz |
|fefe |fefe |
|fe/fe/frs/fs/fe32/4 |4 |
+---------------------+------+
Pav3k's DF used.
import pandas as pd
from pyspark.sql import functions as F
df = pd.DataFrame({"MyColumn": ["lala/mae.da/rg1/zzzzz", "fefe", "fe/fe/frs/fs/fe32/4"]})
df = spark.createDataFrame(df)
df.show(truncate=False)
# output
+---------------------+
|MyColumn |
+---------------------+
|lala/mae.da/rg1/zzzzz|
|fefe |
|fe/fe/frs/fs/fe32/4 |
+---------------------+
(
df
.withColumn("NewCol",
F.split("MyColumn", "/")
)
.withColumn("NewCol", F.col("NewCol")[F.size("NewCol") -1])
.show()
)
# output
+--------------------+------+
| MyColumn|NewCol|
+--------------------+------+
|lala/mae.da/rg1/z...| zzzzz|
| fefe| fefe|
| fe/fe/frs/fs/fe32/4| 4|
+--------------------+------+
Since Spark 2.4, you can use the split built-in function to split your string, then the element_at built-in function to get the last element of the resulting array, as follows:
from pyspark.sql import functions as F
df = df.select(F.element_at(F.split(F.col("MyColumn"), '/'), -1))
I do not completely understand when I need to use a lambda function in the definition of a UDF.
My prior understanding was that I needed lambda in order for the DataFrame to recognize that it has to iterate over each row, but I have seen many applications of UDFs without a lambda expression.
For example:
I have a silly function that works well like this without using lambda:
@udf("string")
def unknown_city(s, city):
    if s in ('KS', 'MI'):
        return 'Unknown'
    else:
        return city

display(
    df2.withColumn("new_city", unknown_city(col('geo.state'), col('geo.city')))
)
How can I make it work with lambda? Is it necessary?
A Python lambda is just another way of writing a function. The two examples below are essentially equivalent, except that a lambda is restricted to a single expression.
With lambda function
from pyspark.sql import functions as F
from pyspark.sql import types as T
df.withColumn('num+1', F.udf(lambda num: num + 1, T.IntegerType())('num')).show()
# +---+-----+
# |num|num+1|
# +---+-----+
# | 10| 11|
# | 20| 21|
# +---+-----+
With normal function
from pyspark.sql import functions as F
from pyspark.sql import types as T
def numplus2(num):
    return num + 2
df.withColumn('num+2', F.udf(numplus2, T.IntegerType())('num')).show()
# +---+-----+
# |num|num+2|
# +---+-----+
# | 10| 12|
# | 20| 22|
# +---+-----+
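The equivalence can be seen without Spark at all: a lambda and a def both produce ordinary function objects, so udf accepts either.

```python
# A lambda bound to a name and an equivalent named function are interchangeable.
plus_one_lambda = lambda num: num + 1

def plus_one_def(num):
    return num + 1

print(plus_one_lambda(10))  # 11
print(plus_one_def(10))     # 11
```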
I am importing data which has a date column in yyyy.MM.dd format. Missing values have been marked as 0000.00.00. This 0000.00.00 is treated differently depending on the function/module used to bring the data into the dataframe.
The .csv file looks like this:
2016.12.23,2016.12.23
0000.00.00,0000.00.00
Method 1: .csv()
from pyspark.sql.types import StructType, StructField, StringType, DateType

schema = StructType([
    StructField('date', StringType()),
    StructField('date1', DateType()),
])
df = spark.read.schema(schema)\
.format('csv')\
.option('header','false')\
.option('sep',',')\
.option('dateFormat','yyyy.MM.dd')\
.load(path+'file.csv')
df.show()
+----------+----------+
| date| date1|
+----------+----------+
|2016.12.23|2016-12-23|
|0000.00.00|0002-11-30|
+----------+----------+
Method 2: .to_date()
from pyspark.sql.functions import to_date, col
df = sqlContext.createDataFrame([('2016.12.23','2016.12.23'),('0000.00.00','0000.00.00')],['date','date1'])
df = df.withColumn('date1',to_date(col('date1'),'yyyy.MM.dd'))
df.show()
+----------+----------+
| date| date1|
+----------+----------+
|2016.12.23|2016-12-23|
|0000.00.00| null|
+----------+----------+
Question: Why do the two methods give different results? I would have expected to get null in both cases; instead, the first method gives 0002-11-30. Can anyone explain this anomaly?
I have created a UDF; however, I need to call a function within that UDF. It currently returns nulls. Could someone please explain why I am getting this result?
a= spark.createDataFrame([("A", 20), ("B", 30), ("D", 80)],["Letter", "distances"])
def get_number(num):
    return range(num)
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, FloatType

def cate(label):
    if label == 20:
        counting_list = get_number(4)
        return counting_list
    else:
        return [0]

udf_score = udf(cate, ArrayType(FloatType()))
a.withColumn("category_list", udf_score(a["distances"])).show(10)
out:
+------+---------+--------------------+
|Letter|distances| category_list|
+------+---------+--------------------+
| A| 20|[null, null, null...|
| B| 30| [null]|
| D| 80| [null]|
+------+---------+--------------------+
The return type declared for your UDF is not correct: cate returns an array of integers, not floats. Change:
udf_score=udf(cate, ArrayType(FloatType()))
to:
from pyspark.sql.types import ArrayType, IntegerType

udf_score = udf(cate, ArrayType(IntegerType()))
Hope this helps!
Edit: this assumes Python 2.x regarding range; as Shane Halloran mentions in the comments, range behaves differently in Python 3.x, where it returns a lazy range object rather than a list.
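In Python 3 a further fix may be needed (a sketch, not part of the original answer): range returns a lazy range object, and if the UDF keeps its FloatType element declaration the values must actually be floats, or Spark will serialize them as nulls. Converting explicitly handles both issues:

```python
def get_number(num):
    # Materialize the lazy range object into a concrete list (Python 3).
    return list(range(num))

def cate(label):
    # Return floats so the values match an ArrayType(FloatType()) declaration.
    if label == 20:
        return [float(n) for n in get_number(4)]
    else:
        return [0.0]
```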