Pyspark Dataframe Join using UDF - python

I'm trying to create a custom join for two dataframes (df1 and df2) in PySpark (similar to this), with code that looks like this:
my_join_udf = udf(lambda x, y: isJoin(x, y), BooleanType())
my_join_df = df1.join(df2, my_join_udf(df1.col_a, df2.col_b))
The error message I'm getting is:
java.lang.RuntimeException: Invalid PythonUDF PythonUDF#<lambda>(col_a#17,col_b#0), requires attributes from more than one child
Is there a way to write a PySpark UDF that can process columns from two separate dataframes?

Spark 2.2+
You have to use crossJoin or enable cross joins in the configuration:
df1.crossJoin(df2).where(my_join_udf(df1.col_a, df2.col_b))
Spark 2.0, 2.1
The method shown below no longer works in Spark 2.x. See SPARK-19728.
Spark 1.x
Theoretically you can join and filter:
df1.join(df2).where(my_join_udf(df1.col_a, df2.col_b))
but in general you shouldn't do it at all. Any join that is not based on equality requires a full Cartesian product (just like this answer), which is rarely acceptable (see also Why using a UDF in a SQL query leads to cartesian product?).
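For completeness, here is a minimal sketch of the Spark 2.2+ approach, assuming isJoin is just a placeholder for your real predicate over the two columns:

from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

# Placeholder predicate -- replace with the real join condition
def isJoin(a, b):
    return a == b

my_join_udf = udf(isJoin, BooleanType())

# Explicit cross join (Spark 2.2+), then filter with the UDF
joined = df1.crossJoin(df2).where(my_join_udf(df1.col_a, df2.col_b))

# Alternatively, allow implicit cross joins and keep the plain join + where:
# spark.conf.set("spark.sql.crossJoin.enabled", "true")
# joined = df1.join(df2).where(my_join_udf(df1.col_a, df2.col_b))

Either way, the physical plan is a Cartesian product filtered by the UDF, so expect it to be expensive.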

Related

Using pandas functions with Pyspark

I'm trying to rewrite my Python (Pandas) script with PySpark, but I can't find a way to apply my Pandas functions more efficiently with PySpark functions.
My functions are the following:
def decompose_id(id_flight):
    my_id = id_flight.split("_")
    Esn = my_id[0]
    Year = my_id[3][0:4]
    Month = my_id[3][4:6]
    return Esn, Year, Month

def reverse_string(string):
    stringlength = len(string)  # calculate length of the string
    slicedString = string[stringlength::-1]  # slicing
    return slicedString
I would like to apply the first function to a column of a DataFrame (in Pandas I get a row of three elements).
The second function is used when a condition on a column of a DataFrame is met.
Is there a way to apply them using PySpark DataFrames?
You could apply these functions as UDFs to a Spark column, but that is not very efficient. Here are the built-in functions you need to perform your task:
reverse: use it to replace your function reverse_string
split: use it to replace my_id = id_flight.split("_")
getItem: use it to get an item from the split list, e.g. my_id[3]
substr: use it to replace the Python slicing [0:4]
Just combine these Spark functions to recreate the same behavior, as sketched below.
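For illustration, here is a hedged sketch of how these built-ins could replace decompose_id and reverse_string, assuming the DataFrame is called df, the flight id lives in a column named id_flight, and some_condition is a boolean column standing in for whatever condition triggers the reversal (all three names are assumptions):

from pyspark.sql import functions as F

# Split the id once and reuse the resulting array column
parts = F.split(F.col("id_flight"), "_")

df = (df
      .withColumn("Esn", parts.getItem(0))                 # my_id[0]
      .withColumn("Year", parts.getItem(3).substr(1, 4))   # my_id[3][0:4]
      .withColumn("Month", parts.getItem(3).substr(5, 2))) # my_id[3][4:6]

# reverse_string equivalent, applied only where the condition holds
df = df.withColumn(
    "string_column",
    F.when(F.col("some_condition"), F.reverse(F.col("string_column")))
     .otherwise(F.col("string_column")))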
If you want to leverage pandas functionality, one way would be to use the Pandas API together with groupBy.
It gives you a way to treat each of your groupBy groups as a pandas DataFrame on which you can implement your functions.
However, since this is Spark, schema enforcement is pretty much mandatory, as you will see when you go through the examples provided in the link as well.
An implementation example can be found here, and a short sketch follows below.
For trivial tasks, like reversing a string, opt for the built-in Spark functions; otherwise use UDFs.
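If you do go the pandas route, a rough sketch of the grouped Pandas API mentioned above might look like the following (the group key Esn, the column names, and the schema are assumptions, and the sketch assumes df has only the Esn and string_column columns; applyInPandas requires Spark 3.x, while on Spark 2.x the equivalent is a GROUPED_MAP pandas UDF):

import pandas as pd

def add_reversed(pdf: pd.DataFrame) -> pd.DataFrame:
    # Runs as plain pandas on each group
    pdf["reversed"] = pdf["string_column"].apply(lambda s: s[::-1])
    return pdf

result = (df
          .groupBy("Esn")
          .applyInPandas(add_reversed,
                         schema="Esn string, string_column string, reversed string"))

The schema must list every column the function returns, which is the schema enforcement mentioned above.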

How to JOIN hundreds of Spark dataframes efficiently?

TL;DR;
“What is the most optimal way of joining thousands of Spark dataframes? Can we parallelize this join? Since both do not work for me.”
I am trying to join thousands of single-column dataframes (with a PK column for the join) and then persist the result DF to Snowflake.
The join operation, looping over around 400 (5m x 2) such dataframes, takes over 3 hours to complete on a standalone 32-core/350 GB Spark cluster, which I think shouldn't matter because of the pushdown. After all, shouldn't building the DAG for lazy evaluation be all that Spark is doing at this point?
Here is my Spark config:
spark = SparkSession \
    .builder \
    .appName("JoinTest") \
    .config("spark.master", "spark://localhost:7077") \
    .config("spark.ui.port", 8050) \
    .config("spark.jars", "../drivers/spark-snowflake_2.11-2.5.2-spark_2.4.jar,../drivers/snowflake-jdbc-3.9.1.jar") \
    .config("spark.driver.memory", "100g") \
    .config("spark.driver.maxResultSize", 0) \
    .config("spark.executor.memory", "64g") \
    .config("spark.executor.instances", "6") \
    .config("spark.executor.cores", "4") \
    .config("spark.cores.max", "32") \
    .getOrCreate()
And the JOIN loop:
def combine_spark_results(results, joinKey):
    # Extract first to get going
    # TODO: validations
    resultsDF = results[0]
    i = len(results)
    print("Joining Spark DFs..")
    for result in results[1:]:
        print(i, end=" ", flush=True)
        i -= 1
        resultsDF = resultsDF.join(result, joinKey, 'outer')
    return resultsDF
I considered parallelizing the joins in a merge-sort fashion using starmapasync(); however, the issue is that a Spark DF cannot be returned from another thread. I also considered broadcasting the main dataframe from which all the joinable single-column dataframes were created,
spark.sparkContext.broadcast(data)
but that throws the same error as attempting to return a joined DF from another thread, viz.
PicklingError: Could not serialize broadcast: Py4JError: An error occurred while calling o3897.__getstate__. Trace: py4j.Py4JException: Method __getstate__([]) does not exist
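For clarity, here is a rough sketch of the kind of merge-sort style pairing I have in mind (shown single-threaded, with the same results list and joinKey as above); each round halves the number of frames, so the lineage depth grows as log2(n) rather than n:

def tree_join(dfs, join_key):
    # Pairwise joins per round instead of one long chain
    while len(dfs) > 1:
        nxt = [dfs[i].join(dfs[i + 1], join_key, 'outer')
               for i in range(0, len(dfs) - 1, 2)]
        if len(dfs) % 2 == 1:  # carry the odd frame over to the next round
            nxt.append(dfs[-1])
        dfs = nxt
    return dfs[0]

# resultsDF = tree_join(results, joinKey)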
How can I solve this problem?
Please feel free to ask if you need any more info. Thanks in advance.

Pandas parallel apply with koalas (pyspark)

I'm new to Koalas (PySpark), and I was trying to use Koalas for a parallel apply, but it seemed to use a single core for the whole operation (correct me if I'm wrong), so I ended up using dask for the parallel apply (via map_partition), which worked pretty well.
However, I would like to know if there's a way to utilize Koalas for parallel apply.
I used basic codes for operation like below.
import pandas as pd
import databricks.koalas as ks
my_big_data = ks.read_parquet('my_big_file') # file is single partitioned parquet file
my_big_data['new_column'] = my_big_data['string_column'].apply(my_prep) # my_prep does string operations
my_big_data.to_parquet('my_big_file_modified') # trigger execution, since Koalas evaluates lazily
I found a link that discusses this problem: https://github.com/databricks/koalas/issues/1280
If the number of rows the function is applied to is less than 1,000 (the default value), a plain pandas DataFrame is used to do the operation.
The user-defined function my_prep above is applied to each row, so single-core pandas was being used.
To force it to work in a PySpark (parallel) manner, you should modify the configuration as below.
import databricks.koalas as ks
ks.set_option('compute.default_index_type', 'distributed')  # use when .head() calls are too slow
ks.set_option('compute.shortcut_limit', 1)  # make Koalas go through PySpark instead of the pandas shortcut
Also, explicitly specifying the return type (type hint) in the user-defined function prevents Koalas from taking the shortcut path and makes the apply run in parallel.
def my_prep(row) -> str:
    return row

kdf['my_column'].apply(my_prep)
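Putting the two options and the type hint together with the pipeline from the question, a rough end-to-end sketch (still using the databricks.koalas namespace, as in the question; the reversal inside my_prep is just a stand-in for the real string operation):

import databricks.koalas as ks

ks.set_option('compute.default_index_type', 'distributed')
ks.set_option('compute.shortcut_limit', 1)

def my_prep(s) -> str:
    # stand-in string operation -- replace with the real preprocessing
    return s[::-1]

kdf = ks.read_parquet('my_big_file')
kdf['new_column'] = kdf['string_column'].apply(my_prep)
kdf.to_parquet('my_big_file_modified')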

Transforming Python Lambda function without return value to Pyspark

I have a working lambda function in Python that computes the highest similarity between each string in dataset1 and the strings in dataset2. During an iteration, it writes the string, the best match and the similarity, together with some other information, to BigQuery. There is no return value, as the purpose of the function is to insert a row into a BigQuery dataset. This process takes rather long, which is why I wanted to use PySpark and Dataproc to speed it up.
Converting the pandas dataframes to Spark was easy. I am having trouble registering my udf, because it has no return value and PySpark requires one. In addition, I don't understand how to map the pandas 'apply' function to the PySpark variant. So basically my question is how to transform the Python code below to work on a Spark dataframe.
The following code works in a regular Python environment:
def embargomatch(name, code, embargo_names):
    # find best match
    # insert best match and additional information into bigquery
    ...

customer_names.apply(lambda x: embargomatch(x['name'], x['customer_code'], embargo_names), axis=1)
Because pyspark requires a return type, I added 'return 1' to the udf and tried the following:
customer_names = spark.createDataFrame(customer_names)
from pyspark.sql.types import IntegerType
embargo_match_udf = udf(lambda x: embargomatch(x['name'], x['customer_code'], embargo_names), IntegerType())
Now I'm stuck trying to apply the select function, as I don't know what parameters to give it.
I suspect you're stuck on how to pass multiple columns to the udf -- here's a good answer to that question: Pyspark: Pass multiple columns in UDF.
Rather than creating a udf based on a lambda that wraps your function, consider simplifying by creating a udf based on embargomatch directly.
embargo_names = ...

# The parameters here are the columns passed into the udf
def embargomatch(name, customer_code):
    pass

embargo_match_udf = udf(embargomatch, IntegerType())
customer_names.select(embargo_match_udf('name', 'customer_code').alias('column_name'))
That being said, it's a bit suspect that your udf doesn't return anything -- I generally see udfs as a way to add columns to a dataframe, not as a way to produce side effects. If you want to insert records into bigquery, consider doing something like this:
customer_names.select('column_name').write.parquet('gs://some/path')
os.system("bq load --source_format=PARQUET [DATASET].[TABLE] gs://some/path")

Filter df when values matches part of a string in pyspark

I have a large pyspark.sql.dataframe.DataFrame and I want to keep (so filter) all rows where the URL saved in the location column contains a pre-determined string, e.g. 'google.com'.
I have tried:
import pyspark.sql.functions as sf
df.filter(sf.col('location').contains('google.com')).show(5)
But this throws:
TypeError: 'Column' object is not callable
How do I go around and filter my df properly?
Spark 2.2 onwards
df.filter(df.location.contains('google.com'))
Spark 2.2 documentation link
Spark 2.1 and before
You can use plain SQL in filter
df.filter("location like '%google.com%'")
or with DataFrame column methods
df.filter(df.location.like('%google.com%'))
Spark 2.1 documentation link
pyspark.sql.Column.contains() is only available in pyspark version 2.2 and above.
df.where(df.location.contains('google.com'))
When filtering a DataFrame with string values, I find that the pyspark.sql.functions lower and upper come in handy, if your data could have column entries like "foo" and "Foo":
import pyspark.sql.functions as sql_fun
result = source_df.filter(sql_fun.lower(source_df.col_name).contains("foo"))
You can try the following expression, which helps you search for multiple strings at the same time:
df.filter(""" location rlike 'google.com|amazon.com|github.com' """)
