TL;DR:
“What is the best way to join thousands of Spark
dataframes? Can this join be parallelized? Neither approach is working
for me.”
I am trying to join thousands of single-column dataframes (each with a PK column to join on) and then persist the resulting DF to Snowflake.
Joining around 400 such dataframes (5M rows x 2 columns each) in a loop takes over 3 hours to complete on a standalone 32-core/350 GB Spark cluster, which I think shouldn't matter because of pushdown. After all, shouldn't Spark just be building the DAG for lazy evaluation at this point?
Here is my Spark config:
spark = SparkSession \
    .builder \
    .appName("JoinTest") \
    .config("spark.master", "spark://localhost:7077") \
    .config("spark.ui.port", 8050) \
    .config("spark.jars", "../drivers/spark-snowflake_2.11-2.5.2-spark_2.4.jar,../drivers/snowflake-jdbc-3.9.1.jar") \
    .config("spark.driver.memory", "100g") \
    .config("spark.driver.maxResultSize", 0) \
    .config("spark.executor.memory", "64g") \
    .config("spark.executor.instances", "6") \
    .config("spark.executor.cores", "4") \
    .config("spark.cores.max", "32") \
    .getOrCreate()
And the JOIN loop:
def combine_spark_results(results, joinKey):
    # Extract first to get going
    # TODO: validations
    resultsDF = results[0]
    i = len(results)
    print("Joining Spark DFs..")
    for result in results[1:]:
        print(i, end=" ", flush=True)
        i -= 1
        resultsDF = resultsDF.join(result, joinKey, 'outer')
    return resultsDF
I considered parallelizing the joins in a merge-sort fashion using starmap_async(); however, the issue is that a Spark DF cannot be returned from another thread. I also considered broadcasting the main dataframe from which all the joinable single-column dataframes were created,
spark.sparkContext.broadcast(data)
but that throws the same error as attempting to return a joined DF from another thread, viz.
PicklingError: Could not serialize broadcast: Py4JError: An error
occurred while calling o3897.__getstate__. Trace: py4j.Py4JException:
Method __getstate__([]) does not exist
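For reference, this is roughly what I mean by joining in a merge-sort fashion: a minimal, single-threaded sketch (no multiprocessing, so it sidesteps the pickling error, but it only reshapes the lazy plan rather than truly parallelizing anything). results and joinKey are the same as in the loop above.

def pairwise_join(dfs, joinKey):
    # Join neighbouring dataframes pairwise until one remains,
    # halving the list on every pass (merge-sort style)
    while len(dfs) > 1:
        paired = [left.join(right, joinKey, 'outer')
                  for left, right in zip(dfs[0::2], dfs[1::2])]
        if len(dfs) % 2 == 1:
            paired.append(dfs[-1])  # carry the odd one over unchanged
        dfs = paired
    return dfs[0]

resultsDF = pairwise_join(results, joinKey)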
How can I solve this problem?
Please feel free to ask if you need any more info. Thanks in advance.
Related
I'm trying to understand how to leverage cache() to improve performance. Since cache retains a DataFrame in memory "for reuse", it seems like I need to understand the conditions that evict the DataFrame from memory in order to leverage it properly.
After I define transformations and call an action, is the dataframe gone from memory once the action completes? That would imply that if I execute an action on a dataframe but then continue to do other things with the data, all the previous parts of the DAG, from the read up to the action, will need to be redone.
Is this accurate?
The fact is that after an action is executed, another DAG is created. You can check this via the Spark UI.
In code, you can identify where one DAG ends and a new one starts by looking for actions.
When looking at the code you can use this simple rule:
When a function transforms one df into another df, it is not an action but only a lazily evaluated transformation, even if it is a join or something else that requires shuffling.
When a function returns a value other than a df, you are using an action (for example count, which returns a Long).
Let's take a look at this code (it's Scala, but the API is similar). I created this example just to show you the mechanism; it could of course be done better :)
import org.apache.spark.sql.functions.{col, lit, format_number}
import org.apache.spark.sql.{DataFrame, Row}
val input = spark.read.format("csv")
.option("header", "true")
.load("dbfs:/FileStore/shared_uploads/***#gmail.com/city_temperature.csv")
val dataWithTempInCelcius: DataFrame = input
.withColumn("avg_temp_celcius",format_number((col("AvgTemperature").cast("integer") - lit(32)) / lit(1.8), 2).cast("Double"))
val dfAfterGrouping: DataFrame = dataWithTempInCelcius
.filter(col("City") === "Warsaw")
.groupBy(col("Year"), col("Month"))
.max("avg_temp_celcius")//Not an action, we are still doing transofrmation from df to another df
val maxTemp: Row = dfAfterGrouping
.orderBy(col("Year"))
.first() // <--- first is our first action, output from first is a Row and not df
dataWithTempInCelcius
.filter(col("City") === "Warsaw")
.count() // <--- count is our second action
Here you can see what the problem is. It looks like between the first and the second action I am doing a transformation that was already done in the first DAG. That intermediate result was not cached, so in the second DAG Spark cannot get the post-filter dataframe from memory, which leads to recomputation: Spark fetches the data again, applies the filter, and then computes the count.
In the Spark UI you will find two separate DAGs, and both of them read the source csv.
If you cache the intermediate result after the first .filter(col("City") === "Warsaw") and then use this cached DF for the grouping and the count, you will still find two separate DAGs (the number of actions has not changed), but this time the plan for the second DAG will show "In memory table scan" instead of a csv read, which means Spark is reading the data from the cache.
Now you can see the in-memory relation in the plan. There is still a "read csv" node in the DAG, but as you can see, for the second action it is skipped (0 bytes read).
** I am using a Databricks cluster with Spark 3.2; the Spark UI may look different in your environment.
Quote...
Using cache() and persist() methods, Spark provides an optimization mechanism to store the intermediate computation of a Spark DataFrame so they can be reused in subsequent actions.
See https://sparkbyexamples.com/spark/spark-dataframe-cache-and-persist-explained/
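To tie this back to the example above, here is a minimal PySpark sketch of the same caching idea (the file path and column names are just illustrative): cache the filtered frame once, so the second action is served from memory instead of re-reading the csv.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Illustrative input; substitute your own csv
temps = spark.read.option("header", "true").csv("/tmp/city_temperature.csv")

warsaw = temps.filter(col("City") == "Warsaw").cache()  # mark for reuse

# First action: materializes the filter and fills the cache
warsaw.groupBy("Year", "Month").count().show()

# Second action: served from the in-memory cache, not a second csv read
print(warsaw.count())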
I've been trying to adapt my code to use Dask so that processing can be spread across multiple machines. While the initial data load is not time-consuming, the subsequent processing takes roughly 12 hours on an 8-core i5. That isn't ideal, and I figured that using Dask to spread the processing across machines would help. The following code works fine with the standard Pandas approach:
import pandas as pd

artists = pd.read_csv("artists.csv")
print(f"... shape before cleaning {artists.shape}")

artists["name"] = artists["name"].astype("str")
artists["name"] = (
    artists["name"]
    .str.encode("ascii", "ignore")
    .str.decode("ascii")
    .str.lower()
    .str.replace("&", " and ", regex=False)
    .str.strip()
)
Converting to Dask seemed straightforward, but I'm hitting hiccups along the way. The following Dask-adapted code throws a ValueError: cannot reindex from a duplicate axis error:
import dask.dataframe as dd
from dask.distributed import Client

artists = dd.read_csv("artists.csv")
print(f"... shape before cleaning {artists.shape}")

artists["name"] = artists["name"].astype(str).compute()
artists["name"] = (
    artists["name"]
    .str.encode("ascii", "ignore")
    .str.decode("ascii")
    .str.lower()
    .str.replace("&", " and ", regex=False)
    .str.strip()
    .compute()
)

if __name__ == '__main__':
    client = Client()
The best I can discern is that Dask won't allow reassignment to an existing Dask DataFrame. So this works:
...
artists_new = artists["name"].astype("str").compute()
...
However, I really don't want to create a new DataFrame each time. I'd rather replace the existing DataFrame with a new one, mainly because I have multiple data cleaning steps before processing.
While the tutorial and guides are useful, they are pretty basic and don't cover such use cases.
What are the preferred approaches here with Dask DataFrames?
Every time you call .compute() on a Dask dataframe or series, it is converted into a pandas object. So what is happening in this line
artists["name"] = artists["name"].astype(str).compute()
is that you are computing the string column and then assigning a pandas series back to a Dask series (without ensuring alignment of partitions), which is what triggers the reindex error. The solution is to call .compute() only on the final result, while intermediate steps can use regular pandas syntax:
# modified example (.compute is removed)
artists["name"] = artists["name"].astype(str).str.lower()
My problem is similar to one posted here (Spark join exponentially slow), but that answer didn't help solve mine.
In summary, I am performing a simple join on seven 10x2 toy Spark dataframes. The data is fetched from Snowflake using Spark.
Basically, they are all single-column dataframes with a monotonically_increasing_id column slapped on to help with the join, and it takes forever to return when I evaluate (count()) the result.
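For context, each single-column result gets its id roughly like this (a simplified illustration with a stand-in column name, not my exact code):

from pyspark.sql import SparkSession
from pyspark.sql.functions import monotonically_increasing_id

spark = SparkSession.builder.getOrCreate()

# A stand-in single-column result, tagged with an id column to join on
result = (spark.createDataFrame([(float(x),) for x in range(10)], ["some_metric"])
          .withColumn("id", monotonically_increasing_id()))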
Here is the JOIN routine:
def combine_spark_results(results):
    # Extract first to get going
    # TODO: validations
    resultsDF = results[0]
    i = len(results)
    # Start the joining dance for rest
    for result in results[1:]:
        print(i, end=" ", flush=True)
        i -= 1
        resultsDF = resultsDF.join(result, 'id', 'outer')
    resultsDF = resultsDF.drop('id')
    return resultsDF

resultsDF = combine_spark_results(aggr_out)
And then the picayune mammoth:
resultsDF.count()
For some reason, running count on this seemingly simple result drives Spark berserk with 1000+ tasks across 14 stages.
As mentioned before, I am working with Snowflake on Spark, which enables pushdown optimization by default. Since Snowflake inserts its plan into Catalyst, it seemed relevant to mention this.
Here are the Spark-Snowflake driver details:
Spark version: 2.4.4
Python 3
spark-snowflake_2.11-2.5.2-spark_2.4.jar
snowflake-jdbc-3.9.1.jar
I also attempted disabling pushdown, but I suspect it doesn't really take effect here.
Is there a better way of doing this? Am I missing out on something obvious?
Thanks in advance.
I'm trying to create a custom join for two dataframes (df1 and df2) in PySpark (similar to this), with code that looks like this:
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType
my_join_udf = udf(lambda x, y: isJoin(x, y), BooleanType())
my_join_df = df1.join(df2, my_join_udf(df1.col_a, df2.col_b))
The error message I'm getting is:
java.lang.RuntimeException: Invalid PythonUDF PythonUDF#<lambda>(col_a#17,col_b#0), requires attributes from more than one child
Is there a way to write a PySpark UDF that can process columns from two separate dataframes?
Spark 2.2+
You have to use crossJoin or enable cross joins in the configuration:
df1.crossJoin(df2).where(my_join_udf(df1.col_a, df2.col_b))
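If you would rather keep the plain join syntax, the configuration route looks roughly like this in PySpark (spark is your SparkSession; df1, df2 and my_join_udf come from the question; the spark.sql.crossJoin.enabled flag applies to Spark 2.x and defaults to enabled from Spark 3.0):

# Allow the optimizer to plan the Cartesian product behind the UDF predicate
spark.conf.set("spark.sql.crossJoin.enabled", "true")

my_join_df = df1.join(df2).where(my_join_udf(df1.col_a, df2.col_b))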
Spark 2.0, 2.1
Method shown below doesn't work anymore in Spark 2.x. See SPARK-19728.
Spark 1.x
Theoretically you can join and filter:
df1.join(df2).where(my_join_udf(df1.col_a, df2.col_b))
but in general you shouldn't do it at all. Any type of join which is not based on equality requires a full Cartesian product (the same as the answer above), which is rarely acceptable (see also Why using a UDF in a SQL query leads to cartesian product?).
I have a Spark application with several points where I would like to persist the current state. This is usually after a large step, or when caching a state that I would like to use multiple times. It appears that when I call cache on my dataframe a second time, a new copy is cached to memory. In my application, this leads to memory issues when scaling up. Even though a given dataframe is at most about 100 MB in my current tests, the cumulative size of the intermediate results grows beyond the allotted memory on the executor. See below for a small example that shows this behavior.
cache_test.py:
from pyspark import SparkContext
from pyspark.sql import HiveContext

spark_context = SparkContext(appName='cache_test')
hive_context = HiveContext(spark_context)

df = (hive_context.read
      .format('com.databricks.spark.csv')
      .load('simple_data.csv')
      )
df.cache()
df.show()

df = df.withColumn('C1+C2', df['C1'] + df['C2'])
df.cache()
df.show()

spark_context.stop()
simple_data.csv:
1,2,3
4,5,6
7,8,9
Looking at the application UI, there is a copy of the original dataframe in addition to the one with the new column. I can remove the original copy by calling df.unpersist() before the withColumn line. Is this the recommended way to remove a cached intermediate result (i.e. call unpersist before every cache())?
Also, is it possible to purge all cached objects? In my application, there are natural breakpoints where I can simply purge all memory and move on to the next file. I would like to do this without creating a new spark application for each input file.
Thank you in advance!
Spark 2.x
You can use Catalog.clearCache:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
...
spark.catalog.clearCache()
Spark 1.x
You can use SQLContext.clearCache method which
Removes all cached tables from the in-memory cache.
from pyspark.sql import SQLContext
from pyspark import SparkContext
sqlContext = SQLContext.getOrCreate(SparkContext.getOrCreate())
...
sqlContext.clearCache()
We use this quite often
for (id, rdd) in sc._jsc.getPersistentRDDs().items():
    rdd.unpersist()
    print("Unpersisted {} rdd".format(id))
where sc is a sparkContext variable.
When you use cache() on a dataframe, it is one of the transformations and gets evaluated lazily, i.e. only when you perform an action on it such as count() or show().
In your case, after the first cache you call show(), which is why the dataframe gets cached in memory. You then perform another transformation on the dataframe to add a column, cache the new dataframe, and call show() again, which caches the second dataframe in memory. If your dataframes are big enough that memory can only hold one of them, caching the second dataframe will evict the first one, because there is not enough space to hold both.
Thing to keep in mind: you should not cache a dataframe unless you are using it in multiple actions; otherwise it is an overhead in terms of performance, as caching is itself a costly operation.
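A short sketch of the unpersist-before-recache pattern from the question, using the same example as above, so that only one copy stays cached at a time (a sketch of the idea, not a general recommendation):

df = hive_context.read.format('com.databricks.spark.csv').load('simple_data.csv')
df.cache()
df.show()

# Finished with the raw frame: drop it from the cache before caching the derived one
df.unpersist()

df = df.withColumn('C1+C2', df['C1'] + df['C2'])
df.cache()
df.show()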