Joining multiple dataframes extremely slow on Spark - python

My problem is similar to one posted here (Spark join exponentially slow), but that thread didn't help solve mine.
In summary, I am performing a simple join on seven 10x2 toy Spark DataFrames. The data is fetched from Snowflake using Spark.
Basically, they are all single-column dataframes slapped with monotonically_increasing_id's to help with the join, and it takes forever to return when I evaluate the result with count().
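(Each of those columns presumably gets its id along these lines; the dataframe and column names here are placeholders, not my actual code:)
from pyspark.sql import functions as F

# Hypothetical illustration of the setup described above: a single-column
# result gets an 'id' column tacked on so it can be joined with the others.
single_col_df = some_df.select('metric').withColumn('id', F.monotonically_increasing_id())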
Here is the JOIN routine:
def combine_spark_results(results):
    # Extract first to get going
    # TODO: validations
    resultsDF = results[0]
    i = len(results)
    # Start the joining dance for the rest
    for result in results[1:]:
        print(i, end=" ", flush=True)
        i -= 1
        resultsDF = resultsDF.join(result, 'id', 'outer')
    resultsDF = resultsDF.drop('id')
    return resultsDF
resultsDF = combine_spark_results(aggr_out)
And then the picayune mammoth:
resultsDF.count()
For some reason, running count on the seemingly simple result drives Spark berserk with 1000+ tasks across 14 stages.
As mentioned before, I am working with Snowflake on Spark, which enables pushdown optimization by default. Since Snowflake inserts its own plan into Catalyst, this seems relevant to mention.
Here are the Spark-SF driver details:
Spark version: 2.4.4
Python 3
spark-snowflake_2.11-2.5.2-spark_2.4.jar
snowflake-jdbc-3.9.1.jar
I also attempted disabling pushdown, but I suspect it doesn't actually take effect here.
Is there a better way of doing this? Am I missing out on something obvious?
Thanks in advance.

Related

Does Spark eject DataFrame data from memory every time an action is executed?

I'm trying to understand how to leverage cache() to improve my performance. Since cache retains a DataFrame in memory "for reuse", it seems like I need to understand the conditions that eject the DataFrame from memory in order to leverage it properly.
After defining transformations, I call an action. Once the action completes, is the dataframe gone from memory? That would imply that if I execute an action on a dataframe but then continue to do other things with the data, all the previous parts of the DAG, from the read to the action, will need to be redone.
Is this accurate?
The fact is that after an action is executed, another DAG is created. You can check this via the Spark UI.
In code, you can identify where one DAG ends and a new one starts by looking for actions.
When looking at code, you can use this simple rule:
When a function transforms one df into another df, it is not an action but only a lazily evaluated transformation, even if it's a join or something else that requires shuffling.
When a function returns a value other than a df, you are using an action (for example count, which returns a Long).
Let's take a look at this code (it's Scala, but the API is similar). I created this example just to show the mechanism; it could be done better, of course :)
import org.apache.spark.sql.functions.{col, lit, format_number}
import org.apache.spark.sql.{DataFrame, Row}

val input = spark.read.format("csv")
  .option("header", "true")
  .load("dbfs:/FileStore/shared_uploads/***#gmail.com/city_temperature.csv")

val dataWithTempInCelcius: DataFrame = input
  .withColumn("avg_temp_celcius", format_number((col("AvgTemperature").cast("integer") - lit(32)) / lit(1.8), 2).cast("Double"))

val dfAfterGrouping: DataFrame = dataWithTempInCelcius
  .filter(col("City") === "Warsaw")
  .groupBy(col("Year"), col("Month"))
  .max("avg_temp_celcius") // Not an action, we are still doing a transformation from one df to another df

val maxTemp: Row = dfAfterGrouping
  .orderBy(col("Year"))
  .first() // <--- first is our first action; the output from first is a Row, not a df

dataWithTempInCelcius
  .filter(col("City") === "Warsaw")
  .count() // <--- count is our second action
Here you can see the problem. Between the first and second actions, I am doing a transformation that was already done in the first DAG. This intermediate result was not cached, so in the second DAG Spark cannot get the post-filter dataframe from memory, which leads to recomputation: Spark fetches the data again, applies the filter, and then calculates the count.
In the Spark UI you will find two separate DAGs, and both of them read the source CSV.
If you cache the intermediate result after the first .filter(col("City") === "Warsaw") and then use this cached DF for the grouping and the count, you will still find two separate DAGs (the number of actions has not changed), but this time the plan for the second DAG will show "InMemoryTableScan" instead of a CSV read, which means Spark is reading the data from the cache.
Now you can see the in-memory relation in the plan. There is still a read-CSV node in the DAG, but for the second action it is skipped (0 bytes read).
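A rough PySpark equivalent of that caching step (names mirror the Scala example above; this is a sketch, not tested on that dataset):
from pyspark.sql import functions as F

# Cache the filtered frame so both actions reuse it instead of re-reading the CSV.
warsaw = data_with_temp_in_celcius.filter(F.col("City") == "Warsaw").cache()

# Action 1: still triggers a job, and materializes the filter output in memory.
max_temp_row = warsaw.groupBy("Year", "Month").max("avg_temp_celcius").orderBy("Year").first()

# Action 2: its plan now shows an InMemoryTableScan instead of a fresh CSV read.
warsaw_count = warsaw.count()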
** I am using a Databricks cluster with Spark 3.2; the Spark UI may look different in your environment.
Quote...
Using cache() and persist() methods, Spark provides an optimization mechanism to store the intermediate computation of a Spark DataFrame so they can be reused in subsequent actions.
See https://sparkbyexamples.com/spark/spark-dataframe-cache-and-persist-explained/

When does Spark dataframe pull the data from source

I am new to Spark dataframes and a little confused about how they work. I have two similar pieces of code which take very different amounts of time to complete. Any explanation would be really helpful:
selectQuery = "SELECT * FROM items limit 1"
items_df = spark_session.sql(selectQuery) # completes in ~4 seconds
data_collect = items_df.collect() # completes in ~50 seconds
selectQuery = "SELECT * FROM items ORDER BY ingest_date limit 1"
items_df = spark_session.sql(selectQuery) # completes in ~4 seconds
data_collect = items_df.collect() # completes in ~20 minutes
My thought process here was that spark_session.sql(selectQuery) is the code that actually pulls the data from the source and puts it in application memory as a dataframe, and that collect() simply converts that dataframe into a Python list.
But clearly, collect() depends on the query as well.
PS: I saw an answer in this thread where someone mentions that collect() triggers the actual query execution. I do not quite understand this.
Pyspark performance: dataframe.collect() is very slow
When you create a dataframe, you can apply two kinds of methods: transformations and actions.
Transformations are lazy, meaning they are not executed right away. Transformations are, for example: withColumn, where, select... A transformation always returns a dataframe. When you call one, Spark just checks that it is valid: for example, withColumn is a transformation, and when you apply it to your dataframe, the validity of your request is checked immediately but nothing is executed:
from pyspark.sql import functions as F
df = spark.range(10)
df.withColumn("id_2", F.col("id") * 2) # This one is OK
df.withColumn("id_2", F.col("foo") * 2) # This one will fail because foo does not exist
Actions are the methods that execute all the transformations that have been stacked. Actions are, for example, count, collect, show.
At the moment you apply an action, Spark retrieves the data from where it is stored, applies all the transformations you asked for previously, and returns a result depending on the action (a list for collect, a number for count).
In your case, the creation of the dataframe is a transformation. It is lazy; only its validity is checked. That is why both queries take approximately the same amount of time.
But collecting is the moment when Spark actually retrieves the data. So depending on the amount of data in the tables (and the transformations you asked for), it will take more or less time to complete.
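If you want to see this for yourself, a quick check (a sketch reusing the table name from the question) is to print the plan before collecting:
# Defining the dataframe only builds a plan; no table scan happens yet.
items_df = spark_session.sql("SELECT * FROM items ORDER BY ingest_date limit 1")
items_df.explain()         # shows the physical plan, still no data read
rows = items_df.collect()  # only now does Spark scan and sort the table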

Combining two dataframes with pandas/numpy

I'm trying to write a script that gets data from two different databases and produces a CSV file with the combined data.
I managed to get the data using psycopg2 and pandas read_sql_query, turned the results into two dataframes, and all of that works great. I wrote all of that with only a little information about the real databases, so I used databases I had and some simple queries.
All of that is on my github:
https://github.com/tomasz-urban/sql-db-get
With more detailed info about what needs to be done I'm stuck...
In the first database there are user limits: lim_val_1 and lim_val_2, together with user_id (a couple thousand rows). The second one holds usage, with val_1 and val_2 gathered every so often (a couple hundred thousand rows).
I need to get the rows where users reach their limits (it doesn't matter if it is lim_val_1 or lim_val_2 or both; I need all of them).
To visualize it better there are some simple tables in the link:
Databases info with output
My last approach:
result_query = df2.loc[(df2['val_1'] == df1['lim_val_1']) & (df2['val_2'] == df1['lim_val_2'])]
output_data = pd.DataFrame(result_query)
and I'm getting an error:
"ValueError: Can only compare identically-labeled Series objects"
I cannot label those columns the same, so I think this solution will not work for me. I also tried merging, with no result.
Anyone could help me with this ?
Since df1 and df2 have different numbers of rows, you cannot compare the values directly. Try joining the dataframes and then selecting the rows where the conditions are met.
If user_id is the index of both dataframes:
df_3 = df1.join(df2)
df_3[
    (df_3['val_1'] == df_3['lim_val_1']) |
    (df_3['val_2'] == df_3['lim_val_2'])
]
You might want to replace == with >= in case the limits can be surpassed.
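If user_id is still an ordinary column in both frames, a sketch along these lines sets the index first (column names are taken from the question, the output path is made up, and >= is used as suggested above):
import pandas as pd  # assuming df1 and df2 are the frames described above

# Align both frames on user_id, then keep rows where either limit is reached.
df_3 = df1.set_index('user_id').join(df2.set_index('user_id'))
over_limit = df_3[
    (df_3['val_1'] >= df_3['lim_val_1']) |
    (df_3['val_2'] >= df_3['lim_val_2'])
]
over_limit.to_csv('combined_output.csv')  # hypothetical output file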

How to JOIN hundreds of Spark dataframes efficiently?

TL;DR;
“What is the most optimal way of joining thousands of Spark dataframes? Can we parallelize this join? Since both do not work for me.”
I am trying to join thousands of single-column dataframes (with a PK column for the join) and then persist the result DF to Snowflake.
The join operation, looping over 400 such (5m x 2) dataframes, takes over 3 hours to complete on a standalone 32-core/350g Spark cluster, which I think shouldn't matter because of pushdown. After all, shouldn't all Spark be doing at this point be building the DAG for lazy evaluation?
Here is my Spark config:
spark = SparkSession \
    .builder \
    .appName("JoinTest") \
    .config("spark.master", "spark://localhost:7077") \
    .config("spark.ui.port", 8050) \
    .config("spark.jars", "../drivers/spark-snowflake_2.11-2.5.2-spark_2.4.jar,../drivers/snowflake-jdbc-3.9.1.jar") \
    .config("spark.driver.memory", "100g") \
    .config("spark.driver.maxResultSize", 0) \
    .config("spark.executor.memory", "64g") \
    .config("spark.executor.instances", "6") \
    .config("spark.executor.cores", "4") \
    .config("spark.cores.max", "32") \
    .getOrCreate()
And the JOIN loop:
def combine_spark_results(results, joinKey):
    # Extract first to get going
    # TODO: validations
    resultsDF = results[0]
    i = len(results)
    print("Joining Spark DFs..")
    for result in results[1:]:
        print(i, end=" ", flush=True)
        i -= 1
        resultsDF = resultsDF.join(result, joinKey, 'outer')
    return resultsDF
I considered parallelizing the joins in a merge-sort fashion using starmap_async(); however, the issue is that a Spark DF cannot be returned from another thread. I also considered broadcasting the main dataframe from which all the joinable single-column dataframes were created,
spark.sparkContext.broadcast(data)
but that throws the same error as attempting to return a joined DF from another thread, viz.
PicklingError: Could not serialize broadcast: Py4JError: An error occurred while calling o3897.__getstate__. Trace: py4j.Py4JException: Method __getstate__([]) does not exist
How can I solve this problem?
Please feel free to ask if you need any more info. Thanks in advance.
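(For illustration, the "merge-sort fashion" pairing mentioned above can also be built in a single driver thread, which avoids returning DataFrames from other threads entirely; this is only a sketch of that idea, not a verified fix for the runtime:)
def combine_pairwise(results, joinKey):
    # Join dataframes in pairs until one remains, producing a balanced join tree
    # instead of a left-leaning chain hundreds of joins deep.
    dfs = list(results)
    while len(dfs) > 1:
        paired = [dfs[i].join(dfs[i + 1], joinKey, 'outer')
                  for i in range(0, len(dfs) - 1, 2)]
        if len(dfs) % 2 == 1:
            paired.append(dfs[-1])  # carry the odd one over to the next round
        dfs = paired
    return dfs[0]

resultsDF = combine_pairwise(results, joinKey)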

Spark dataFrame taking too long to display after updating its columns

I have a dataFrame of approx. 4 million rows and 35 columns as input.
All I do to this dataFrame is the following steps:
For a list of given columns, I calculate a sum over a given list of group features and join it as a new column to my input dataFrame.
I drop each new sum column right after joining it to the dataFrame.
Therefore we end up with the same dataFrame as we started from (in theory).
However, I noticed that if my list of given columns gets too big (more than 6 columns), the output dataFrame becomes impossible to manipulate. Even a simple display takes 10 minutes.
Here is an example of my code (df is my input dataFrame):
for c in list_columns:
    df = df.join(df.groupby(list_group_features).agg(sum(c).alias('sum_' + c)), list_group_features)
    df = df.drop('sum_' + c)
This happens due to the inner workings of Spark and its lazy evaluation.
When you call groupby, join, or agg, Spark attaches these calls to the plan of the df object. So even though nothing is executed on the data, you are building a large execution plan that is stored internally in the Spark DataFrame object.
Only when you call an action (show, count, write, etc.) does Spark optimize the plan and execute it. If the plan is too large, the optimization step can take a while. Also remember that plan optimization happens on the driver, not on the executors, so if your driver is busy or overloaded, it delays the plan optimization step as well.
It is useful to remember that joins are expensive operations in Spark, both for optimization and execution. If you can, you should always avoid joins when operating on a single DataFrame and utilise the window functionality instead. Joins should only be used if you are joining different dataframes from different sources (different tables).
A way to optimize your code would be:
import pyspark
import pyspark.sql.functions as f
w = pyspark.sql.Window.partitionBy(list_group_features)
agg_sum_exprs = [f.sum(f.col(c)).over(w).alias("sum_" + c) for c in list_columns]
res_df = df.select(df.columns + agg_sum_exprs)
This should be scalable and fast for large list_group_features and list_columns lists.
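A quick way to sanity-check the window version on a toy frame (the Spark session and column names here are invented for the example):
import pyspark
import pyspark.sql.functions as f
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumed session for the toy check
toy = spark.createDataFrame([("a", 1), ("a", 2), ("b", 5)], ["grp", "x"])
w_toy = pyspark.sql.Window.partitionBy("grp")
# Each row gets the total of x within its grp: 3 for 'a', 5 for 'b'.
toy.select(toy.columns + [f.sum(f.col("x")).over(w_toy).alias("sum_x")]).show()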
