Pyspark iterative column addition memory leak - python

I have been attempting to perform some iterative computation on pyspark dataframes. Columns are added to the df based on previous columns. However, I am noticing that the memory used keeps increasing. A simple example is shown below:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext, Row

conf = SparkConf().setAppName("myFirstApp").setMaster("local")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

df = [Row(Z_0=0.0, Z_1=0.0)]
df = sc.parallelize(df).toDF()

for each in range(0, 400):
    df = df.withColumn("Z_" + str(each + 2), df['Z_' + str(each + 1)] + 1)
It is my understanding that I am in fact building an execution plan, not necessarily the data itself. However, triggering execution of the df with collect(), count(), show(), conversion to an rdd, or even deleting the df fails to release memory. I have been seeing 1.2 GB of memory used for the above task. It seems like garbage collection has no way of cleaning up the previous intermediate df objects, or perhaps these objects are never de-referenced.
Is there a better method of building out this type of iterative calculation, or is there any way to clean up these intermediate df's? Note that the simple +1 here is just a minimal mock of much more complex calculations.

I have been dealing with the same thing and haven't come up with a good solution.
As a temporary solution:
Split the application into many .py files and execute them one by one, which will cause garbage collection to free all unnecessary cache. A driver script for this is sketched below.
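This is only a sketch of that workaround: the stage file names are illustrative, and spark-submit is assumed to be on the PATH.

import subprocess

# Run each stage as its own Spark application so the JVM (and everything it
# has accumulated) is torn down between stages
for stage_script in ["stage_1.py", "stage_2.py", "stage_3.py"]:
    subprocess.run(["spark-submit", stage_script], check=True)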

I have found that you can call take() in order to throw away the execution plan, leaving just the values. See the last line of the loop body for the relevant call.
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext, Row

conf = SparkConf().setAppName("myFirstApp").setMaster("local")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

df = [Row(Z_0=0.0, Z_1=0.0)]
df = sc.parallelize(df).toDF()

for each in range(0, 1400):
    df = df.withColumn("Z_" + str(each + 2), df['Z_' + str(each + 1)] + 1)
    # take() materializes the rows, so re-parallelizing them drops the accumulated plan
    df = sc.parallelize(df.take(df.count())).toDF()
My statement in the question about garbage collection being the issue is not quite correct. There is a difference between the heap size and the used heap. Investigating with VisualVM, it was easy to see that garbage collection is occurring, which reduces the used heap.
The profile shows the problem the JVM has in processing the code posted in the question. Towards the end there is no room left to move: the heap size is maxed out and the used heap is too big, with nothing left to GC. This growth was not due to the data itself, but to the data lineage information being retained. What I needed to do was get rid of all of that lineage, which to be honest isn't all that useful in this problem's context, and retain just the data.
Profiling the answer's code snippet above shows that, even with 1400 columns, there is little problem holding the data.
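An alternative worth noting: since Spark 2.1, DataFrame.checkpoint() materializes the data and truncates the lineage in a similar way. A minimal sketch, assuming a checkpoint directory is available; checkpointing every 100 iterations is an arbitrary choice.

from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext, Row

conf = SparkConf().setAppName("myFirstApp").setMaster("local")
sc = SparkContext(conf=conf)
sc.setCheckpointDir("/tmp/spark-checkpoints")  # any writable directory
sqlContext = SQLContext(sc)

df = sc.parallelize([Row(Z_0=0.0, Z_1=0.0)]).toDF()

for each in range(0, 1400):
    df = df.withColumn("Z_" + str(each + 2), df['Z_' + str(each + 1)] + 1)
    if each % 100 == 0:
        # checkpoint() (eager by default) writes the current rows to the
        # checkpoint directory and returns a DataFrame with a truncated plan
        df = df.checkpoint()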

Related

Does Spark eject DataFrame data from memory every time an action is executed?

I'm trying to understand how to leverage cache() to improve my performance. Since cache retains a DataFrame in memory "for reuse", it seems like I need to understand the conditions that eject the DataFrame from memory in order to leverage it well.
After defining transformations, I call an action. Once the action completes, is the dataframe gone from memory? That would imply that if I execute an action on a dataframe but continue to do other things with the data, all the previous parts of the DAG, from the read to the action, will need to be redone.
Is this accurate?
The fact is that after an action is executed, another DAG is created. You can check this in the Spark UI.
In code, you can identify where one DAG ends and a new one starts by looking for actions.
When looking at code, you can use this simple rule:
When a function transforms one df into another df, it's not an action but only a lazily evaluated transformation, even if it is a join or something else that requires shuffling.
When a function returns a value other than a df, you are using an action (for example count, which returns a Long).
Let's take a look at this code (it's Scala, but the API is similar). I created this example just to show you the mechanism; this could be done better, of course :)
import org.apache.spark.sql.functions.{col, lit, format_number}
import org.apache.spark.sql.{DataFrame, Row}

val input = spark.read.format("csv")
  .option("header", "true")
  .load("dbfs:/FileStore/shared_uploads/***#gmail.com/city_temperature.csv")

val dataWithTempInCelcius: DataFrame = input
  .withColumn("avg_temp_celcius",
    format_number((col("AvgTemperature").cast("integer") - lit(32)) / lit(1.8), 2).cast("Double"))

val dfAfterGrouping: DataFrame = dataWithTempInCelcius
  .filter(col("City") === "Warsaw")
  .groupBy(col("Year"), col("Month"))
  .max("avg_temp_celcius") // Not an action, we are still doing a transformation from one df to another df

val maxTemp: Row = dfAfterGrouping
  .orderBy(col("Year"))
  .first() // <--- first is our first action; the output from first is a Row and not a df

dataWithTempInCelcius
  .filter(col("City") === "Warsaw")
  .count() // <--- count is our second action
Here you can see the problem. Between the first and second action I am doing a transformation which was already done in the first DAG. This intermediate result was not cached, so in the second DAG Spark is unable to get the post-filter dataframe from memory, which leads to recomputation: Spark fetches the data again, applies our filter, and then calculates the count.
In the Spark UI you will find two separate DAGs, and both of them read the source CSV.
If you cache the intermediate result after the first .filter(col("City") === "Warsaw") and then use this cached DF for both the grouping and the count, you will still find two separate DAGs (the number of actions has not changed), but this time the plan for the second DAG will show "InMemoryTableScan" instead of a CSV read - that means Spark is reading the data from the cache.
Now you can see the in-memory relation in the plan. There is still a read-csv node in the DAG, but for the second action it is skipped (0 bytes read).
** I am using a Databricks cluster with Spark 3.2; the Spark UI may look different in your environment.
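For a PySpark reader, the same caching pattern looks roughly like this. This is a minimal sketch, not the answer's exact code: the file path is illustrative, and the column names are taken from the Scala example above.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit, format_number

spark = SparkSession.builder.getOrCreate()

input_df = (spark.read.format("csv")
            .option("header", "true")
            .load("dbfs:/FileStore/city_temperature.csv"))  # illustrative path

warsaw = (input_df
          .withColumn("avg_temp_celcius",
                      format_number((col("AvgTemperature").cast("integer") - lit(32)) / lit(1.8), 2)
                      .cast("double"))
          .filter(col("City") == "Warsaw")
          .cache())  # cache() is lazy; the first action below fills the cache

# First action: materializes the cache and computes the per-month maximum
warsaw.groupBy("Year", "Month").max("avg_temp_celcius").orderBy("Year").first()

# Second action: served by InMemoryTableScan instead of re-reading the CSV
warsaw.count()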
Quote...
Using cache() and persist() methods, Spark provides an optimization mechanism to store the intermediate computation of a Spark DataFrame so they can be reused in subsequent actions.
See https://sparkbyexamples.com/spark/spark-dataframe-cache-and-persist-explained/

Pandas parallel apply with koalas (pyspark)

I'm new to Koalas (pyspark). I was trying to use Koalas for a parallel apply, but it seemed like it was using a single core for the whole operation (correct me if I'm wrong), and I ended up using dask for the parallel apply (using map_partitions), which worked pretty well.
However, I would like to know if there's a way to utilize Koalas for parallel apply.
I used basic code for the operation, like below:
import pandas as pd
import databricks.koalas as ks
my_big_data = ks.read_parquet('my_big_file') # the file is a single-partition parquet file
my_big_data['new_column'] = my_big_data['string_column'].apply(my_prep) # my_prep does string operations
my_big_data.to_parquet('my_big_file_modified') # because Koalas does lazy evaluation
I found a link that discusses this problem: https://github.com/databricks/koalas/issues/1280
If the number of rows the function is applied to is less than 1,000 (the default value), a pandas dataframe will be used to do the operation.
The user-defined function my_prep above is applied to each row, so single-core pandas was being used.
In order to force it to work in a pyspark (parallel) manner, the user should modify the configuration as below.
import databricks.koalas as ks
ks.set_option('compute.default_index_type','distributed') # when .head() call is too slow
ks.set_option('compute.shortcut_limit',1) # force Koalas to use pyspark instead of the pandas shortcut
Also, explicitly specifying the return type (a type hint) on the user-defined function will keep Koalas off the shortcut path and make the apply run in parallel.
def my_prep(row) -> str:
    return row

kdf['my_column'].apply(my_prep)
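Putting the two pieces together, a minimal end-to-end sketch (the column names and the body of my_prep are only illustrative stand-ins for the real string operations):

import databricks.koalas as ks

ks.set_option('compute.default_index_type', 'distributed')  # when .head() is too slow
ks.set_option('compute.shortcut_limit', 1)                  # avoid the pandas shortcut path

def my_prep(value) -> str:        # the return-type hint also keeps Koalas off the shortcut path
    return value.strip().lower()  # stand-in for the real string operations

kdf = ks.read_parquet('my_big_file')
kdf['new_column'] = kdf['string_column'].apply(my_prep)
kdf.to_parquet('my_big_file_modified')  # lazy evaluation: this step triggers the distributed apply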

Joining multiple dataframes extremely slow on Spark

My problem is similar to the one posted here (Spark join exponentially slow), but that didn't help solve mine.
In summary, I am performing a simple join on seven 10x2 toy Spark Dataframes. Data is fetched from Snowflake using Spark.
Basically, they are all single-column dataframes slapped with monotonically_increasing_id's to help with the join, and it takes forever to return when I evaluate (count()) the result.
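For context, each input presumably gets its join key with something like this (a sketch; the single-column dataframe here is just a stand-in for one of the real results):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
single_col_df = spark.range(10).select(F.col('id').alias('value'))  # stand-in for one 10x1 result

# Add the generated id column used as the join key
result_with_id = single_col_df.withColumn('id', F.monotonically_increasing_id())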
Here is the JOIN routine:
def combine_spark_results(results):
    # Extract first to get going
    # TODO: validations
    resultsDF = results[0]
    i = len(results)

    # Start the joining dance for the rest
    for result in results[1:]:
        print(i, end=" ", flush=True)
        i -= 1
        resultsDF = resultsDF.join(result, 'id', 'outer')

    resultsDF = resultsDF.drop('id')
    return resultsDF
resultsDF = combine_spark_results(aggr_out)
And then the picayune mammoth:
resultsDF.count()
For some reason, running count on the perceivably simple result drives Spark berserk with 1000+ tasks across 14 stages.
As mentioned before, I am working with Snowflake on Spark, which enables pushdown optimization by default. Since Snowflake inserts its plan into Catalyst, it seems relevant to mention this.
Here are the Spark-SF driver details:
Spark version: 2.4.4
Python 3
spark-snowflake_2.11-2.5.2-spark_2.4.jar,
snowflake-jdbc-3.9.1.jar
I also attempted disabling pushdown but suspect it doesn't really apply (take effect).
Is there a better way of doing this? Am I missing out on something obvious?
Thanks in advance.

How do you iterate through distinct values of a column in a large Pyspark Dataframe? .distinct().collect() raises a large task warning

I am trying to iterate through all of the distinct values in column of a large Pyspark Dataframe. When I try to do it using .distinct().collect() it raises a "task too large" warning even if there are only two distinct values.
Warning message:
20/01/13 20:39:01 WARN TaskSetManager: Stage 0 contains a task of very large size (201 KB). The maximum recommended task size is 100 KB.
Here is some sample code:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Basics').getOrCreate()
length = 200000
data = spark.createDataFrame([[float(0) for x in range(3)] for x in range(length)], ['a', 'b', 'c'])
data.select("a").distinct().collect()
# This code produces this warning
How can you iterate through distinct values in a column of a large Pyspark Dataframe without running into memory issues?
As you already know, .collect() is not a best practice, because it's an action that transfers all the data from the executors to the driver. The problem is that with a large dataset the Spark executors send a large amount of serialized data to the driver, and only THEN is the collection of the 2 rows done. You can also take a look at the TaskSetManager, which produces the warning.
At a high level, a workaround for your problem could be to trade memory for disk. You can write your dataframe of distinct values to a single CSV and then read it back line by line with Python or Pandas*:
data.select("a").distinct().coalesce(1).write.csv("temp.csv")
# Specifically, it's a directory with one csv.
With this solution you will not have any problem with memory.
*There are a lot of solutions about how to read a large CSV with Python or Pandas.
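For the read-back step, a minimal sketch of what that could look like (the part-file glob pattern and the chunk size are assumptions about the layout Spark produces):

import glob
import pandas as pd

# coalesce(1).write.csv("temp.csv") creates a directory containing a single part file
part_file = glob.glob("temp.csv/part-*.csv")[0]

# Stream the distinct values back in small chunks instead of collecting them all at once
for chunk in pd.read_csv(part_file, header=None, chunksize=1000):
    for value in chunk[0]:
        pass  # process each distinct value here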

Un-persisting all dataframes in (py)spark

I have a spark application with several points where I would like to persist the current state. This is usually after a large step, or when caching a state that I would like to use multiple times. It appears that when I call cache on my dataframe a second time, a new copy is cached to memory. In my application, this leads to memory issues when scaling up. Even though a given dataframe is a maximum of about 100 MB in my current tests, the cumulative size of the intermediate results grows beyond the allotted memory on the executor. See below for a small example that shows this behavior.
cache_test.py:
from pyspark import SparkContext
from pyspark.sql import HiveContext

spark_context = SparkContext(appName='cache_test')
hive_context = HiveContext(spark_context)

df = (hive_context.read
      .format('com.databricks.spark.csv')
      .load('simple_data.csv')
      )
df.cache()
df.show()
df = df.withColumn('C1+C2', df['C1'] + df['C2'])
df.cache()
df.show()
spark_context.stop()
simple_data.csv:
1,2,3
4,5,6
7,8,9
Looking at the application UI, there is a copy of the original dataframe in addition to the one with the new column. I can remove the original copy by calling df.unpersist() before the withColumn line. Is this the recommended way to remove cached intermediate results (i.e., call unpersist() before every cache())?
Also, is it possible to purge all cached objects? In my application, there are natural breakpoints where I can simply purge all memory and move on to the next file. I would like to do this without creating a new spark application for each input file.
Thank you in advance!
Spark 2.x
You can use Catalog.clearCache:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
...
spark.catalog.clearCache()
Spark 1.x
You can use the SQLContext.clearCache method, which "removes all cached tables from the in-memory cache":
from pyspark.sql import SQLContext
from pyspark import SparkContext
sqlContext = SQLContext.getOrCreate(SparkContext.getOrCreate())
...
sqlContext.clearCache()
We use this quite often
for (id, rdd) in sc._jsc.getPersistentRDDs().items():
    rdd.unpersist()
    print("Unpersisted {} rdd".format(id))
where sc is a sparkContext variable.
When you cache a dataframe, the caching is evaluated lazily: it only takes effect when you perform an action on the dataframe, like count(), show(), etc.
In your case, after the first cache you call show(), which is why that dataframe gets cached in memory. You then perform another transformation on the dataframe to add an additional column, cache the new dataframe, and call show() again, which caches the second dataframe in memory. If there is only enough memory to hold one dataframe, then caching the second one will evict the first from memory, since there is not enough space to hold both.
Thing to keep in mind: you should not cache a dataframe unless you are using it in multiple actions; otherwise it is a performance overhead, as caching is itself a costly operation.
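For completeness, the unpersist-before-recache pattern the question asks about, written against cache_test.py (a sketch; df_with_sum is just an illustrative name for the derived dataframe):

df.cache()
df.show()            # action: df is now materialized in the cache

df_with_sum = df.withColumn('C1+C2', df['C1'] + df['C2'])
df.unpersist()       # drop the old cached copy once it is no longer needed
df_with_sum.cache()
df_with_sum.show()   # action: only the derived dataframe remains cached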
