Spark memory leak when overwriting dataframe variable - python

I'm running into a memory leak in the Spark driver and can't figure out the cause. My guess is that it has something to do with overwriting a DataFrame variable, but I can't find any documentation or other reports of this.
This is on Spark 2.1.0 (PySpark).
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext
from pyspark.sql.types import *
spark = SparkSession \
    .builder \
    .appName("Spark Leak") \
    .getOrCreate()
sc = spark.sparkContext
sqlContext = SQLContext.getOrCreate(sc)
items = 5000000
data = [str(x) for x in range(1,items)]
df = sqlContext.createDataFrame(data, StringType())
print(df.count())
for x in range(0,items):
    sub_df = sqlContext.createDataFrame([str(x)], StringType())
    df = df.subtract(sub_df)
    print(df.count())
This continues to run until the driver runs out of memory and dies.
java.net.SocketException: Connection reset
at java.net.SocketInputStream.read(SocketInputStream.java:210)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at java.net.SocketInputStream.read(SocketInputStream.java:224)
at org.apache.spark.api.python.PythonAccumulatorV2.merge(PythonRDD.scala:917)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1089)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1081)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.updateAccumulators(DAGScheduler.scala:1081)
at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1184)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1717)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1675)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1664)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
17/05/25 16:55:40 ERROR DAGScheduler: Failed to update accumulators for task 13
java.net.SocketException: Broken pipe (Write failed)
at java.net.SocketOutputStream.socketWrite0(Native Method)
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
at java.io.DataOutputStream.flush(DataOutputStream.java:123)
at org.apache.spark.api.python.PythonAccumulatorV2.merge(PythonRDD.scala:915)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1089)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1081)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.updateAccumulators(DAGScheduler.scala:1081)
at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1184)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1717)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1675)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1664)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
...
If anything, I'd think memory use should shrink, since items are being removed from the DataFrame, but that is not the case.
Am I misunderstanding how Spark assigns DataFrames to Python variables?
I've also tried assigning the result of df.subtract to a new temporary variable, unpersisting df, assigning the temporary variable to df, and then unpersisting the temporary variable, but that has the same issue.

The fundamental problem here seems to be understanding what exactly a DataFrame is (this applies to Spark RDDs as well). A local DataFrame object effectively describes a computation which is to be performed when some action is executed on a given object.
As a result, it is a recursive structure which captures all of its dependencies, so the execution plan effectively grows with each iteration. While Spark provides tools, like checkpointing, which can be used to address this problem and cut the lineage, the code in question doesn't make much sense in the first place.
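The point that each derived DataFrame keeps a reference to the plan it came from can be sketched in plain Python. This is only a toy model: ToyPlan, its subtract method, and depth are hypothetical names invented for illustration and are not Spark classes or APIs.

```python
class ToyPlan:
    """Toy stand-in for a lazy plan node (hypothetical, not a Spark class)."""

    def __init__(self, op, parent=None):
        self.op = op
        self.parent = parent  # the plan this node was derived from

    def subtract(self, label):
        # Nothing is computed here; the new node simply wraps the old plan.
        return ToyPlan(f"subtract({label})", parent=self)

    def depth(self):
        # Walk back through every ancestor that is still referenced.
        node, count = self, 0
        while node is not None:
            count += 1
            node = node.parent
        return count


df = ToyPlan("source")
for x in range(1000):
    # Rebinding the name does not discard earlier nodes:
    # the new node still holds a reference to the previous one.
    df = df.subtract(x)

plan_depth = df.depth()
print(plan_depth)
```

In this toy model, overwriting the variable never frees anything, because the whole chain stays reachable from the newest node. That matches the behaviour described above, where the driver holds an ever larger description of the work to do.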
Distributed data structures available in Spark are designed for high-latency, IO-intensive jobs. Parallelizing individual objects and executing millions of Spark jobs on millions of distributed objects just cannot work well.
Furthermore, Spark is not designed for efficient single-item operations. Each subtract is O(N), making the whole process O(N^2) and effectively useless on any large dataset.
While trivial in itself, a correct way to do it would be something like this:
items = 5000000
df1 = spark.range(items).selectExpr("cast(id as string)")
df2 = spark.range(items).selectExpr("cast(id as string)")
df1.subtract(df2)
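The cost argument (one O(N) pass per removed item versus a single bulk operation) can be illustrated without Spark. The sketch below uses plain Python lists, and the helper names remove_one_at_a_time and remove_in_bulk are made up for this example. It only counts how many elements each strategy has to look at.

```python
def remove_one_at_a_time(values, to_remove):
    """Rebuild the collection once per removed item, counting element visits."""
    visits = 0
    remaining = list(values)
    for item in to_remove:
        visits += len(remaining)  # every removal scans what is left
        remaining = [v for v in remaining if v != item]
    return remaining, visits


def remove_in_bulk(values, to_remove):
    """Remove everything in a single pass over the data."""
    drop = set(to_remove)
    visits = len(values)  # one scan in total
    return [v for v in values if v not in drop], visits


values = [str(x) for x in range(1000)]
to_remove = [str(x) for x in range(0, 1000, 2)]

slow_result, slow_visits = remove_one_at_a_time(values, to_remove)
fast_result, fast_visits = remove_in_bulk(values, to_remove)
print(slow_visits, fast_visits)
```

Both strategies return the same rows, but the per-item version touches far more elements. In Spark the gap is wider still, since every iteration also pays job scheduling overhead.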

Related

Asynchronous processing in spark pipeline

I have a local Linux server with 4 cores. I am running a PySpark job on it locally which reads two tables from a database into two DataFrames. I then use these two DataFrames to do some processing and save the resulting DataFrame to Elasticsearch. Below is the code:
import time
from pyspark.sql import SparkSession, SQLContext
def save_to_es(df):
    df.write.format('es').option('es.nodes', 'es_node').option('es.port', some_port_no).option('es.resource', index_name).option('es.mapping', es_mappings).save()
def coreFun():
    # start_time was not defined in the original snippet; assumed to be taken at the start of the run
    start_time = int(time.time())
    spark = SparkSession.builder.master("local[1]").appName('test').getOrCreate()
    spark.catalog.clearCache()
    spark.sparkContext.setLogLevel("ERROR")
    sc = spark.sparkContext
    sqlContext = SQLContext(sc)
    select_sql = """(select * from db."master_table")"""
    df_master = spark.read.format("jdbc").option("url", "jdbcurl").option("dbtable", select_sql).option("user", "username").option("password", "password").option("driver", "database_driver").load()
    select_sql_child = """(select * from db."child_table")"""
    df_child = spark.read.format("jdbc").option("url", "jdbcurl").option("dbtable", select_sql_child).option("user", "username").option("password", "password").option("driver", "database_driver").load()
    merged_df = merged_python_file.merged_function(df_master,df_child,sqlContext)
    logic1_df = logic1_python_file.logic1_function(df_master,sqlContext)
    logic2_df = logic2_python_file.logic2_function(df_master,sqlContext)
    logic3_df = logic3_python_file.logic3_function(df_master,sqlContext)
    logic4_df = logic4_python_file.logic4_function(df_master,sqlContext)
    logic5_df = logic5_python_file.logic5_function(df_master,sqlContext)
    save_to_es(merged_df)
    save_to_es(logic1_df)
    save_to_es(logic2_df)
    save_to_es(logic3_df)
    save_to_es(logic4_df)
    save_to_es(logic5_df)
    end_time = int(time.time())
    print(end_time-start_time)
    sc.stop()
if __name__ == "__main__":
    coreFun()
The processing logic lives in separate Python files, e.g. logic1 in logic1_python_file, and so on. I send df_master to each function, and each returns a processed DataFrame to the driver. I then save that processed DataFrame to Elasticsearch.
It works fine, but everything happens sequentially: merged_df is processed first while the others simply wait, even though they do not depend on the output of merged_function; then logic1 is processed while the rest wait, and so on. This is not an ideal design, considering that the output of one piece of logic does not depend on another.
I am sure asynchronous processing can help me here, but I am not sure how to implement it in my use case. I know I may have to use some kind of queue (JMS, Kafka, etc.) to accomplish this, but I don't have a complete picture.
Please let me know how I can use asynchronous processing here. Any other input that can help improve the performance of the job is welcome.
If, during a single processing step such as merged_python_file.merged_function, only one CPU core is heavily utilized while the others are nearly idle, multiprocessing can speed things up. This can be done with Python's multiprocessing module. For more details, see the answers to How to do parallel programming in Python?
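As a rough sketch of running independent steps side by side, the example below uses multiprocessing.pool.ThreadPool, the thread-based pool that ships in the multiprocessing package. Threads are used here instead of processes on the assumption that the steps mostly wait on Spark or Elasticsearch. The step functions are hypothetical stand-ins that only do arithmetic; in the job above, each would wrap a call such as save_to_es(logic1_python_file.logic1_function(df_master, sqlContext)).

```python
from multiprocessing.pool import ThreadPool


def run_step(step):
    """Run one independent (name, function) step and return its name and result."""
    name, func = step
    return name, func()


# Hypothetical stand-ins for the independent processing steps.
steps = [
    ("merged", lambda: sum(range(10))),
    ("logic1", lambda: sum(range(100))),
    ("logic2", lambda: sum(range(1000))),
]

# Each step is submitted from its own worker thread,
# so no step has to wait for the previous one to finish.
with ThreadPool(processes=3) as pool:
    results = dict(pool.map(run_step, steps))

print(results)
```

Note that the session above is created with master("local[1]"), which gives Spark a single worker thread. Concurrent submission is unlikely to help much unless more cores are made available, for example with local[4].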

toPandas() using pyarrow returns empty dataframe

I have a Spark DataFrame with 5 million rows and 250 columns. When I convert it with toPandas() with "spark.sql.execution.arrow.enabled" set to "true", it returns an empty DataFrame with just the columns.
With pyarrow disabled I get the error below:
Py4JJavaError: An error occurred while calling o124.collectToPython. : java.lang.OutOfMemoryError: GC overhead limit exceeded
Is there any way to perform this operation by increasing some sort of memory allocation?
I couldn't find any online resources for this except https://issues.apache.org/jira/browse/SPARK-28881, which is not that helpful.
This problem is related to memory: to convert a Spark DataFrame to a pandas DataFrame, Spark (via Py4J) has to go through collect, which consumes a lot of memory. I would advise reconfiguring the memory when you create the Spark session. Here is an example:
from pyspark import SparkContext
SparkContext.setSystemProperty('spark.executor.memory', '16g')
sc = SparkContext("local", "stack_over_flow")
Proceed with sc (the Spark context) or with a Spark session, as you like.
If this does not work, it may be a version conflict, so please check these options:
- Set the environment variable ARROW_PRE_0_15_IPC_FORMAT=1 in the environment where you run Python.
- Downgrade to pyarrow < 0.15.0 for now.
If you can share your script, the problem will be clearer.
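A back-of-the-envelope estimate shows why collecting this DataFrame is memory hungry. The row and column counts come from the question. The 8 bytes per value is an assumption (as if every column were a 64-bit number), so this is a rough lower bound rather than a measurement; string columns would take more.

```python
rows = 5_000_000
columns = 250
bytes_per_value = 8  # assumption: every value is stored as a 64-bit number

total_values = rows * columns
total_bytes = total_values * bytes_per_value
total_gib = total_bytes / 2**30

print(f"{total_values:,} values, about {total_gib:.1f} GiB before any overhead")
```

Under that assumption, the raw values alone need several GiB. During collection they exist both in the JVM and in the Python process, which is consistent with the GC overhead error in the question.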

Can't fit dataframe with fbprophet using dask to read the csv into a dataframe

References:
https://examples.dask.org/applications/forecasting-with-prophet.html?highlight=prophet
https://facebook.github.io/prophet/
A few things to note:
I've got a total of 48 GB of RAM
Here are the versions of the libraries I'm using:
Python 3.7.7
dask==2.18.0
fbprophet==0.6
pandas==1.0.3
The only reason I import pandas is for this line: pd.options.mode.chained_assignment = None
This helps with dask erroring when I'm using dask.distributed
So, I have a 21 GB CSV file that I am reading using dask in a Jupyter notebook.
I've tried to read it from my MySQL database table; however, the kernel eventually crashes.
I've tried multiple combinations of my local network of workers, threads, available memory, and available storage_memory, and even tried not using distributed at all. I have also tried chunking with pandas (without the pandas line mentioned above); however, even with chunking, the kernel still crashes.
I can now load the CSV with dask and apply a few transformations, such as setting the index and adding the columns (names) that fbprophet requires, but I am still not able to compute the dataframe with df.compute(), which I think is why I am receiving the error from fbprophet. After I have added the columns y and ds with the appropriate dtypes, I receive the error Truth of Delayed objects is not supported. I think this is because fbprophet expects the dataframe not to be lazy, which is why I'm trying to run compute beforehand. I have also bumped up the RAM on the client to allow it to use the full 48 GB, as I suspected that it might be trying to load the data twice; however, this still failed, so most likely that isn't what is causing the problem.
Alongside this, fbprophet is also mentioned in the dask documentation for applying machine learning to dataframes, so I really don't understand why this isn't working. I've also tried modin with ray and with dask, with basically the same result.
Another question... regarding memory usage
distributed.worker - WARNING - Memory use is high but worker has no data to store to disk. Perhaps some other process is leaking memory? Process memory: 32.35 GB -- Worker memory limit: 25.00 GB
I am getting this error when assigning the client, reading the csv file, and applying operations/transformations to the dataframe, however the allotted size is larger than the csv file itself, so this confuses me...
What I have done to try and solve this myself:
- Googling of course, did not find anything :-/
- Asking a discord help channel, on multiple occasions
- Asking IIRC help channel, on multiple occasions
Anyways, would really appreciate any help on this problem!!!
Thank you in advance :)
MCVE
from dask.distributed import Client
import dask.dataframe as dd
import pandas as pd
from fbprophet import Prophet
pd.options.mode.chained_assignment = None
client = Client(n_workers=2, threads_per_worker=4, processes=False, memory_limit='4GB')
csv_file = 'provide_your_own_csv_file_here.csv'
df = dd.read_csv(csv_file, parse_dates=['Time (UTC)'])
df = df.set_index('Time (UTC)')
df['y'] = df[['a','b']].mean(axis=1)
m = Prophet(daily_seasonality=True)
m.fit(df)
# ERROR: Truth of Delayed objects is not supported
Unfortunately Prophet doesn't support Dask dataframes today.
The example that you refer to shows using Dask to accelerate Prophet's fitting on Pandas dataframes. Dask Dataframe is only one way that people use Dask.
As already suggested, one approach is to use dask.delayed with a pandas DataFrame, and skip dask.dataframe.
You could use a simplified version of the load-clean-analyze pipeline shown for custom computations using Dask.
Here is one possible approach based on this type of custom pipeline, using a small dataset (to create a MCVE) - every step in the pipeline will be delayed
Imports
import numpy as np
import pandas as pd
from dask import delayed
from dask.distributed import Client
from fbprophet import Prophet
Generate some data in a .csv, with column names Time (UTC), a and b
def generate_csv(nrows, fname):
    df = pd.DataFrame(np.random.rand(nrows, 2), columns=["a", "b"])
    df["Time (UTC)"] = pd.date_range(start="1850-01-01", periods=nrows)
    df.to_csv(fname, index=False)
First write the load function from the pipeline, to load the .csv with Pandas, and delay its execution using the dask.delayed decorator
- it might be good to use read_csv with nrows to see how the pipeline performs on a subset of the data, rather than loading it all
- this will return a dask.delayed object and not a pandas.DataFrame
@delayed
def load_data(fname, nrows=None):
    return pd.read_csv(fname, nrows=nrows)
Now create the process function, to process data using pandas, again delayed since its input is a dask.delayed object and not a pandas.DataFrame
@delayed
def process_data(df):
    df = df.rename(columns={"Time (UTC)": "ds"})
    df["y"] = df[["a", "b"]].mean(axis=1)
    return df
Last function - this one will train fbprophet on the data (loaded from the .csv and processed, but delayed) to make a forecast. This analyze function is also delayed, since one of its inputs is a dask.delayed object
@delayed
def analyze(df, horizon):
    m = Prophet(daily_seasonality=True)
    m.fit(df)
    future = m.make_future_dataframe(periods=horizon)
    forecast = m.predict(future)
    return forecast
Run the pipeline (if running from a Python script, requires __name__ == "__main__")
- the output of the pipeline (a forecast by fbprophet) is stored in a variable result, which is delayed
- when this output is computed, this will generate a pandas.DataFrame (corresponding to the output of a forecast by fbprophet), so it can be evaluated using result.compute()
if __name__ == "__main__":
    horizon = 8
    num_rows_data = 40
    num_rows_to_load = 35
    csv_fname = "my_file.csv"
    generate_csv(num_rows_data, csv_fname)
    client = Client() # modify this as required
    df = load_data(csv_fname, nrows=num_rows_to_load)
    df = process_data(df)
    result = analyze(df, horizon)
    forecast = result.compute()
    client.close()
    assert len(forecast) == num_rows_to_load + horizon
    print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].head())
Output
          ds      yhat  yhat_lower  yhat_upper
0 1850-01-01  0.330649    0.095788    0.573378
1 1850-01-02  0.493025    0.266692    0.724632
2 1850-01-03  0.573344    0.348953    0.822692
3 1850-01-04  0.491388    0.246458    0.712400
4 1850-01-05  0.307939    0.066030    0.548981

PySpark's DataFrame.show() runs slow

Newbie here. I read a table (about 2 million rows) as a Spark DataFrame via JDBC from MySQL in PySpark and am trying to show the top 10 rows:
from pyspark.sql import SparkSession
spark_session = SparkSession.builder.master("local[4]").appName("test_log_processing").getOrCreate()
url = "jdbc:mysql://localhost:3306"
table = "test.fakelog"
properties = {"user": "myUser", "password": "********"}
df = spark_session.read.jdbc(url, table, properties=properties)
df.cache()
df.show(10) # can't get the printed results, and runs pretty slow and consumes 90%+ CPU resources
spark_session.stop()
And here's the console log:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
[Stage 0:> (0 + 1) / 1]
My educational background is statistics and I only recently started learning Spark, so I have no idea what's going on behind the code (for a smaller dataset, this works well). How should I fix this problem? Or what more should I know about Spark?
Since you call spark.read.jdbc on a table, Spark will try to collect the whole table from the database into Spark. After that, Spark caches the data and prints 10 results from the cache. If you run the code below, you will notice the difference.
spark_session = SparkSession.builder.master("local[4]").appName("test_log_processing").getOrCreate()
url = "jdbc:mysql://localhost:3306"
table = "(SELECT * FROM test.fakelog LIMIT 10) temp"
properties = {"user": "myUser", "password": "********"}
df = spark_session.read.jdbc(url, table, properties=properties)
df.cache()
df.show()
spark_session.stop()
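The difference between pulling the whole table and letting the database apply the LIMIT can be seen without Spark or MySQL. The sketch below uses Python's built-in sqlite3 module as a stand-in database. The table contents are made up, and the subquery alias first_rows is arbitrary.

```python
import sqlite3

# In-memory stand-in for the MySQL table (made-up data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fakelog (id INTEGER PRIMARY KEY, msg TEXT)")
conn.executemany(
    "INSERT INTO fakelog (msg) VALUES (?)",
    ((f"line {i}",) for i in range(100_000)),
)

# Approach 1: fetch every row, then keep ten on the client side.
all_rows = conn.execute("SELECT * FROM fakelog ORDER BY id").fetchall()
first_ten_client_side = all_rows[:10]

# Approach 2: wrap the query so the database returns only ten rows.
first_ten_db_side = conn.execute(
    "SELECT * FROM (SELECT * FROM fakelog ORDER BY id LIMIT 10) AS first_rows"
).fetchall()

print(len(all_rows), len(first_ten_db_side))
conn.close()
```

Both approaches end up with the same ten rows, but the first one moves the entire table across the connection to get them. The subquery passed as the table name in the Spark snippet avoids exactly that.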
Maybe your in-memory cache is filling up; the default for cache used to be memory only (in older Spark versions).
- Instead of cache, can you try df.persist(StorageLevel.MEMORY_AND_DISK)? It will spill to disk when memory gets too full.
- Try .take(10); it returns a collection of rows. It might not be faster, but it's worth a try.
- Try df.coalesce(50).persist(StorageLevel.MEMORY_AND_DISK); it works well without a shuffle if you have an over-partitioned dataframe.
If none of these work, it probably means your computing cluster is incapable of handling this load, and you might need to scale out.

Spark 1.4 increase maxResultSize memory

I am using Spark 1.4 for my research and struggling with the memory settings. My machine has 16 GB of memory, so that should not be a problem, since my file is only 300 MB. However, when I try to convert a Spark RDD to a pandas DataFrame using the toPandas() function, I receive the following error:
serialized results of 9 tasks (1096.9 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)
I tried to fix this changing the spark-config file and still getting the same error. I've heard that this is a problem with spark 1.4 and wondering if you know how to solve this. Any help is much appreciated.
You can set spark.driver.maxResultSize parameter in the SparkConf object:
from pyspark import SparkConf, SparkContext
# In Jupyter you have to stop the current context first
sc.stop()
# Create new config
conf = (SparkConf()
        .set("spark.driver.maxResultSize", "2g"))
# Create new context
sc = SparkContext(conf=conf)
You should probably create a new SQLContext as well:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
From the command line, such as with pyspark, --conf spark.driver.maxResultSize=3g can also be used to increase the max result size.
Tuning spark.driver.maxResultSize to suit the running environment is good practice. However, it is not the solution to your problem, as the amount of data may change over time. As @Zia-Kayani mentioned, it is better to collect data wisely. So if you have a DataFrame df, you can call df.rdd and do all the work on the cluster, not in the driver. However, if you need to collect the data, I would suggest:
- Do not turn on spark.sql.parquet.binaryAsString. String objects take more space.
- Use spark.rdd.compress to compress RDDs when you collect them.
- Try to collect it using pagination (code in Scala, from another answer: Scala: How to get a range of rows in a dataframe).
var remaining = df
var count = remaining.count()
val limit = 50
while (count > 0) {
  val page = remaining.limit(limit)
  page.show(limit) // prints 50 rows, then the next 50, and so on
  remaining = remaining.except(page)
  count -= limit
}
It looks like you are collecting the RDD, so all of the data is pulled to the driver node; that is why you are facing this issue.
You should avoid collecting data from an RDD unless it is required; if it is necessary, then set spark.driver.maxResultSize. There are two ways of defining this variable:
1 - create the Spark config and set this variable as
conf.set("spark.driver.maxResultSize", "3g")
2 - or set this variable
in the spark-defaults.conf file present in the conf folder of Spark, like
spark.driver.maxResultSize 3g and restart Spark.
When starting the job or terminal, you can use
--conf spark.driver.maxResultSize="0"
to remove the bottleneck (a value of 0 means unlimited).
There is also a Spark bug
https://issues.apache.org/jira/browse/SPARK-12837
that gives the same error
serialized results of X tasks (Y MB) is bigger than spark.driver.maxResultSize
even though you may not be pulling data to the driver explicitly.
SPARK-12837 addresses a Spark bug in which, prior to Spark 2, accumulators/broadcast variables were unnecessarily pulled to the driver, causing this problem.
You can set spark.driver.maxResultSize to 2GB when you start the pyspark shell:
pyspark --conf "spark.driver.maxResultSize=2g"
This allows up to 2 GB for spark.driver.maxResultSize.
