Out of memory when trying to persist a dataframe - python

I am facing an out of memory error when trying to persist a dataframe and I don't really understand why. I have a dataframe of roughly 20 GB with 2.5 million rows and around 20 columns. After filtering this dataframe, I have 4 columns and 0.5 million rows.
Now my problem is that when I persist the filtered dataframe I get an out of memory error ("exceeds 25.4 GB of 20 GB physical memory used"). I have tried persisting at different storage levels:
from pyspark import StorageLevel

df = spark.read.parquet(path)  # ~20 GB
df_filter = df.select('a', 'b', 'c', 'd').where(df.a == something)  # a few GB
df_filter.persist(StorageLevel.MEMORY_AND_DISK)
df_filter.count()  # action to trigger the persist
My cluster has 8 nodes with 30 GB of memory each.
Do you have any idea where that OOM could come from?

Just some suggestions to help identify the root cause ...
You probably have either (or a combination) of ...
skewed source-data partition split sizes, which are tough to deal with and cause garbage-collection pressure, OOM, etc. (the methods below have helped me, but there may be better approaches per use case; the per-partition row count sketched after this snippet makes the skew visible)
# to check the number of partitions
df_filter.rdd.getNumPartitions()
# to repartition (**causes a shuffle**) to increase parallelism and mitigate data skew
df_filter.repartition(...) # monitor/debug performance in the Spark UI after setting
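A quick sanity check (a sketch, not part of the original answer) is to count the rows in each partition; if the maximum is far larger than the typical value, you are looking at skew:
# count rows per partition without pulling the data itself to the driver
sizes = df_filter.rdd.mapPartitions(lambda it: [sum(1 for _ in it)]).collect()
print("partitions:", len(sizes), "min:", min(sizes), "max:", max(sizes))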
too few or too many executors/cores, or too little or too much executor memory, set in the config (a session-level example follows the snippet below)
# check via
spark.sparkContext.getConf().getAll()
# these are the ones you want to watch out for
'''
--num-executors
--executor-cores
--executor-memory
'''
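If those values look off, they can be set when building the session (or with the equivalent spark-submit flags above) before any job runs. The numbers below are placeholders, not a recommendation:
from pyspark.sql import SparkSession

# placeholder values -- tune per cluster; these mirror --num-executors,
# --executor-cores and --executor-memory
spark = (SparkSession.builder
         .appName("persist-debug")
         .config("spark.executor.instances", "8")
         .config("spark.executor.cores", "4")
         .config("spark.executor.memory", "20g")
         .getOrCreate())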
wide-transformation shuffles producing too few or too many output partitions => try the general debug checks below to see which transformations will be triggered when persisting and how many output partitions they write to disk (a sketch for adjusting the shuffle-partition count follows the snippet)
# debug the directed acyclic graph [DAG]
df_filter.explain() # also "babysit" the job in the Spark UI to examine per-node/per-partition performance while persisting
# check output partitions if shuffle occurs
spark.conf.get("spark.sql.shuffle.partitions")

Related

Spark - Skewed Input dataframe

I am working with a heavily nested, non-splittable JSON input dataset housed in S3.
The files vary a lot in size - the smallest is 10 KB while others are up to 300 MB.
When reading the files using the code below, just doing a simple repartition to the desired number of partitions leads to straggling tasks - most tasks finish within seconds, but one lasts for a couple of hours and then runs into memory issues (missing heartbeats, heap space, etc.).
I repartition in an attempt to randomize the partition-to-file mapping, since Spark may be reading files in sequence, and files within the same directory tend to have the same nature - all large, all small, etc.
df = spark.read.json('s3://my/parent/directory')
df = df.repartition(396)  # repartition returns a new dataframe, so reassign it
# Settings (a few):
# default parallelism = 396
# total number of cores = 400
What I tried:
I figured that the input partition scheme (S3 partitions, not Spark ones, i.e. the folder hierarchy) might be leading to this skewed-partitions problem, where some S3 folders (technically 'prefixes') have just one file while others have thousands, so I transformed the input to a flattened directory structure using a hash code, where each folder has just one file:
Earlier:
/parent1
    /file1
    /file2
    ...
    /file1000
/parent2
    /file1
Now:
hashcode=FEFRE#$#$$#FE/parent1/file1
hashcode=#$#$#Cdvfvf##/parent1/file1
But it didn't have any effect.
I have tried with really large clusters too, thinking that even if there is input skew, that much memory should be able to handle the larger files. But I still get the straggling tasks.
When I check the number of files assigned to each partition (each file becomes a single row in the dataframe due to its nested, unsplittable nature), I see between 2 and 32 files per partition. Is it because Spark assigns files to partitions based on spark.sql.files.maxPartitionBytes (illustrated in the snippet below) - probably assigning only two files to a partition when the files are huge, and many more when the files are small?
Any recommendations to make the job work properly and distribute the tasks uniformly, given that the size of the input files cannot be changed due to the nature of the input?
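For reference (not something the original post tried), the packing behaviour described above is governed by two read-time settings; lowering the cap means fewer small files share a partition with a huge one, though each non-splittable .json.gz file still goes to exactly one task. The values below are illustrative only:
print(spark.conf.get("spark.sql.files.maxPartitionBytes"))  # default is 128 MB
spark.conf.set("spark.sql.files.maxPartitionBytes", str(32 * 1024 * 1024))
spark.conf.set("spark.sql.files.openCostInBytes", str(4 * 1024 * 1024))  # per-file padding used when packing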
EDIT: Code added for latest trial per Abdennacer's comment
This is the entire code.
Results:
The job gets stuck in Running - even though in the worker logs I see the task finished. The driver log has the error 'can not increase buffer size' - I don't know what is causing the 2 GB buffer issue, since I am not issuing any 'collect' or similar statement.
The configuration for this job is as below, to ensure each executor has huge memory per task.
Driver/Executor cores: 2
Driver/Executor memory: 198 GB
# sample s3 source path:
s3://masked-bucket-urcx-qaxd-dccf/gp/hash=8e71405b-d3c8-473f-b3ec-64fa6515e135/group=34/db_name=db01/2022-12-15_db01.json.gz
#####################
# Boilerplate
#####################
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('query-analysis').getOrCreate()
#####################
# Queries Dataframe
#####################
dfq = spark.read.json(S3_STL_RAW, primitivesAsString="true")
dfq.write.partitionBy('group').mode('overwrite').parquet(S3_DEST_PATH)
Great job flattening the files to increase read speed. Prefixes, as you seem to understand, are related to buckets, and bucket read speed is related to the number of files under each prefix and their size. The approach you took will make reads faster than your original strategy, but it will not help you with skew in the data itself.
One thing you might consider is that your raw data and working data do not need to be the same set of files. There is a strategy of landing data and then pre-processing it for performance.
That is to say, keep the raw data in the format you have now, then make a copy of it in a more convenient format for regular queries. (Parquet is the best choice for working with S3.)
Land data in a 'landing zone'.
As needed, process the data stored in the landing zone into a convenient, splittable format for querying (a 'pre-processed' folder).
Once your raw data is processed, move it to a 'processed' folder. (Use your existing flat folder structure.) Keeping this processed raw data is important should you need to rebuild the table or make changes to the table format.
Create a view that is a union of the data in the 'landing zone' and the 'pre-processed' folder. This gives you a performant table with up-to-date data.
If you are using the latest S3 you should get consistent reads, which ensure you are querying all of the data. In the past S3 was only eventually consistent, meaning you might miss some data while it was in transit; this is supposedly fixed in recent versions of S3. Run this 'processing' as often as needed and you should have a performant table to run large queries on.
S3 was designed as cheap long-term storage. It's not made to perform quickly, but it has been getting better over time.
It should be noted this will solve skew on read but won't solve skew in the query itself. To solve the query portion you can enable Adaptive Query Execution in your config (this adaptively adds an extra shuffle to make queries run faster; a session-level sketch follows the setting below).
spark.sql.adaptive.enabled=true
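In PySpark that can be set on the session; the skew-join flags below are available in Spark 3.x and are an extra suggestion on top of the answer, with the default threshold shown:
# Adaptive Query Execution: skewJoin splits oversized shuffle partitions at join time
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256MB")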

How spark distributes partitions to executors

I have a performance issue, and after analyzing the Spark web UI I found what seems to be data skewness:
Initially I thought the partitions were not evenly distributed, so I performed an analysis of the row count per partition, but it seems normal (no outliers):
how to manually run pyspark's partitioning function for debugging
But the problem persists and I see there is one executor processing most of the data:
So the hypothesis now is that partitions are not evenly distributed across executors. The question is: how does Spark distribute partitions to executors, and how can I change that to solve my skewness problem?
The code is very simple:
hive_query = """SELECT ... FROM <multiple joined hive tables>"""
df = sqlContext.sql(hive_query).cache()
print(df.count())
Update: after posting this question I performed further analysis and found that 3 tables cause this; if they are removed, the data is evenly distributed across the executors and performance improves. So I added the Spark SQL hint /*+ BROADCASTJOIN(<table_name>) */ and it worked; performance is much better now. But the question remains:
why do these tables (including a small 6-row table) cause this uneven distribution across executors when added to the query?
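For reference, the same hint can also be expressed with the DataFrame API; big_df, small_df and join_key below are stand-ins, not names from the question:
from pyspark.sql.functions import broadcast

# equivalent to the /*+ BROADCASTJOIN(t) */ SQL hint: ship the small table to
# every executor so the large side is not shuffled for the join
result = big_df.join(broadcast(small_df), "join_key")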
repartition() will not give you an evenly distributed dataset, as Spark internally uses a HashPartitioner. To spread your data evenly across all partitions, a custom Partitioner is, in my view, the way to go.
In this case you need to extend the org.apache.spark.Partitioner class and use your own logic instead of HashPartitioner. To achieve this you need to convert the RDD to a pair RDD (a PySpark sketch follows this answer).
The blog post below should help in your case:
https://blog.clairvoyantsoft.com/custom-partitioning-spark-datasets-25cbd4e2d818
Thanks
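The answer above describes the Scala API; in PySpark the closest equivalent (a sketch under that assumption, with "a" standing in for the real key column) is to convert to a key-value RDD and pass your own partition function to partitionBy:
from pyspark.rdd import portable_hash

NUM_PARTITIONS = 200

def my_partitioner(key):
    # portable_hash is stable across executors, unlike Python's built-in hash;
    # replace this with your own key -> partition-index logic
    return portable_hash(key)

pair_rdd = df.rdd.map(lambda row: (row["a"], row))
custom_parts = pair_rdd.partitionBy(NUM_PARTITIONS, my_partitioner)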
When you are reading data from HDFS, the number of partitions depends on the number of blocks you are reading. From the images attached, it looks like your data is not distributed evenly across the cluster. Try repartitioning your data and tweaking the number of cores and executors.
If you are repartitioning your data and the hash partitioner returns some values much more often than others, that can lead to data skew.
If this happens after performing a join, then your data itself is skewed.

Spark Repartition Executors

I have a data source of around 100 GB, and I'm trying to write it out partitioned by a date column.
In order to avoid small chunks inside the partitions, I've added a repartition(5) to have at most 5 files inside each partition:
df.repartition(5).write.orc("path")
My problem is that only 5 executors out of the 30 I'm allocating are actually running. In the end I get what I want (5 files inside each partition), but since only 5 executors are running, the execution time is extremely high.
Do you have any suggestions on how I can make it faster?
I fixed it using simply :
df.repartition($"dateColumn").write.partitionBy("dateColumn").orc(path)
And allocating the same number of executors as the number of partitions I'll have in the output.
Thanks all
You can use repartition along with partitionBy to resolve the issue.
There are two ways to solve this.
Suppose you need to partition by dateColumn
df.repartition(5, 'dateColumn').write.partitionBy('dateColumn').parquet(path)
In this case the number of output files will be at most 5 * distinct(dateColumn), and each date will contain at most 5 files.
Another approach is to repartition your data to 3 times the number of executors, then use maxRecordsPerFile to save the data. This creates roughly equally sized files, but you lose control over the number of files created.
df.repartition(60).write.option('maxRecordsPerFile',200000).partitionBy('dateColumn').parquet(path)
Spark can run 1 concurrent task for every partition of an RDD or data frame (up to the number of cores in the cluster). If your cluster has 30 cores, you should have at least 30 partitions. On the other hand, a single partition typically shouldn’t contain more than 128MB and a single shuffle block cannot be larger than 2GB (see SPARK-6235).
Since you want to reduce your execution time, it is better to increase the number of partitions while the work is running and only reduce them at the very end of the job for the output.
For a better (more even) distribution of your data among partitions, it is better to use the hash partitioner (a sketch of both steps follows).
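Putting those rules of thumb into code (a sketch only; the 3x multiplier is a common heuristic, not something prescribed by the answer):
cores = spark.sparkContext.defaultParallelism     # roughly the cluster's core count
df_work = df.repartition(cores * 3)               # keep every core busy during the heavy work
# ... transformations ...
df_work.coalesce(cores).write.partitionBy("dateColumn").orc("path")  # fewer, larger output files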

Repartitioning a pyspark dataframe fails and how to avoid the initial partition size

I'm trying to tune the performance of Spark by partitioning a Spark dataframe. Here is the code:
file_path1 = spark.read.parquet(*paths[:15])

df = file_path1.select(columns) \
    .where((func.col("organization") == organization))

df = df.repartition(10)
# execute an action just to make spark execute the repartition step
df.first()
During the execution of first() I check the job stages in the Spark UI and here is what I find:
Why is there no repartition step in the stage?
Why is there also a stage 8? I only requested one action, first(). Is it because of the shuffle caused by the repartition?
Is there a way to change the partitioning of the parquet files without having to resort to such operations? Initially, when I read the df, you can see that it's split over 43k partitions, which is really a lot (compared to its size when I save it to a csv file: 4 MB with 13k rows) and creates problems in further steps; that's why I wanted to repartition it.
Should I use cache() after repartition, i.e. df = df.repartition(10).cache()? When I executed df.first() the second time, I also got a scheduled stage with 43k partitions, even though df.rdd.getNumPartitions() returned 10.
EDIT: the number of partitions is just something to experiment with; my questions are aimed at understanding how to do the repartition correctly.
Note: initially the Dataframe is read from a selection of parquet files in Hadoop.
I already read this as reference How does Spark partition(ing) work on files in HDFS?
Whenever there is shuffling, there is a new stage, and repartition causes shuffling; that's why you have two stages.
Caching is used when you'll use the dataframe multiple times, to avoid reading it twice.
Use coalesce instead of repartition. It causes less shuffling, since it only reduces the number of partitions by merging existing ones rather than doing a full shuffle (see the sketch below).
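The difference between the two, sketched here (not part of the original answer): repartition does a full shuffle and evens out partition sizes, while coalesce only merges existing partitions, which is cheaper but can leave them uneven:
df_even = df.repartition(10)   # full shuffle: 10 roughly equal partitions
df_cheap = df.coalesce(10)     # no full shuffle: existing partitions merged down to 10, sizes may vary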

Why am I not using all of the memory in spark?

I am wondering why my jobs are running very slowly, and it appears to be because I am not using all of the memory available in PySpark.
When I go to the Spark UI and click on "Executors", I see the following memory used:
And when I look at my executors I see the following table:
I am wondering why the "Used" memory is so small compared to the "Total" memory. What can I do to use as much of the memory as possible?
Other information:
I have a small broadcast table, but it is only 1 MB in size. It should be replicated once per executor, so I do not imagine it would affect this much.
I am using Spark managed by YARN.
I am using Spark 1.6.1.
Config settings are:
spark.executor.memory=45g
spark.executor.cores=2
spark.executor.instances=4
spark.sql.broadcastTimeout = 9000
spark.memory.fraction = 0.6
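For context (an aside, not from the original question): with Spark 1.6's unified memory model, the storage/execution pool each executor advertises in the UI is roughly (heap - 300 MB reserved) * spark.memory.fraction, so the "Total" will be well below the full 45 GB:
# back-of-the-envelope for the per-executor memory shown in the UI,
# assuming the unified memory model (Spark 1.6+)
heap_gb = 45.0
reserved_gb = 0.3          # ~300 MB reserved by Spark
memory_fraction = 0.6      # spark.memory.fraction from the config above
unified_gb = (heap_gb - reserved_gb) * memory_fraction
print(round(unified_gb, 1))  # ~26.8 GB available for storage + execution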
The dataset I am processing has 8397 rows and 80 partitions. I am not doing any shuffle operations aside from the initial repartitioning to 80 partitions.
It is the part where I am adding columns that becomes slow. All of the parts before that seem reasonably fast, but when I try to add a column using a custom UDF (with withColumn) it slows down at that point.
There is a similar question here:
How can I tell if my spark job is progressing? But my question is more pointed - why does the "Memory Used" figure show such a low number?
Thanks.
