I'm currently working on a project and I'm having a hard time understanding how Pandas UDFs in PySpark work.
I have a Spark cluster with one master node (8 cores, 64GB) and two workers (16 cores and 112GB each). My dataset is quite large and divided into seven principal partitions, each consisting of ~78M lines. The dataset has 70 columns.
I defined a Pandas UDF to do some operations on the dataset that can only be done using Python, on a Pandas dataframe.
The Pandas UDF is defined this way:
from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def operation(pdf):
    # Some operations
    return pdf

spark.table("my_dataset").groupBy(partition_cols).apply(operation)
There is absolutely no way to get the Pandas UDF to work; it crashes before even doing the operations. I suspect there is an OOM error somewhere. The code above runs for a few minutes before crashing with an error stating that the connection has been reset.
However, if I call .toPandas() after filtering on one partition and then display it, it runs fine with no error. The error seems to happen only when using a Pandas UDF.
I fail to understand how it works. Does Spark try to convert one whole partition at once (78M lines)? If so, what memory does it use? The driver memory? The executor's? If it's on the driver's side, is all Python code executed on it?
The cluster is configured with the following:
SPARK_WORKER_CORES=2
SPARK_WORKER_MEMORY=64g
spark.executor.cores 2
spark.executor.memory 30g (to allow memory for the python instance)
spark.driver.memory 43g
Am I missing something, or is there just no way to run 78M lines through a Pandas UDF?
Does Spark try to convert one whole partition at once (78M lines)?
That's exactly what happens. Spark 3.0 adds support for chunked UDFs, which operate on iterators of Pandas DataFrames or Series, but if your operations really can only be done in Python on a whole Pandas dataframe per group, these might not be the right choice for you.
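If the logic can be expressed chunk by chunk, a minimal sketch of the iterator form (mapInPandas, Spark 3.0+; schema and table name as in the question) would look like this:
from typing import Iterator
import pandas as pd

# Sketch only: each partition arrives as an iterator of small chunks instead
# of one huge Pandas DataFrame, so peak memory per Python worker stays bounded.
def operation_chunked(batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    for pdf in batches:          # pdf is one chunk, not the whole group
        # ... some operations ...
        yield pdf

spark.table("my_dataset").mapInPandas(operation_chunked, schema)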
If so, what memory does it use? The driver memory? The executor's?
Each partition is processed locally on the respective executor, and data is passed to and from the Python worker using Arrow streaming.
Am I missing something, or is there just no way to run 78M lines through a Pandas UDF?
As long as you have enough memory to handle the Arrow input, the output (especially if data is copied), auxiliary data structures, and the JVM overhead, it should handle large datasets just fine.
But on such a tiny cluster, you'd be better off partitioning the output and reading the data directly with Pandas, without using Spark at all. That way you'll be able to use all the available resources (i.e. > 100GB per interpreter) for data processing instead of wasting them on secondary tasks (while having only 16GB minus overhead per interpreter).
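A hedged sketch of that suggestion (the output path and some_pandas_logic are placeholders; partition_cols is the grouping columns from the question): write the dataset partitioned by the grouping columns, then loop over the resulting directories with plain Pandas.
import glob
import pandas as pd

# Write one directory per group, then process each group in a single
# Python process that can use the whole machine's memory.
(spark.table("my_dataset")
      .write
      .partitionBy(*partition_cols)
      .parquet("/data/my_dataset_by_group"))   # placeholder output path

for path in glob.glob("/data/my_dataset_by_group/*/"):
    pdf = pd.read_parquet(path)                # one group at a time
    pdf = some_pandas_logic(pdf)               # the pure-Python/Pandas logic (undecorated)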
To answer the general question about using a Pandas UDF on a large pyspark dataframe:
If you're getting out-of-memory errors such as
java.lang.OutOfMemoryError: GC overhead limit exceeded or java.lang.OutOfMemoryError: Java heap space, and increasing memory limits hasn't worked, ensure that PyArrow is enabled. It is disabled by default.
In pyspark, you can enable it using:
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
Related
I'm reading in data using this:
ddf1 = dd.read_sql_table('mytable', conn_string, index_col='id', npartitions=8)
Of course, this runs instantaneously due to lazy computation. This table has several hundred million rows.
Next, I want to filter this Dask dataframe:
ddf2 = ddf1.query('some_col == "converted"')
Finally, I want to convert this to a Pandas dataframe. The result should only be about 8000 rows:
ddf3 = ddf2.compute()
However, this is taking very long (~1 hour). Can I get any advice on how to substantially speed this up? I've tried using .compute(scheduler='threads'), changing up the number of partitions, but none have worked so far. What am I doing wrong?
Firstly, you may be able to use SQLAlchemy expression syntax to encode your filter clause in the query and do the filtering server-side. If data transfer is your bottleneck, then that is your best solution, especially if the filter column is indexed.
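For illustration only, a sketch of pushing the filter into SQL; the exact entry point depends on your dask and SQLAlchemy versions (newer dask exposes read_sql_query, older read_sql_table also accepts a SQLAlchemy expression), and the table reflection below is an assumption.
import sqlalchemy as sa
import dask.dataframe as dd

engine = sa.create_engine(conn_string)
mytable = sa.Table('mytable', sa.MetaData(), autoload_with=engine)

# The database does the filtering (ideally via an index on some_col),
# so only the ~8000 matching rows are transferred.
query = sa.select(mytable).where(mytable.c.some_col == 'converted')

ddf = dd.read_sql_query(query, conn_string, index_col='id', npartitions=8)
df = ddf.compute()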
Depending on your DB backend, sqlalchemy probably does not release the GIL, so your partitions cannot run in parallel in threads. All you are getting is contention between the threads and extra overhead. You should use the distributed scheduler with processes.
Of course, please look at your CPU and memory usage; with the distributed scheduler, you also have access to the diagnostic dashboard. You should also be concerned with how big each partition will be in memory.
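As a sketch (worker counts are placeholders), switching to process-based workers with the distributed scheduler looks roughly like this:
from dask.distributed import Client, LocalCluster

# One thread per worker process, so GIL-holding driver code in one
# partition read does not block the others.
cluster = LocalCluster(n_workers=8, threads_per_worker=1)
client = Client(cluster)

df = ddf2.compute()            # now scheduled on the process-based workers
print(client.dashboard_link)   # the diagnostic dashboard mentioned above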
I'm trying to tune the performance of Spark by partitioning a Spark dataframe. Here is the code:
file_path1 = spark.read.parquet(*paths[:15])
df = file_path1.select(columns) \
.where((func.col("organization") == organization))
df = df.repartition(10)
#execute an action just to make spark execute the repartition step
df.first()
During the execution of first() I check the job stages in the Spark UI, and here is what I find:
Why is there no repartition step in the stage?
Why is there also a stage 8? I only requested one action, first(). Is it because of the shuffle caused by the repartition?
Is there a way to change the partitioning of the parquet files without having to resort to such operations? When I initially read the df, you can see that it's partitioned into 43k partitions, which is really a lot (compared to its size when I save it to a csv file: 4 MB with 13k rows) and creates problems in further steps; that's why I wanted to repartition it.
Should I use cache() after repartition? df = df.repartition(10).cache()? When I executed df.first() the second time, I also got a scheduled stage with 43k partitions, despite df.rdd.getNumPartitions() returning 10.
EDIT: the number of partitions was just for testing; my questions are meant to help me understand how to repartition correctly.
Note: initially the Dataframe is read from a selection of parquet files in Hadoop.
I already read this as a reference: How does Spark partition(ing) work on files in HDFS?
Whenever there is shuffling, there is a new stage, and repartition causes shuffling; that's why you have two stages.
Caching is used when you'll use the dataframe multiple times, to avoid reading it twice.
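To make the caching point concrete, a small sketch: cache after the repartition and materialize it with an action, so later actions reuse the 10 cached partitions instead of re-planning from the ~43k input partitions.
df = df.repartition(10).cache()
df.count()                     # triggers the shuffle and fills the cache
df.rdd.getNumPartitions()      # 10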
Use coalesce instead of repartition. It causes less shuffling, since it only reduces the number of partitions by merging existing ones rather than performing a full shuffle.
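For comparison, a minimal sketch of the two calls:
df_coalesced = df.coalesce(10)       # narrow dependency: merges existing partitions, no full shuffle
df_repartitioned = df.repartition(10)  # full shuffle: evenly sized partitions, but an extra stage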
I am wondering why my jobs are running very slowly, and it appears to be because I am not using all of the memory available in PySpark.
When I go to the Spark UI and click on "Executors", I see the following memory used:
And when I look at my executors I see the following table:
I am wondering why the "Used" memory is so small compared to the "Total" memory. What can I do to use as much of the memory as possible?
Other information:
I have a small broadcasted table, but it is only 1MB in size. It should be replicated once per each executor, so I do not imagine it would affect this that much.
I am using spark managed by yarn
I am using spark 1.6.1
Config settings are:
spark.executor.memory=45g
spark.executor.cores=2
spark.executor.instances=4
spark.sql.broadcastTimeout = 9000
spark.memory.fraction = 0.6
The dataset I am processing has 8397 rows, and 80 partitions. I am not doing any shuffle operations aside from the repartitioning initially to 80 partitions.
It is the part where I am adding columns that becomes slow. All of the parts before that seem reasonably fast, but when I try to add a column using a custom UDF (with withColumn), it seems to slow down in that part.
There is a similar question here:
How can I tell if my spark job is progressing? But my question is more pointed: why does the "Memory Used" show such a low number?
Thanks.
Problem:
Large number of files. Each file is 10MB and consists of records in JSON format, gzipped.
My snippet is loading all the data into memory. There is no need to do this. I just need a few hours of data in memory at a time. I need a sliding window.
Is it possible to apply the 'window' idea from Spark Streaming to the files, and how would I do this?
I'm using Python:
location = "s3://bucketname/xxxx/2016/10/1[1-2]/*/file_prefix*.gz"
rdd = sc.textFile(location)
The snippet you posted actually does no computation. Spark execution is lazy, and only forces computation of "transformations" like map, filter, and even textFile when you ask for a result, for example by counting the RDD.
Another note is that most Spark operations stream by default. If you have 300 10MB JSON files, you're going to get 300 separate partitions or tasks. If you're willing to wait, you could perform most RDD operations on this dataset on one core.
If you need a sliding window, then there's good functionality for that in the Spark Streaming package. But the snippet you posted has no problems as it is!
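As a rough sketch of the "window over files" idea without Spark Streaming, and assuming the path layout from the snippet encodes year/month/day/hour (that layout is an assumption), you can build the glob for only the hours you need; sc.textFile accepts a comma-separated list of paths.
from datetime import datetime, timedelta

def hourly_paths(end, n_hours):
    # S3 globs for the n_hours hours ending at `end` (path layout assumed)
    return ["s3://bucketname/xxxx/{0:%Y/%m/%d/%H}/file_prefix*.gz".format(end - timedelta(hours=h))
            for h in range(n_hours)]

# Load only a few hours of data at a time instead of the whole bucket.
rdd = sc.textFile(",".join(hourly_paths(datetime(2016, 10, 12, 6), 4)))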
I'm a beginner with the Spark DataFrame API.
I use this code to load a tab-separated CSV into a Spark DataFrame:
from pyspark.sql.types import StructType, StructField, StringType

lines = sc.textFile('tail5.csv')
parts = lines.map(lambda l: l.strip().split('\t'))
fnames = *some name list*
schemaData = StructType([StructField(fname, StringType(), True) for fname in fnames])
ddf = sqlContext.createDataFrame(parts, schemaData)
Suppose I create a DataFrame with Spark from new files and convert it to Pandas using the built-in method toPandas().
Does it store the Pandas object in local memory?
Is the Pandas low-level computation all handled by Spark?
Does it expose all Pandas dataframe functionality? (I guess yes)
Can I convert it with toPandas and just be done with it, without touching the DataFrame API so much?
Using Spark to read a CSV file into Pandas is quite a roundabout method for achieving the end goal of reading a CSV file into memory.
It seems like you might be misunderstanding the use cases of the technologies in play here.
Spark is for distributed computing (though it can be used locally). It's generally far too heavyweight to be used for simply reading in a CSV file.
In your example, the sc.textFile method will simply give you a spark RDD that is effectively a list of text lines. This likely isn't what you want. No type inference will be performed, so if you want to sum a column of numbers in your CSV file, you won't be able to because they are still strings as far as Spark is concerned.
Just use pandas.read_csv and read the whole CSV into memory. Pandas will automatically infer the type of each column. Spark doesn't do this.
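A minimal sketch of that, assuming the file is tab-separated with no header row, so the column names come from the same name list as in the snippet above:
import pandas as pd

# Reads the whole file into local memory and infers column dtypes.
df = pd.read_csv('tail5.csv', sep='\t', names=fnames)
df.dtypes   # numeric columns come back as int64/float64, not strings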
Now to answer your questions:
Does it store the Pandas object in local memory?
Yes. toPandas() will convert the Spark DataFrame into a Pandas DataFrame, which is of course in memory.
Is the Pandas low-level computation all handled by Spark?
No. Pandas runs its own computations; there's no interplay between Spark and Pandas, there's simply some API compatibility.
Does it expose all Pandas dataframe functionality?
No. For example, Series objects have an interpolate method which isn't available in PySpark Column objects. There are many many methods and functions that are in the pandas API that are not in the PySpark API.
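For instance, a tiny illustration (not an exhaustive comparison):
import pandas as pd

s = pd.Series([1.0, None, 3.0])
s.interpolate()   # gives 1.0, 2.0, 3.0; there is no direct equivalent on a PySpark Column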
Can I convert it with toPandas and just be done with it, without touching the DataFrame API so much?
Absolutely. In fact, you probably shouldn't even use Spark at all in this case. pandas.read_csv will likely handle your use case unless you're working with a huge amount of data.
Try to solve your problem with simple, low-tech, easy-to-understand libraries, and only go to something more complicated as you need it. Many times, you won't need the more complex technology.
Using a Spark context or Hive context method (sc.textFile(), hc.sql()) to read data 'into memory' returns an RDD, but the RDD remains in distributed memory (memory on the worker nodes), not memory on the master node. All the RDD methods (rdd.map(), rdd.reduceByKey(), etc.) are designed to run in parallel on the worker nodes, with some exceptions. For instance, if you run an rdd.collect() method, you end up copying the contents of the RDD from all the worker nodes to the master node's memory. Thus you lose your distributed compute benefits (but can still run the RDD methods).
Similarly with pandas, when you run toPandas(), you copy the data frame from distributed (worker) memory to the local (master) memory and lose most of your distributed compute capabilities. So, one possible workflow (that I often use) might be to pre-munge your data into a reasonable size using distributed compute methods and then convert to a Pandas data frame for the rich feature set. Hope that helps.
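A hedged sketch of that workflow, with placeholder column names: do the heavy reduction with Spark's distributed methods first, then convert only the small result.
# Reduce in distributed memory first, then copy only the small aggregate
# to the driver; 'status' and 'category' are placeholder column names.
reduced = (ddf.filter(ddf['status'] == 'active')
              .groupBy('category')
              .count())
pdf = reduced.toPandas()   # only the reduced rows land in driver (local) memory
pdf.describe()             # now use the rich Pandas feature set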