Pyspark alternative to spark.lapply? - python

I have a compute-intensive Python function that is called repeatedly within a for loop (each iteration is independent, i.e. embarrassingly parallel). I am looking for spark.lapply-style functionality (as in SparkR) to utilize the Spark cluster.

Native Spark
If you use Spark data frames and libraries, then Spark will natively parallelize and distribute your task.
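For example, a minimal sketch, assuming an existing SparkSession named spark and a hypothetical path and column names; work expressed through Spark DataFrame operations is planned and executed in parallel across the executors without extra effort on your part:

from pyspark.sql import functions as F

df = spark.read.parquet("my_data")                 # hypothetical dataset path
result = df.groupBy("key").agg(F.avg("value"))     # planned and run in parallel on the executors
result.show()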
Thread Pools
One of the ways that you can achieve parallelism in Spark without using Spark data frames is by using a thread pool from the multiprocessing library. However, by default all of your code will run on the driver node.
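One common pattern is to use a thread pool on the driver to submit several independent Spark jobs concurrently. A minimal sketch, assuming an existing SparkSession named spark (run_task and its inputs are hypothetical placeholders):

from multiprocessing.pool import ThreadPool

def run_task(n):
    # each call kicks off an independent Spark job; Spark distributes its work
    return spark.range(n).count()

pool = ThreadPool(4)                               # number of concurrent driver-side threads
results = pool.map(run_task, [10, 20, 30, 40])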
Pandas UDFs
One of the newer features in Spark that enables parallel processing is Pandas UDFs. With this feature, you can partition a Spark data frame into smaller data sets that are distributed and converted to Pandas objects, where your function is applied, and then the results are combined back into one large Spark data frame.
Example from https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html
from pyspark.sql.functions import udf

# Use udf to define a row-at-a-time udf
@udf('double')
# Input/output are both a single double value
def plus_one(v):
    return v + 1

df.withColumn('v2', plus_one(df.v))
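The same post also shows the vectorized (Pandas UDF) counterpart, which roughly looks like the sketch below; here the function receives and returns a whole pandas Series per batch instead of one value at a time:

from pyspark.sql.functions import pandas_udf, PandasUDFType

# Vectorized: v is a pandas Series covering a batch of rows
@pandas_udf('double', PandasUDFType.SCALAR)
def pandas_plus_one(v):
    return v + 1

df.withColumn('v2', pandas_plus_one(df.v))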

Related

Aggregation/groupby using pymongo versus using pandas

I have an API that loads data from MongoDB (with pymongo) and then applies relatively "complex" data transformations with pandas, such as groupby on datetime columns with a parametrized frequency, and other things. Since I'm more of an expert in pandas than in Mongo, I'd prefer to keep doing it that way, but I have no idea whether writing these transformations as Mongo aggregate queries would be significantly faster.
To simplify the question, setting aside the difficulty of writing the queries on either side: is it faster to do a [simple groupby in Mongo and select the results] or to [select * and do the groupby in pandas/dask (in a distributed scenario)]? Is the former faster or slower than the latter on large datasets, and on smaller ones?
In general aggregation will be much faster than Pandas because the aggregation pipeline is:
- Uploaded to the server and processed there, eliminating the network latency associated with round trips
- Optimised internally to achieve the fastest execution
- Executed in C on the server with all the available threading, as opposed to running on a single thread in Pandas
- Working on the same dataset within the same process, which means your working set will be in memory, eliminating disk accesses
As a stopgap I recommend pre-processing your data into a new collection using $out and then using Pandas to process that.
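A minimal sketch of that stopgap with pymongo (the database, collection, and field names are hypothetical): the grouping runs server-side and $out writes the reduced result to a new collection that pandas can then load.

from pymongo import MongoClient

client = MongoClient()
coll = client["mydb"]["events"]

coll.aggregate([
    # group on the server so only the reduced result crosses the network
    {"$group": {"_id": "$category", "total": {"$sum": "$amount"}}},
    # write the pre-processed result to a new collection
    {"$out": "events_summary"},
])

The events_summary collection can then be read into pandas, e.g. with pd.DataFrame(list(client["mydb"]["events_summary"].find())).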

Problem running a Pandas UDF on a large dataset

I'm currently working on a project and I am having a hard time understanding how Pandas UDFs in PySpark work.
I have a Spark cluster with one master node with 8 cores and 64GB, along with two workers of 16 cores each and 112GB. My dataset is quite large and divided into seven principal partitions, each consisting of ~78M lines. The dataset has 70 columns.
I defined a Pandas UDF to do some operations on the dataset that can only be done using Python, on a Pandas dataframe.
The Pandas UDF is defined this way:
@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def operation(pdf):
    # Some operations
    return pdf

spark.table("my_dataset").groupBy(partition_cols).apply(operation)
There is absolutely no way to get the Pandas UDF to work as it crashes before even doing the operations. I suspect there is an OOM error somewhere. The code above runs for a few minutes before crashing with an error code stating that the connection has reset.
However, if I call the .toPandas() function after filtering on one partition and then display it, it runs fine, with no error. The error seems to happen only when using a PandasUDF.
I fail to understand how it works. Does Spark try to convert one whole partition at once (78M lines)? If so, what memory does it use? The driver memory? The executor's? If it's on the driver's, is all Python code executed on it?
The cluster is configured with the following :
SPARK_WORKER_CORES=2
SPARK_WORKER_MEMORY=64g
spark.executor.cores 2
spark.executor.memory 30g (to allow memory for the python instance)
spark.driver.memory 43g
Am I missing something or is there just no way to run 78M lines through a PandasUDF ?
Does Spark try to convert one whole partition at once (78M lines) ?
That's exactly what happens. Spark 3.0 adds support for chunked UDFs, which operate on iterators of Pandas DataFrames or Series, but if your operations on the dataset can only be done using Python on a whole Pandas dataframe, these might not be the right choice for you.
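For reference, a sketch of such a chunked (iterator of Series) Pandas UDF in Spark 3.0+, assuming a dataframe df with a numeric column v; each batch is an Arrow chunk of a partition, not the whole partition:

from typing import Iterator
import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf("double")
def plus_one_chunked(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
    for batch in batches:            # one Arrow batch at a time
        yield batch + 1

df.withColumn("v2", plus_one_chunked("v"))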
If so, what memory does it use ? The driver memory? The executor's?
Each partition is processed locally, on the respective executor, and data is passed to and from the Python worker using Arrow streaming.
Am I missing something or is there just no way to run 78M lines through a PandasUDF?
As long as you have enough memory to handle the Arrow input and output (especially if data is copied), auxiliary data structures, as well as the JVM overhead, it should handle large datasets just fine.
But on such a small cluster, you'd be better off partitioning the output and reading the data directly with Pandas, without using Spark at all. This way you'll be able to use all of the available resources (i.e. > 100GB per interpreter) for data processing instead of wasting them on secondary tasks (leaving only 16GB minus overhead per interpreter).
To answer the general question about using a Pandas UDF on a large pyspark dataframe:
If you're getting out-of-memory errors such as
java.lang.OutOfMemoryError: GC overhead limit exceeded or java.lang.OutOfMemoryError: Java heap space, and increasing memory limits hasn't worked, ensure that pyarrow is enabled. It is disabled by default.
In pyspark, you can enable it using:
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
More info is in the Spark documentation on Apache Arrow in PySpark.

Python multiprocessing tool vs Py(Spark)

A newbie question, as I get increasingly confused with pyspark. I want to scale an existing Python data preprocessing and data analysis pipeline. I realize that if I partition my data with pyspark, I can't treat each partition as a standalone pandas data frame anymore; I need to learn to manipulate data with pyspark.sql row/column functions and change a lot of existing code, plus I am bound to the Spark MLlib libraries and can't take full advantage of the more mature scikit-learn package. So why would I ever need to use Spark if I can use multiprocessing tools for cluster computing and parallelize tasks on an existing dataframe?
True, Spark does have the limitations you mentioned, that is, you are bound to the Spark ecosystem (Spark MLlib, DataFrames, etc.). However, what it provides over other multiprocessing tools/libraries is the automatic distribution, partitioning, and rescaling of parallel tasks. Scaling and scheduling Spark code is an easier task than having to write your own multiprocessing code to cope with larger amounts of data and computation.

How are Python data structures implemented in Spark when using PySpark?

I am currently self-learning Spark programming and trying to recode an existing Python application in PySpark. However, I am still confused about how we use regular Python objects in PySpark.
I understand the distributed data structures in Spark such as the RDD, DataFrame, Dataset, vector, etc. Spark has its own transformation and action operations, such as .map() and .reduceByKey(), to manipulate those objects. However, what if I create traditional Python data objects such as an array, list, tuple, or dictionary in PySpark? They will only be stored in the memory of my driver program node, right? If I transform them into an RDD, can I still do operations on them with typical Python functions?
If I have a huge dataset, can I use regular Python libraries like pandas or numpy to process it in PySpark? Will Spark only use the driver node to process the data if I directly execute a Python function on a Python object in PySpark? Or do I have to create an RDD and use Spark's operations?
- You can create traditional Python data objects such as an array, list, tuple, or dictionary in PySpark.
- You can perform most operations using Python functions in PySpark.
- You can import Python libraries in PySpark and use them to process data.
- You can create an RDD and apply Spark operations on it (see the sketch below).
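A small sketch of those points, assuming an existing SparkContext named sc; the list, the lambda, and numpy are all ordinary Python, and Spark applies the function on the executors:

import numpy as np                         # a regular Python library

data = [1, 2, 3, 4, 5]                     # a plain Python list, lives on the driver
rdd = sc.parallelize(data)                 # distributed as an RDD across the cluster

# an ordinary Python function applied in parallel on the executors
roots = rdd.map(lambda x: float(np.sqrt(x))).collect()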

What is the Spark DataFrame method `toPandas` actually doing?

I'm a beginner with the Spark DataFrame API.
I use this code to load a tab-separated CSV into a Spark DataFrame:
from pyspark.sql.types import StructType, StructField, StringType

lines = sc.textFile('tail5.csv')
parts = lines.map(lambda l: l.strip().split('\t'))
fnames = *some name list*
schemaData = StructType([StructField(fname, StringType(), True) for fname in fnames])
ddf = sqlContext.createDataFrame(parts, schemaData)
Suppose I create a DataFrame with Spark from new files and convert it to pandas using the built-in method toPandas().
Does it store the Pandas object in local memory?
Is the Pandas low-level computation all handled by Spark?
Does it expose all pandas dataframe functionality? (I guess yes)
Can I convert it to pandas and just be done with it, without touching the DataFrame API so much?
Using Spark to read a CSV file into pandas is quite a roundabout method for achieving the end goal of reading a CSV file into memory.
It seems like you might be misunderstanding the use cases of the technologies in play here.
Spark is for distributed computing (though it can be used locally). It's generally far too heavyweight to be used for simply reading in a CSV file.
In your example, the sc.textFile method will simply give you a spark RDD that is effectively a list of text lines. This likely isn't what you want. No type inference will be performed, so if you want to sum a column of numbers in your CSV file, you won't be able to because they are still strings as far as Spark is concerned.
Just use pandas.read_csv and read the whole CSV into memory. Pandas will automatically infer the type of each column. Spark doesn't do this.
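For instance, a short sketch using the tab-separated file from the question (with fnames being the same name list):

import pandas as pd

df = pd.read_csv('tail5.csv', sep='\t', names=fnames)
df.dtypes        # pandas has inferred a type for each column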
Now to answer your questions:
Does it store the Pandas object in local memory?
Yes. toPandas() will convert the Spark DataFrame into a Pandas DataFrame, which is of course in memory.
Is the Pandas low-level computation all handled by Spark?
No. Pandas runs its own computations; there's no interplay between Spark and pandas, there's simply some API compatibility.
Does it expose all pandas dataframe functionality?
No. For example, Series objects have an interpolate method which isn't available in PySpark Column objects. There are many many methods and functions that are in the pandas API that are not in the PySpark API.
Can I convert it to pandas and just be done with it, without touching the DataFrame API so much?
Absolutely. In fact, you probably shouldn't even use Spark at all in this case. pandas.read_csv will likely handle your use case unless you're working with a huge amount of data.
Try to solve your problem with simple, low-tech, easy-to-understand libraries, and only go to something more complicated as you need it. Many times, you won't need the more complex technology.
Using some spark context or hive context method (sc.textFile(), hc.sql()) to read data 'into memory' returns an RDD, but the RDD remains in distributed memory (memory on the worker nodes), not memory on the master node. All the RDD methods (rdd.map(), rdd.reduceByKey(), etc.) are designed to run in parallel on the worker nodes, with some exceptions. For instance, if you run an rdd.collect() method, you end up copying the contents of the RDD from all the worker nodes to the master node's memory. Thus you lose your distributed compute benefits (but can still run the RDD methods).
Similarly with pandas, when you run toPandas(), you copy the data frame from distributed (worker) memory to the local (master) memory and lose most of your distributed compute capabilities. So, one possible workflow (that I often use) might be to pre-munge your data into a reasonable size using distributed compute methods and then convert to a Pandas data frame for the rich feature set. Hope that helps.
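As a sketch of that workflow (the table, columns, and filter values are illustrative): reduce the data on the cluster first, then pull only the small result into pandas.

small_pdf = (spark.table("big_table")
             .filter("event_date >= '2020-01-01'")
             .groupBy("user_id")
             .count()
             .toPandas())        # now a regular pandas DataFrame on the driver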
