Optimisations on Spark jobs - Python

I am new to Spark and want to know about optimisations on Spark jobs.
My job is a simple transformation-type job that merges 2 rows based on a condition. What are the various types of optimisations one can perform on such jobs?

More information about the job would help:
Some generic suggestions:
The arrangement of operators is very important. Not all arrangements result in the same performance; operators should be arranged to reduce the number of shuffles and the amount of data shuffled. Shuffles are fairly expensive operations: all shuffle data must be written to disk and then transferred over the network.
repartition, join, cogroup, and any of the *By or *ByKey transformations can result in shuffles.
rdd.groupByKey().mapValues(_.sum) will produce the same results as rdd.reduceByKey(_ + _). However, the former will transfer the entire dataset across the network, while the latter will compute local sums for each key in each partition and combine those local sums into larger sums after shuffling.
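A minimal PySpark sketch of the same comparison, using made-up data (this is the RDD API version of the Scala snippets above, not code from the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
rdd = spark.sparkContext.parallelize([("a", 1), ("a", 2), ("b", 3)])

# groupByKey ships every (key, value) pair across the network, then sums.
grouped_sums = rdd.groupByKey().mapValues(sum).collect()

# reduceByKey sums within each partition first, so far less data is shuffled.
reduced_sums = rdd.reduceByKey(lambda a, b: a + b).collect()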
You can avoid shuffles when joining two datasets by taking advantage of broadcast variables (see the broadcast-join sketch at the end of this answer).
Avoid the flatMap-join-groupBy pattern.
Avoid reduceByKey when the input and output value types are different.
This is not exhaustive, and you should also consider tuning your Spark configuration.
I hope it helps.
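To illustrate the broadcast-join suggestion above, here is a hedged PySpark sketch with made-up DataFrames; it assumes the second table is small enough to fit in memory on every executor:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

large = spark.createDataFrame([(1, "x"), (2, "y"), (3, "z")], ["id", "payload"])
small = spark.createDataFrame([(1, "A"), (2, "B")], ["id", "label"])

# broadcast() ships the small table to every executor, so the large table
# is joined in place and never shuffled.
joined = large.join(F.broadcast(small), on="id", how="left")
joined.explain()  # the plan should show a broadcast join, with no Exchange on `large`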

Related

Slicing operation with Dask in an optimal way with Python

I have a few questions about the slicing operation.
In pandas we can do the operation as follows:
df["A"].iloc[0]
df["B"].iloc[-1]
# here df["A"],df["B"] is sorted
Since we can't do this (slicing and multiple-column sorting) with Dask (I am not 100% sure), I used another way to do it:
df["A"]=df.sort_values(by=['A'])
first=list(df["A"])[0]
df["B"]=df.sort_values(by=['B'])
end=list(df["B"])[-1]
This way is really time-consuming when the dataframe is large. Is there any other way to do this operation?
https://docs.dask.org/en/latest/dataframe-indexing.html
https://docs.dask.org/en/latest/array-slicing.html
I tried working with this, but it does not work.
The index in Dask is different from pandas, because pandas keeps a global ordering of the data. A Dask DataFrame is indexed from 1 to N within each partition, so there can be multiple rows with an index value of 1. This is why iloc on a row is disallowed, I think.
For this specifically, use
first: https://docs.dask.org/en/latest/generated/dask.dataframe.DataFrame.first.html
last: https://docs.dask.org/en/latest/generated/dask.dataframe.DataFrame.last.html
Sorting is a very expensive operation for large dataframes spread across multiple machines, whereas first and last are very parallelizable operations because they can be computed per partition and then combined across the results of each partition.
It's possible to get almost .iloc-like behaviour with Dask dataframes, but it requires a full pass through the dataset once. For very large datasets, this might be a meaningful time cost.
The rough steps are: 1) create a unique index that matches the row numbering (modifying this answer to start from zero or using this answer), and 2) swap .iloc[N] for .loc[N].
This won't help with relative syntax like .iloc[-1]; however, if you know the total number of rows, you can compute the corresponding absolute position to pass into .loc (see the sketch below).
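A hedged sketch of that index-then-.loc approach, using a small made-up dataframe (column names and partition count are arbitrary):

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"A": [3, 1, 2, 5, 4], "B": [10, 50, 30, 20, 40]})
df = dd.from_pandas(pdf, npartitions=2)

# One pass over the data: build a 0-based global row number and index by it.
df = df.assign(row_id=1)
df["row_id"] = df["row_id"].cumsum() - 1
df = df.set_index("row_id", sorted=True)

n_rows = len(df)                          # total row count (triggers a computation)
first_row = df.loc[0].compute()           # .iloc[0] equivalent
last_row = df.loc[n_rows - 1].compute()   # .iloc[-1] via its absolute position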

Do we have to explicitly use RDDs and operations like foreach, parallelize to perform parallel processing in pyspark?

If we use normal DataFrame operations for group by, merge, or summing two dataframes, versus explicitly using map, reduceByKey, groupByKey, etc., are there performance differences?
Is the former (normal operations) simple sequential processing while the latter is parallel?
Does this mean that to activate parallel processing we have to explicitly use RDDs, and normal dataframe usage is not parallel processing (despite being done in PySpark)?
Do we have to explicitly use RDDs ... to perform parallel processing in pyspark?
No. The DataFrame API is distributed as well.
"Will there be a difference in performance of Dataframes vs RDDs for specific situations?" E.g. df1.join(df2) vs df1.rdd.union(df2.rdd)
Yes.
I at least can't quote a source that says one is more performant in all situations with compelling proof. E.g. this.
But by and large, word of mouth is that the DataFrame API is faster.
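A small hedged sketch (made-up data) showing that the plain DataFrame API already runs distributed, with no RDD calls needed; the RDD version below does the same work but skips the Catalyst optimizer:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a", 1), ("a", 2), ("b", 3)], ["key", "value"])

# DataFrame API: the group-by runs across all partitions in parallel.
df.groupBy("key").agg(F.sum("value").alias("total")).show()

# Equivalent RDD version: also parallel, just written at a lower level.
print(df.rdd.map(lambda r: (r["key"], r["value"]))
        .reduceByKey(lambda a, b: a + b)
        .collect())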

Dask DataFrames vs numpy.memmap performance

I've developed a model which uses several large, 3-dimensional datasets on the order of (1e7, 10, 1e5), and makes millions of read (and thousands of write) calls on slices of those datasets. So far the best tool I've found for making these calls is numpy.memmap, which allows for minimal data to be held in RAM and allows for clean indexing and very fast recall of data directly on the hard drive.
The downside of numpy.memmap seems to be that performance is pretty uneven - the time to read a slice of the array may vary by 2 orders of magnitude between calls. Furthermore, I'm using Dask to parallelize many of the model functions in the script.
How is the performance of Dask DataFrames for making millions of calls to a large dataset? Would swapping out memmaps for DataFrames substantially increase processing times?
You would need to use Dask Array, not Dask DataFrame. The performance is generally the same as NumPy, because NumPy does the actual computation.
Optimizations can speed up the calculation depending on the use case.
The overhead of the scheduler decreases performance, but this only matters if you split the data into many partitions and can usually be neglected.
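A minimal sketch of wrapping a memmap in a Dask Array (the file name, dtype, and shape are placeholders scaled down from the question, and the file is assumed to exist):

import numpy as np
import dask.array as da

# A small stand-in for the (1e7, 10, 1e5) datasets.
mm = np.memmap("data.dat", dtype="float32", mode="r", shape=(10_000, 10, 1_000))

# Each Dask chunk maps onto a slab of the file; slicing builds a lazy graph.
arr = da.from_array(mm, chunks=(1_000, 10, 1_000))

# Only the chunks touched by the slice are read when compute() is called.
block = arr[2_000:2_050, :, 100:200].compute()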

Difference between DISTRIBUTE BY and Shuffle in Spark-SQL

I am trying to understand the DISTRIBUTE BY clause and how it could be used in Spark-SQL to optimize sort-merge joins.
As per my understanding, the Spark SQL optimizer will distribute the datasets of both participating tables (of the join) based on the join keys (shuffle phase) to co-locate the same keys in the same partition. If that is the case, then when we use DISTRIBUTE BY in the SQL, we are doing the same thing.
So in what way can DISTRIBUTE BY be used to ameliorate join performance? Or is it better to use DISTRIBUTE BY while writing the data to disk during the load process, so that subsequent queries using this data benefit from not having to shuffle it?
Can you please explain with a real-world example how to tune a join using DISTRIBUTE BY / CLUSTER BY in Spark-SQL?
Let me try to answer each part of your question:
As per my understanding, the Spark SQL optimizer will distribute the datasets of both participating tables (of the join) based on the join keys (shuffle phase) to co-locate the same keys in the same partition. If that is the case, then when we use DISTRIBUTE BY in the SQL, we are doing the same thing.
Yes, that is correct.
So in what way can DISTRIBUTE BY be used to ameliorate join performance?
Sometimes one of your tables is already distributed, for example the table was bucketed or the data was aggregated before the join by the same key. In this case, if you explicitly repartition the second table as well (distribute by), you will achieve the same partitioning in both branches of the join, and Spark will not induce any more shuffles in the first branch (this is sometimes referred to as a one-side shuffle-free join, because the shuffle will occur only in one branch of the join - the one in which you call repartition / distribute by). On the other hand, if you don't explicitly repartition the other table, Spark will see that each branch of the join has a different partitioning and will therefore shuffle both branches. So in some special cases calling repartition (distribute by) can save you one shuffle.
Notice that to make this work you also need to achieve the same number of partitions in both branches. So if you have two tables that you want to join on the key user_id, and the first table is bucketed into 10 buckets with this key, then you need to repartition the other table into 10 partitions by the same key as well, and then the join will have only one shuffle (in the physical plan you can see that there will be an Exchange operator in only one branch of the join).
Or is it better to use DISTRIBUTE BY while writing the data to disk during the load process, so that subsequent queries using this data benefit from not having to shuffle it?
Well, this is actually called bucketing (cluster by), and it allows you to pre-shuffle the data once; then each time you read the data and join it (or aggregate it) by the same key by which you bucketed, it will be free of shuffle. So yes, this is a very common technique: you pay the cost only once when saving the data and then leverage it each time you read the data. A sketch of both ideas follows.
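A hedged PySpark sketch of both ideas; the table names, column name, and bucket count here are illustrative, not a definitive recipe:

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Pay the shuffle once at write time: bucket the large table by the join key.
(spark.table("events")                 # hypothetical source table
      .write
      .bucketBy(10, "user_id")
      .sortBy("user_id")
      .saveAsTable("events_bucketed"))

# At read time, bring the other side to the same partitioning: repartition by
# the same key into the same number of partitions (DISTRIBUTE BY in SQL terms).
users = spark.table("users").repartition(10, "user_id")
# SQL flavour of the same idea: spark.sql("SELECT * FROM users DISTRIBUTE BY user_id")

joined = spark.table("events_bucketed").join(users, "user_id")
joined.explain()  # expect an Exchange on only one branch of the join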

Is .withColumn and .agg calculated in parallel in pyspark?

Consider for example
df.withColumn("customr_num", col("customr_num").cast("integer")).\
withColumn("customr_type", col("customr_type").cast("integer")).\
agg(myMax(sCollect_list("customr_num")).alias("myMaxCustomr_num"), \
myMean(sCollect_list("customr_type")).alias("myMeanCustomr_type"), \
myMean(sCollect_list("customr_num")).alias("myMeancustomr_num"),\
sMin("customr_num").alias("min_customr_num")).show()
Are .withColumn and the list of functions inside agg (sMin, myMax, myMean, etc.) calculated in parallel by Spark, or in sequence?
If sequential, how do we parallelize them?
In essence, as long as you have more than one partition, operations are always parallelized in Spark. If what you mean, though, is whether the withColumn operations are going to be computed in one pass over the dataset, then the answer is also yes. In general, you can use the Spark UI to learn more about the way things are computed.
Let's take an example that's very similar to your example.
spark.range(1000)
.withColumn("test", 'id cast "double")
.withColumn("test2", 'id + 10)
.agg(sum('id), mean('test2), count('*))
.show
And let's have a look at the UI.
Range corresponds to the creation of the data; then you have the projection (the two withColumn operations) and then the aggregation (agg) within each partition (we have 2 here). In a given partition, these things are done sequentially, but for all the partitions at the same time. Also, they are in the same stage (one blue box), which means that they are all computed in one pass over the data.
Then there is a shuffle (Exchange), which means that data is exchanged over the network (the results of the aggregations per partition), the final aggregation is performed (HashAggregate), and the result is then sent to the driver (collect).
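For reference, a PySpark version of the same experiment (a sketch; explain() prints the physical plan described above, without opening the UI):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

(spark.range(1000)
      .withColumn("test", F.col("id").cast("double"))
      .withColumn("test2", F.col("id") + 10)
      .agg(F.sum("id"), F.mean("test2"), F.count("*"))
      .explain())
# Both withColumn projections and the partial aggregation sit before the
# Exchange in the plan, i.e. they run in one pass over each partition.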
