Is .withColumn and .agg calculated in parallel in pyspark? - python

Consider for example
df.withColumn("customr_num", col("customr_num").cast("integer")).\
withColumn("customr_type", col("customr_type").cast("integer")).\
agg(myMax(sCollect_list("customr_num")).alias("myMaxCustomr_num"), \
myMean(sCollect_list("customr_type")).alias("myMeanCustomr_type"), \
myMean(sCollect_list("customr_num")).alias("myMeancustomr_num"),\
sMin("customr_num").alias("min_customr_num")).show()
Are .withColumn and the list of functions inside agg (sMin, myMax, myMean, etc.) calculated in parallel by Spark, or in sequence ?
If sequential, how do we parallelize them ?

By essence, as long as you have more than one partition, operations are always parallelized in spark. If what you mean though is, are the withColumn operations going to be computed in one pass over the dataset, then the answer is also yes. In general, you can use the Spark UI to know more about the way things are computed.
Let's take an example that's very similar to your example.
spark.range(1000)
.withColumn("test", 'id cast "double")
.withColumn("test2", 'id + 10)
.agg(sum('id), mean('test2), count('*))
.show
And let's have a look at the UI.
Range corresponds to the creation of the data, then you have project (the two withColumn operations) and then the aggregation (agg) within each partition (we have 2 here). In a given partition, these things are done sequentially, but for all the partitions at the same time. Also, they are in the same stage (on blue box) which mean that they are all computed in one pass over the data.
Then there is a shuffle (exchange) which mean that data is exchanged over the network (the result of the aggregations per partition) and the final aggregation is performed (HashAggregate) and then sent to the driver (collect)

Related

Slicing operation with Dask in a optimal way with Python

I have few questions for the slicing operation.
in pandas we can do operation as follows -:
df["A"].iloc[0]
df["B"].iloc[-1]
# here df["A"],df["B"] is sorted
as we can't do this (Slicing and Multiple_col_sorting) with Dask (i am not 100% sure), I used another way to do it
df["A"]=df.sort_values(by=['A'])
first=list(df["A"])[0]
df["B"]=df.sort_values(by=['B'])
end=list(df["B"])[-1]
this way is really time-consuming when the dataframe is large, is there any other way to do this operation?
https://docs.dask.org/en/latest/dataframe-indexing.html
https://docs.dask.org/en/latest/array-slicing.html
I tried working with this, but it does not work.
The index or Dask is different than Pandas because Pandas is a global ordering of the data. Dask is indexed from 1 to N for each partition so there are multiple items with index value of 1. This is why iloc on a row is disallowed I think.
For this specifically, use
first: https://docs.dask.org/en/latest/generated/dask.dataframe.DataFrame.first.html
last:
https://docs.dask.org/en/latest/generated/dask.dataframe.DataFrame.last.html
Sorting is a very expensive operation for large dataframes spread across multiple machines, whereas first and last are very parallelizeable operations because it can be done per partition and then executed again among the results of each partition.
It's possible to get almost .iloc-like behaviour with dask dataframes, but it requires having a pass through the whole dataset once. For very large datasets, this might be a meaningful time cost.
The rough steps are: 1) create a unique index that matches the row numbering (modifying this answer to start from zero or using this answer), and 2) swap .iloc[N] for .loc[N].
This won't help with relative syntax like .iloc[-1], however if you know the total number of rows, you can compute the corresponding absolute position to pass into .loc.

Dask: Sorting truly lazily

If I have a dataset with unknown divisions and would like to sort it according to a column and output to Parquet, it seems to me that Dask does at least some of the work twice:
import dask
import dask.dataframe as dd
def my_identity(x):
"""Does nothing, but shows up on the Dask dashboard"""
return x
df = dask.datasets.timeseries()
df = df.map_partitions(my_identity)
df = df.set_index(['name']) # <- `my_identity` is calculated here, as well as other tasks
df.to_parquet('temp.parq') # <- previous tasks seem to be recalculated here
If my_identity was computationally demanding, then recomputing it would be really costly.
Am I correct in my understanding that Dask does some work twice here? Is there any way to prevent that?
The explanation below may not be accurate, but hopefully helps a bit.
Let's try to get into dask's shoes on this. We are asking dask to create an index based on some variable... Dask only works with sorted indexes, so Dask will want to know how to re-arrange data to make it sorted and also what will be the appropriate divisions for the partitions. The first calculation you see is doing that, and dask will store only the parts of calculation necessary for the divisions/data-reshuffling.
Then when we ask Dask to save the data, it computes the variables, shuffles the data (in line with the previous computations) and stores it in corresponding partitions.
How to avoid this? Possible options:
persist before setting the index. Once you persist, dask will compute the variable and keep it on workers, so setting index will refer to the results of that computation. There will still be reshuffling of the data needed). Note that the documentation suggests persisting after setting the index, but that case assumes that the column exists (does not require separate computation).
Sort within partitions, this can be done lazily, but of course it's only an option if you do not need a global sort.
Use plain pandas, this may necessitate some manual chunking of the data (what I tend to use for sorting).

How to make Spark read only specified rows?

Suppose I'm selecting given rows from a large table A. The target rows are given either by a small index table B, or by a list C. The default behavior of
A.join(broadcast(B), 'id').collect()
or
A.where(col('id').isin(C)).collect()
will create a task that reads in all data of A before filtering out the target rows. Take the broadcast join as an example, in the task DAG, we see that the Scan parquet procedure determines columns to read, which in this case, are all columns.
The problem is, since each row of A is quite large, and the selected rows are quite few, ideally it's better to:
read in only the id column of A;
decide the rows to output with broadcast join;
read in only the selected rows to output from A according to step 2.
Is it possible to achieve this?
BTW, rows to output could be scattered in A so it's not possible to make use of partition keys.
will create a task that reads in all data of A
You're wrong. While the first scenario doesn't push any filters, other than IsNotNull on join key in case of inner or left join, the second approach will push In down to the source.
If isin list is large, this might not necessary be faster, but it is optimized nonetheless.
If you want to fully benefit from possible optimization you should still use bucketing (DISTRIBUTE BY) or partitioning (PARTITIONING BY). These are useful in the IS IN scenario, but bucketing can be also used, in the first one, where B is too large to be broadcasted.

Should the DataFrame function groupBy be avoided?

This link and others tell me that the Spark groupByKey is not to be used if there is a large number of keys, since Spark shuffles all the keys around. Does the same apply to the groupBy function as well? Or is this something different?
I'm asking this because I want to do what this question tries to do, but I have a very large number of keys. It should be possible to do this without shuffling all the data around by reducing on each node locally, but I can't find the PySpark way to do this (frankly, I find the documentation quite lacking).
Essentially, I am trying to do is:
# Non-working pseudocode
df.groupBy("A").reduce(lambda x,y: if (x.TotalValue > y.TotalValue) x else y)
However, the dataframe API does not offer a "reduce" option. I'm probably misunderstanding what exactly dataframe is trying to achieve.
A DataFrame groupBy followed by an agg will not move the data around unnecessarily, see here for a good example. Hence, there is no need to avoid it.
When using the RDD API, the opposite is true. Here it is preferable to avoid groupByKey and use a reducebyKey or combineByKey where possible. Some situations, however, do require one to use groupByKey.
The normal way to do this type of operation with the DataFrame API is to use groupBy followed by an aggregation using agg. In your example case, you want to find the maximum value for a single column for each group, this can be achived by the max function:
from pyspark.sql import functions as F
joined_df.groupBy("A").agg(F.max("TotalValue").alias("MaxValue"))
In addition to max there are a multitude of functions that can be used in combination with agg, see here for all operations.
The documentation is pretty all over the place.
There has been a lot of optimization work for dataframes. Dataframes has additional information about the structure of your data, which helps with this. I often find that many people recommend dataframes over RDDs due to "increased optimization."
There is a lot of heavy wizardry behind the scenes.
I recommend that you try "groupBy" on both RDDs and dataframes on large datasets and compare the results. Sometimes, you may need to just do it.
Also, for performance improvements, I suggest fiddling (through trial and error) with:
the spark configurations Doc
shuffle.partitions Doc

Optimisations on spark jobs

I am new to spark and want to know about optimisations on spark jobs.
My job is a simple transformation type job merging 2 rows based on a condition. What are the various types of optimisations one can perform over such jobs?
More information about the job would help:
Some of the generic suggestions:
Arrangement operators is very important. Not all arrangements will result in the same performance. Arrangement of operators should be such to reduce the number of shuffles and the amount of data shuffled. Shuffles are fairly expensive operations; all shuffle data must be written to disk and then transferred over the network.
repartition , join, cogroup, and any of the *By or *ByKey transformations can result in shuffles.
rdd.groupByKey().mapValues(_.sum) will produce the same results as rdd.reduceByKey(_ + _). However, the former will transfer the entire dataset across the network, while the latter will compute local sums for each key in each partition and combine those local sums into larger sums after shuffling.
You can avoid shuffles when joining two datasets by taking advantage
of broadcast variables.
Avoid the flatMap-join-groupBy pattern.
Avoid reduceByKey When the input and output value types are different.
This is not exhaustive. And you should also consider tuning your configuration of Spark.
I hope it helps.

Categories