Difference between DISTRIBUTE BY and Shuffle in Spark-SQL - python

I am trying to understand the DISTRIBUTE BY clause and how it can be used in Spark SQL to optimize sort-merge joins.
As per my understanding, the Spark SQL optimizer will distribute the datasets of both participating tables of the join based on the join keys (the shuffle phase), so that the same keys are co-located in the same partitions. If that is the case, then using DISTRIBUTE BY in the SQL does the same thing.
So in what way can DISTRIBUTE BY be used to improve join performance? Or is it better to use DISTRIBUTE BY when writing the data to disk in the load process, so that subsequent queries using this data benefit from it by not having to shuffle?
Can you please explain with a real-world example how to tune a join using DISTRIBUTE BY / CLUSTER BY in Spark SQL?

Let me try to answer each part of your question:
As per my understanding, the Spark SQL optimizer will distribute the datasets of both participating tables of the join based on the join keys (the shuffle phase), so that the same keys are co-located in the same partitions. If that is the case, then using DISTRIBUTE BY in the SQL does the same thing.
Yes, that is correct.
So in what way can DISTRIBUTE BY be used to improve join performance?
Sometimes one of your tables is already distributed appropriately, for example because the table was bucketed, or because the data was aggregated before the join on the same key. In this case, if you explicitly repartition the second table as well (DISTRIBUTE BY), you will achieve the same partitioning in both branches of the join, and Spark will not introduce an additional shuffle in the first branch. This is sometimes referred to as a one-side shuffle-free join, because the shuffle occurs in only one branch of the join, the one in which you call repartition / DISTRIBUTE BY. On the other hand, if you don't explicitly repartition the other table, Spark will see that each branch of the join has a different partitioning and will therefore shuffle both branches. So in some special cases, calling repartition (DISTRIBUTE BY) can save you one shuffle.
Notice that to make this work you also need the same number of partitions in both branches. So if you have two tables that you want to join on the key user_id, and the first table is bucketed into 10 buckets by this key, then you need to repartition the other table into 10 partitions by the same key; the join will then have only one shuffle (in the physical plan you will see an Exchange operator in only one branch of the join).
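To illustrate, a minimal PySpark sketch of the user_id case (the table names and bucket count are hypothetical; it assumes an existing SparkSession named spark):
# Hypothetical setup: 'orders' was saved bucketed into 10 buckets by user_id,
# while 'users' is an ordinary, non-bucketed table.
from pyspark.sql.functions import col

orders = spark.table("orders")   # bucketed by user_id into 10 buckets
users = spark.table("users")     # not bucketed

# Repartition the non-bucketed side into the same number of partitions on the
# same key (the DataFrame equivalent of DISTRIBUTE BY user_id).
users_repart = users.repartition(10, col("user_id"))

joined = orders.join(users_repart, "user_id")
joined.explain()   # expect an Exchange in only one branch of the SortMergeJoin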
Or is it better to use DISTRIBUTE BY when writing the data to disk in the load process, so that subsequent queries using this data benefit from it by not having to shuffle?
Well, this is actually called bucketing (CLUSTER BY), and it allows you to pre-shuffle the data once; then each time you read the data and join it (or aggregate it) by the same key by which you bucketed, it will be free of shuffle. So yes, this is a very common technique: you pay the cost only once when saving the data and then benefit each time you read it.
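A minimal sketch of that write-once approach (the DataFrame, table name, key and bucket count are hypothetical):
# Pay the shuffle cost once at write time by bucketing on the join/aggregation key.
(events_df.write
    .bucketBy(10, "user_id")
    .sortBy("user_id")
    .mode("overwrite")
    .saveAsTable("events_bucketed"))   # bucketBy requires saveAsTable

# Later reads that join or aggregate by user_id can skip the shuffle:
spark.table("events_bucketed").groupBy("user_id").count().explain()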

Related

How to determine if one SQL statement generates a dataset that is a subset/superset of another SQL statement?

I want to write a system to materialize partial datasets that are repeated across the workload of a warehouse, so that an expensive and common computation is computed only once, and all subsequent queries using that dataset (or a subset of it) do a plain SELECT from the materialized dataset instead of running the query again. The warehouse uses Spark + HDFS.
Finding the most common queries, subqueries and CTEs was easy and is already solved.
What I'm stuck on now is finding a way to identify materialization candidates: I can't devise a method to determine whether one query is a superset of another (or, conversely, whether one query is a subset of another). If I knew that, it would be possible to do query rewriting and replace the expensive subquery with the already materialized table.
One way I was thinking about this is to build a graph where queries and subqueries are nodes and the edges represent the fact that one query can be answered by the other. But I'm having a hard time building the edges of this graph.
In summary: given 2 SQL statements, for example:
Q1: SELECT col1, col2 FROM table;
and
Q2: SELECT col1 FROM table WHERE col2 > 0;
How can I determine Q2 is a subset of Q1?
This is a simple example; joins, aggregations, unions, and WHERE and HAVING conditions must also be taken into consideration.
Evaluating the datasets is not an option; this should be determined just from the SQL statements and/or the execution plans (we are talking about multi-petabyte datasets, and comparing datasets is too resource intensive).
My ideas so far: take the list of projections and compare; if the list of projected columns of Q1 contains the list of projections of Q2, then Q1 might be a superset of Q2.
Also check the columns involved in the filters and what filters are being applied.
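As a rough, non-authoritative sketch of that idea, assuming you have already extracted the projected columns and filter columns of each query with a SQL parser (the QueryMeta class and the containment rule below are purely illustrative and ignore joins, aggregations and expression equivalence):
# Purely illustrative: QueryMeta would be filled in by your SQL parser / plan
# analysis, not by this snippet.
from dataclasses import dataclass

@dataclass(frozen=True)
class QueryMeta:
    table: str                               # source table
    projections: frozenset                   # columns in the SELECT list
    filter_columns: frozenset = frozenset()  # columns referenced in WHERE

def might_be_superset(q1: QueryMeta, q2: QueryMeta) -> bool:
    # Q1 might answer Q2 if both read the same table, Q1 projects every column
    # Q2 needs (including the columns Q2 filters on), and Q1 applies no filters
    # of its own, so it returns at least the rows Q2 would return.
    needed = q2.projections | q2.filter_columns
    return q1.table == q2.table and needed <= q1.projections and not q1.filter_columns

q1 = QueryMeta("table", frozenset({"col1", "col2"}))
q2 = QueryMeta("table", frozenset({"col1"}), frozenset({"col2"}))
print(might_be_superset(q1, q2))   # True for the Q1/Q2 example above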

Are .withColumn and .agg calculated in parallel in pyspark?

Consider for example
# sCollect_list and sMin are assumed to be aliases for pyspark.sql.functions
# collect_list and min; myMax and myMean are assumed user-defined helpers
# that operate on the collected lists.
(df.withColumn("customr_num", col("customr_num").cast("integer"))
   .withColumn("customr_type", col("customr_type").cast("integer"))
   .agg(myMax(sCollect_list("customr_num")).alias("myMaxCustomr_num"),
        myMean(sCollect_list("customr_type")).alias("myMeanCustomr_type"),
        myMean(sCollect_list("customr_num")).alias("myMeancustomr_num"),
        sMin("customr_num").alias("min_customr_num"))
   .show())
Are .withColumn and the list of functions inside agg (sMin, myMax, myMean, etc.) calculated in parallel by Spark, or in sequence?
If sequential, how do we parallelize them?
In essence, as long as you have more than one partition, operations are always parallelized in Spark. If what you mean, though, is whether the withColumn operations are going to be computed in one pass over the dataset, then the answer is also yes. In general, you can use the Spark UI to learn more about how things are computed.
Let's take an example that's very similar to your example.
// assumes: import org.apache.spark.sql.functions._ and import spark.implicits._
spark.range(1000)
  .withColumn("test", 'id cast "double")
  .withColumn("test2", 'id + 10)
  .agg(sum('id), mean('test2), count('*))
  .show
And let's have a look at the UI.
Range corresponds to the creation of the data; then you have Project (the two withColumn operations) and then the aggregation (agg) within each partition (we have 2 here). Within a given partition these things are done sequentially, but across all partitions at the same time. Also, they are in the same stage (one blue box), which means they are all computed in one pass over the data.
Then there is a shuffle (Exchange), which means that data is exchanged over the network (the result of the aggregations per partition), the final aggregation is performed (HashAggregate), and the result is sent to the driver (collect).
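Since the question is in PySpark, here is a rough equivalent of the snippet above (a sketch, assuming an existing SparkSession named spark; sum is imported under an alias to avoid shadowing the Python builtin):
from pyspark.sql.functions import col, count, mean, sum as sql_sum

(spark.range(1000)
    .withColumn("test", col("id").cast("double"))
    .withColumn("test2", col("id") + 10)
    .agg(sql_sum("id"), mean("test2"), count("*"))
    .show())
# Calling .explain() on the same expression shows one stage doing
# Range -> Project -> partial HashAggregate, then an Exchange, then the
# final HashAggregate.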

How to make Spark read only specified rows?

Suppose I'm selecting specific rows from a large table A. The target rows are given either by a small index table B or by a list C. The default behavior of
A.join(broadcast(B), 'id').collect()
or
A.where(col('id').isin(C)).collect()
will create a task that reads in all data of A before filtering out the target rows. Taking the broadcast join as an example: in the task DAG, we see that the Scan parquet step determines the columns to read, which in this case are all columns.
The problem is, since each row of A is quite large and the selected rows are quite few, ideally it would be better to:
read in only the id column of A;
decide which rows to output with the broadcast join;
read from A only the rows selected in step 2.
Is it possible to achieve this?
BTW, the rows to output could be scattered across A, so it's not possible to make use of partition keys.
will create a task that reads in all data of A
You're wrong. While the first scenario doesn't push down any filters other than IsNotNull on the join key (in the case of an inner or left join), the second approach pushes In down to the source.
If the isin list is large, this might not necessarily be faster, but it is optimized nonetheless.
If you want to fully benefit from the possible optimizations, you should still use bucketing (DISTRIBUTE BY) or partitioning (PARTITION BY). These are useful in the isin scenario, but bucketing can also be used in the first one, where B is too large to be broadcast.
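You can see the difference yourself in the physical plan; a minimal sketch (the paths and the list C are hypothetical, and spark is an existing SparkSession):
from pyspark.sql.functions import broadcast, col

A = spark.read.parquet("/data/A")   # large table (hypothetical path)
B = spark.read.parquet("/data/B")   # small index table (hypothetical path)
C = [1, 2, 3]

# Broadcast join: only IsNotNull(id) appears under PushedFilters in the scan of A.
A.join(broadcast(B), "id").explain()

# isin: In(id, [1,2,3]) appears under PushedFilters in the scan of A, so the
# parquet reader can skip row groups whose id statistics do not match.
A.where(col("id").isin(C)).explain()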

Optimisations on spark jobs

I am new to Spark and want to know about optimisations for Spark jobs.
My job is a simple transformation-type job that merges 2 rows based on a condition. What are the various kinds of optimisation one can perform on such jobs?
More information about the job would help.
Some generic suggestions:
The arrangement of operators is very important; not all arrangements result in the same performance. Operators should be arranged so as to reduce the number of shuffles and the amount of data shuffled. Shuffles are fairly expensive operations: all shuffle data must be written to disk and then transferred over the network.
repartition, join, cogroup, and any of the *By or *ByKey transformations can result in shuffles.
rdd.groupByKey().mapValues(_.sum) produces the same result as rdd.reduceByKey(_ + _). However, the former transfers the entire dataset across the network, while the latter computes local sums for each key in each partition and combines those local sums into larger sums after shuffling (see the sketch after this list).
You can avoid shuffles when joining two datasets by taking advantage of broadcast variables.
Avoid the flatMap-join-groupBy pattern.
Avoid reduceByKey when the input and output value types are different.
This is not exhaustive, and you should also consider tuning your Spark configuration.
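A small RDD sketch of the groupByKey vs reduceByKey point above (assuming an existing SparkContext named sc):
# Toy (key, value) pairs.
rdd = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("b", 4)])

# Ships every (key, value) pair across the network before summing.
sums_grouped = rdd.groupByKey().mapValues(sum).collect()

# Computes per-partition partial sums first, so far less data is shuffled.
sums_reduced = rdd.reduceByKey(lambda x, y: x + y).collect()

# Both give [('a', 4), ('b', 6)] (ordering may differ).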
I hope it helps.

ways to detect data redundancy between tables with different structures

I'm working on a problem that involves multiple database instances, each with different table structures. The problem is that between these tables there are lots and lots of duplicates, and I need a way to efficiently find them, report them, and possibly eliminate them.
Eg. I have two tables, the first table, CustomerData with the fields:
_countId, customerFID, customerName, customerAddress, _someRandomFlags
and I have another table, CustomerData2 (built later) with the fields:
_countId, customerFID, customerFirstName, customerLocation, _someOtherRandomFlags.
Between the two tables above, I know for a fact that customerName and customerFirstName were used to store the same data, and similarly customerLocation and customerAddress were also used to store the same data.
Let's say some of the sales team have been using CustomerData, and others have been using CustomerData2. I'd like to have a scalable way of detecting the redundancies between the tables and reporting them. It can be assumed with reasonable certainty that customerFID is consistent in both tables and refers to the same customer.
One solution I could think of was to create a customerData class in Python, map the records in the two tables to this class, and compute a hash/signature for the required fields within the class (customerName, customerLocation/Address), storing them in a signature table with the columns:
sourceTableName, entityType (customerData), identifyingKey (customerFID), signature
Then, for each entityType, I look for duplicate signatures for each customerFID.
In reality, I'm working with huge sets of biomedical data with lots and lots of columns. They were created by different people (and sadly with no standard nomenclature or structure) and have duplicate data stored in them.
EDIT:
For simplicity's sake, I can move all the database instances to a single server instance.
If I didn't care about performance, I'd use a high-level practical approach: use Django (or SQLAlchemy, or ...) to build your desired models (your tables) and fetch the data to compare. Then use an algorithm to efficiently identify duplicates (from lists or dicts; it depends on "how" you hold your data). To boost performance, you may try to "enhance" your app with the multiprocessing module or consider a map-reduce solution.
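A minimal sketch of the signature idea from the question (field names follow the example above; the normalisation rules and hard-coded rows are placeholders you would replace with data fetched via your ORM):
import hashlib

def signature(name, location):
    # Normalise the comparable fields, then hash them into a compact signature.
    normalised = "|".join((s or "").strip().lower() for s in (name, location))
    return hashlib.sha1(normalised.encode("utf-8")).hexdigest()

# (customerFID, name, location) rows from each table; hard-coded here for brevity.
rows_customer_data = [("FID1", "Alice Smith", "Berlin")]
rows_customer_data2 = [("FID1", "alice smith", "Berlin ")]

# (customerFID, signature) -> list of source tables containing that record.
seen = {}
for table, rows in (("CustomerData", rows_customer_data),
                    ("CustomerData2", rows_customer_data2)):
    for fid, name, location in rows:
        seen.setdefault((fid, signature(name, location)), []).append(table)

duplicates = {key: tables for key, tables in seen.items() if len(tables) > 1}
print(duplicates)   # the FID1 record appears in both tables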
