How to make Spark read only specified rows? - python

Suppose I'm selecting given rows from a large table A. The target rows are given either by a small index table B, or by a list C. The default behavior of
A.join(broadcast(B), 'id').collect()
or
A.where(col('id').isin(C)).collect()
will create a task that reads in all the data of A before filtering out the target rows. Take the broadcast join as an example: in the task DAG, we see that the Scan parquet step determines the columns to read, which in this case are all columns.
The problem is that each row of A is quite large while the selected rows are quite few, so ideally it would be better to:
1. read in only the id column of A;
2. decide which rows to output with the broadcast join;
3. read in only the selected rows from A according to step 2.
Is it possible to achieve this?
BTW, the rows to output could be scattered throughout A, so it's not possible to make use of partition keys.

will create a task that reads in all data of A
You're wrong. While the first scenario doesn't push any filters other than IsNotNull on the join key (in the case of an inner or left join), the second approach will push an In filter down to the source.
If the isin list is large, this might not necessarily be faster, but it is optimized nonetheless.
If you want to fully benefit from possible optimizations, you should still use bucketing (DISTRIBUTE BY) or partitioning (PARTITION BY). Both are useful in the isin scenario, but bucketing can also be used in the first one, where B is too large to be broadcast.
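As a quick check, here is a minimal sketch of how you could confirm the pushdown in the physical plan (the parquet path and id list are illustrative):
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
A = spark.read.parquet("/path/to/A")  # illustrative path
C = [1, 5, 42]                        # illustrative id list

# Under the Scan parquet node, the plan should show the isin values being
# pushed down, e.g. PushedFilters: [In(id, [1,5,42])].
A.where(col('id').isin(C)).explain()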

Related

Slicing operation with Dask in an optimal way with Python

I have a few questions about the slicing operation.
In pandas we can do the operation as follows:
df["A"].iloc[0]
df["B"].iloc[-1]
# here df["A"],df["B"] is sorted
as we can't do this (Slicing and Multiple_col_sorting) with Dask (i am not 100% sure), I used another way to do it
df["A"]=df.sort_values(by=['A'])
first=list(df["A"])[0]
df["B"]=df.sort_values(by=['B'])
end=list(df["B"])[-1]
This way is really time-consuming when the dataframe is large. Is there any other way to do this operation?
https://docs.dask.org/en/latest/dataframe-indexing.html
https://docs.dask.org/en/latest/array-slicing.html
I tried working with this, but it does not work.
The index in Dask is different from pandas because pandas maintains a global ordering of the data. Dask is indexed from 1 to N within each partition, so there are multiple items with an index value of 1. This is why .iloc on a row is disallowed, I think.
For this specifically, use:
first: https://docs.dask.org/en/latest/generated/dask.dataframe.DataFrame.first.html
last: https://docs.dask.org/en/latest/generated/dask.dataframe.DataFrame.last.html
Sorting is a very expensive operation for large dataframes spread across multiple machines, whereas first and last are very parallelizable operations because they can be computed per partition and then combined across the partition results.
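Since the columns in the question are already sorted, a per-partition reduction gives the same values; a minimal sketch (my own example data, not from the question):
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"A": [1, 2, 3, 4], "B": [10, 20, 30, 40]})  # made-up data
df = dd.from_pandas(pdf, npartitions=2)

# On a sorted column, iloc[0] is the minimum and iloc[-1] is the maximum;
# min/max reduce inside each partition first, then combine the partial results.
first = df["A"].min().compute()
end = df["B"].max().compute()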
It's possible to get almost .iloc-like behaviour with Dask dataframes, but it requires a pass through the whole dataset. For very large datasets, this might be a meaningful time cost.
The rough steps are: 1) create a unique index that matches the row numbering (modifying this answer to start from zero or using this answer), and 2) swap .iloc[N] for .loc[N].
This won't help with relative syntax like .iloc[-1]; however, if you know the total number of rows, you can compute the corresponding absolute position to pass into .loc.
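A minimal sketch of those two steps, assuming a cumsum-based row number (one possible way to build it; the linked answers describe others):
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"A": range(8), "B": range(8, 16)})  # made-up data
df = dd.from_pandas(pdf, npartitions=3)

# One pass over the data to build a zero-based global row number, used as index.
df["row_id"] = 1
df["row_id"] = df["row_id"].cumsum() - 1
df = df.set_index("row_id", sorted=True)

# .loc[N] now plays the role .iloc[N] had in pandas.
value = df["A"].loc[3].compute()        # one-element pandas Series holding row 3

# .iloc[-1] has no direct equivalent, but with the total row count known:
n_rows = len(df)
last_value = df["B"].loc[n_rows - 1].compute()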

pyspark groupby sum all at once vs partial where then groupby sum for a huge table

Suppose we have a very, very huge table like this.
If we use PySpark to read this table and do groupBy("id").agg({'value': 'sum'}),
compared with reading this table partially (filtering on date first), then doing
groupBy("id").agg({'value': 'sum'}), and then summing all the partial values together:
Which one should be faster?
Can you please elaborate on your question? I am assuming that you want to perform the following operations:
case-1: sum(value) group by id (adds all date)
case-2: sum(value) group by id where date = 1
In either case, your performance will depend on the following:
Cardinality of your id column: whether you have a very large number of unique values or a small set of unique values repeating.
The file format you are using: columnar (like Parquet) vs. row-based (like CSV).
The partitioning and bucketing strategy you are using for storing the files. If the date column holds few distinct values, go for partitioning; otherwise, bucketing.
All these factors will help determine whether the above two cases show similar processing times or differ drastically. This is because reading a large amount of data takes more time than reading less / pruned data with the given filters. Also, your shuffle block size and shuffle partitions play a key role.
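For reference, a hedged sketch of the two variants being compared (the table path and date literal are made up):
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("/path/to/huge_table")  # illustrative path

# case-1: one aggregation over all dates
total = df.groupBy("id").agg(F.sum("value").alias("value"))

# case-2: filter one date first, aggregate, and (if run once per date)
# sum the partial results afterwards
partial = (df.where(F.col("date") == "2020-01-01")   # made-up date literal
             .groupBy("id").agg(F.sum("value").alias("value")))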

Parallel reading and processing from Postgres using (Py)Spark

I have a question regarding reading large amounts of data from a Postgres database and processing it in parallel using Spark. Let's assume I have a table in Postgres that I would like to read into Spark using JDBC. Let's assume it has the following columns:
id (bigint)
date (datetime)
many other columns (different types)
Currently the Postgres table is not partitioned. I would like to transform a lot of data in parallel, and eventually store the transformed data somewhere else.
Question: How can we optimize parallel reading of the data from Postgres?
The documentation (https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html) suggests using a partitionColumn to process the queries in parallel. In addition, one is required to set lowerBound and upperBound. From what I understand, in my case I can use either the id or the date column as partitionColumn. However, the problem here is how to set the lowerBound and upperBound values when partitioning on one of these columns. I noticed that data skew arises in my case if they are not set properly. For processing in Spark, I do not care about natural partitions. I just need to transform all data as quickly as possible, so optimizing for unskewed partitions would be preferred, I think.
I have come up with a solution for this, but I am unsure if it actually makes sense to do this. Essentially it is hashing the ids into partitions. My solution would be to use mod() on the id column with a specified number of partitions. So then the dbtable field would be something like:
"(SELECT *, mod(id, <<num-parallel-queries>>) as part FROM <<schema>>.<<table>>) as t"
And then I use partitionColumn="part", lowerBound=0, and upperBound=<<num-parallel-queries>> as options for the Spark read JDBC job.
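A minimal sketch of that read, with placeholder connection details and an illustrative number of partitions:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
num_parallel_queries = 16  # illustrative
query = f"(SELECT *, mod(id, {num_parallel_queries}) AS part FROM myschema.mytable) AS t"  # placeholder schema/table

df = (spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://host:5432/db")  # placeholder URL
      .option("dbtable", query)
      .option("partitionColumn", "part")
      .option("lowerBound", 0)
      .option("upperBound", num_parallel_queries)
      .option("numPartitions", num_parallel_queries)
      .option("user", "user")          # placeholder credentials
      .option("password", "password")
      .load())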
Please, let me know if this makes sense!
It is a good idea to "partition" by the primary key column.
To get partitions of equal size, use the table statistics:
SELECT histogram_bounds::text::bigint[]
FROM pg_stats
WHERE tablename = 'mytable'
AND attname = 'id';
If you have default_statistics_target at its default value of 100, this will be an array of 101 values that delimit the percentiles from 0 to 100. You can use this to partition your table evenly.
For example: if the array looks like {42,10001,23066,35723,49756,...,999960} and you need 50 partitions, the first would be all rows with id < 23066, the second all rows with 23066 ≤ id < 49756, and so on.
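If you go this route, one way to hand those boundaries to Spark (my own suggestion, not part of the answer) is the predicates argument of spark.read.jdbc; placeholder values throughout:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# every other histogram bound, illustrative values
bounds = [42, 23066, 49756, 76001, 999960]

predicates = [f"id >= {lo} AND id < {hi}" for lo, hi in zip(bounds[:-1], bounds[1:])]
predicates[0] = f"id < {bounds[1]}"      # first partition: everything below
predicates[-1] = f"id >= {bounds[-2]}"   # last partition: everything above

df = spark.read.jdbc(
    url="jdbc:postgresql://host:5432/db",   # placeholder URL
    table="myschema.mytable",               # placeholder table
    predicates=predicates,                  # one partition per predicate
    properties={"user": "user", "password": "password"},
)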

Joining a large and a massive spark dataframe

My problem is as follows:
I have a large dataframe called details containing 900K rows, and another one, named attributes, containing 80M rows.
Both have a column A on which I would like to do a left-outer join, the left dataframe being details.
There are only 75K unique entries in column A in the dataframe details. The dataframe attributes has 80M unique entries in column A.
What is the best possible way to achieve the join operation?
What have I tried?
The simple join, i.e. details.join(attributes, "A", how="left_outer"), just times out (or runs out of memory).
Since there are only 75K unique entries in column A in details, we don't care about the rest of the rows in attributes. So, first I filter those using:
uniqueA = details.select('A').distinct().collect()
uniqueA = map(lambda x: x.A, uniqueA)
attributes_filtered = attributes.filter(attributes.A.isin(*uniqueA))
I thought this would work out because the attributes table comes down from 80M rows to a mere 75K rows. However, it still takes forever to complete the join (and it never completes).
Next, I thought that there are too many partitions and the data to be joined is not on the same partition. Though I don't know how to bring all the data to the same partition, I figured repartitioning might help. So here it goes:
details_repartitioned = details.repartition("A")
attributes_repartitioned = attributes.repartition("A")
The above operation brings down the number of partitions in attributes from 70K to 200. The number of partitions in details is about 1100.
details_attributes = details_repartitioned.join(
    broadcast(attributes_repartitioned), "A", how='left_outer')  # tried without broadcast too
After all this, the join still doesn't work. I am still learning PySpark so I might have misunderstood the fundamentals behind repartitioning. If someone could shed light on this, it would be great.
P.S. I have already seen this question but that does not answer this question.
The details table has 900K rows with 75K distinct entries in column A. I think the filter on column A you have tried is the right direction. However, the collect followed by the map operation
attributes_filtered = attributes.filter(attributes.A.isin(*uniqueA))
is too expensive. An alternative approach would be:
from pyspark import StorageLevel

uniqueA = details.select('A').distinct().persist(StorageLevel.DISK_ONLY)
uniqueA.count()  # materialize the persisted data / break the DAG lineage
attrJoined = attributes.join(uniqueA, "A", "inner")
Also, you probably need to set the shuffle partitions correctly if you haven't done that yet.
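For example (the value is purely illustrative; the default is 200):
spark.conf.set("spark.sql.shuffle.partitions", "400")  # tune to your data and cluster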
One problem that could happen in your dataset is skew. It could be that, among the 75K unique values, only a few join with a large number of rows in the attributes table. In that case the join could take much longer and may not finish.
To resolve that, you need to find the skewed values of column A and process them separately.
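A quick way to spot such keys (a sketch using the dataframes from the question):
from pyspark.sql import functions as F

# Count attribute rows per join key; the heavy hitters at the top are the
# skewed values to treat separately (e.g. with salting or a dedicated join).
key_counts = attributes.groupBy("A").count().orderBy(F.desc("count"))
key_counts.show(20)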

get_range in random and ordered partitioner

How do the following statements help in improving program efficiency while handling a large number of rows, say 500 million?
Random Partitioner:
get_range()
Ordered Partitioner:
get_range(start='rowkey1',finish='rowkey10000')
Also, how many rows can be handled at a time when using get_range with the ordered partitioner on a column family having more than a million rows?
Thanks
Also how many rows can be handled at a time while using get_range for ordered partitioner with a column family having more than a million rows.
pycassa's get_range() method will work just fine with any number of rows because it automatically breaks the query up into smaller chunks. However, your application needs to use the method the right way. For example, if you do something like:
rows = list(cf.get_range())
Your Python program will probably run out of memory. The correct way to use it would be:
for key, columns in cf.get_range():
    process_data(key, columns)
This method only pulls in 1024 rows at a time by default. If needed, you can lower that with the buffer_size parameter to get_range().
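For example, if the rows are wide you might lower it (the value is illustrative):
# fetch 128 rows per underlying query instead of the default 1024
for key, columns in cf.get_range(buffer_size=128):
    process_data(key, columns)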
EDIT: Tyler Hobbs points out in his comment that this answer does not apply to the pycassa driver. Apparently it already takes care of all I mentioned below.
==========
If your question is whether you can select all 500M rows at once with get_range(), then the answer is "no" because Cassandra will run out of memory trying to answer your request.
If your question is whether you can query Cassandra for all rows in batches of N rows at a time if the random partitioner is in use, then the answer is "yes". The difference from using the order-preserving partitioner is that you do not know what the first key of your next batch will be, so you have to use the last key of your current batch as the starting key and ignore that row when iterating over the new batch. For the first batch, simply use the "empty" key as the key range limits. Also, there is no way to say how far you have come in relative terms by looking at a key that was returned, as the order is not preserved.
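For illustration, a rough sketch of that batching loop (you do not need this with pycassa, which pages internally; cf and process_data are from the snippets above, and the batch size is arbitrary):
batch_size = 1000
start_key = ''  # the "empty" key means: start from the beginning
while True:
    batch = list(cf.get_range(start=start_key, row_count=batch_size))
    if start_key != '':
        batch = batch[1:]  # drop the start key, already processed in the previous batch
    if not batch:
        break
    for key, columns in batch:
        process_data(key, columns)
    start_key = batch[-1][0]  # the last key of this batch becomes the next start key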
As for the number of rows: Start small. Say 10, then try 100, then 1000. Depending on the number of columns you are looking at, index sizes, available heap, etc. you will see a noticeable performance degradation for a single query beyond a certain threshold.
