Joining a large and a massive Spark DataFrame - Python

My problem is as follows:
I have a large dataframe named details containing 900K rows, and another one named attributes containing 80M rows.
Both have a column A on which I would like to do a left-outer join, with details as the left dataframe.
There are only 75K unique entries in column A in the dataframe details, while attributes has 80M unique entries in column A.
What is the best possible way to achieve the join operation?
What have I tried?
The simple join i.e. details.join(attributes, "A", how="left_outer") just times out (or gives out of memory).
Since there are only 75K unique entries in column A in details, we don't care about the rest of the rows in attributes. So, first I filter them out using:
uniqueA = details.select('A').distinct().collect()
uniqueA = map(lambda x: x.A, uniqueA)
attributes_filtered = attributes.filter(attributes.A.isin(*uniqueA))
I thought this would work out because the attributes table comes down from 80M rows to a mere 75K rows. However, it still takes forever to complete the join (and it never completes).
Next, I thought that there were too many partitions and that the data to be joined was not on the same partitions. Although I don't know how to bring all the data to the same partition, I figured repartitioning might help. So here it goes:
details_repartitioned = details.repartition("A")
attributes_repartitioned = attributes.repartition("A")
The above operation brings down the number of partitions in attributes from 70K to 200. The number of partitions in details is about 1100.
details_attributes = details_repartitioned.join(
    broadcast(attributes_repartitioned), "A", how='left_outer')  # tried without broadcast too
After all this, the join still doesn't work. I am still learning PySpark so I might have misunderstood the fundamentals behind repartitioning. If someone could shed light on this, it would be great.
P.S. I have already seen this question, but it does not answer mine.

The details table has 900K rows with 75K distinct entries in column A. I think the filter on column A you have tried is the right direction. However, the collect followed by the map operation
attributes_filtered = attributes.filter(attributes.A.isin(*uniqueA))
is too expensive. An alternative approach would be:
from pyspark import StorageLevel

uniqueA = details.select('A').distinct().persist(StorageLevel.DISK_ONLY)
uniqueA.count()  # materialize it, breaking the DAG lineage
attrJoined = attributes.join(uniqueA, "A", "inner")
Also, you probably need to set the shuffle partition count (spark.sql.shuffle.partitions) correctly if you haven't done that yet.
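For example (assuming the active SparkSession is named spark; the value is illustrative, tune it to your data and cluster):
spark.conf.set("spark.sql.shuffle.partitions", "400")  # number of partitions used for shuffles/joins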
One problem that could be happening in your dataset is skew. It could be that among the 75K unique values only a few join with a large number of rows in the attributes table. In that case the join could take much longer and may never finish.
To resolve that, you need to find the skewed values of column A and process them separately.
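As a starting point, here is a rough sketch (not part of the original answer; it reuses the dataframes named above) of how you might spot the skewed keys before deciding how to split them out:
from pyspark.sql import functions as F

# count how many attribute rows each of the 75K keys would pull into the join
key_counts = (attributes
              .join(F.broadcast(details.select("A").distinct()), "A")
              .groupBy("A")
              .count()
              .orderBy(F.desc("count")))
key_counts.show(20)  # the heaviest keys are the skew candidates
The heavy keys can then be joined in a separate pass (for example with salting, or broadcasting the small side for just those keys) and unioned back with the result for the remaining keys.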

Related

Pandas: Large DF and optimizing isin search?

I have a df consisting of many millions of rows. I need to run a recursive procedure which basically runs this repeatedly until a condition exhausts itself.
# df index is set to the search column -- this helps a lot, sorting actually hurts performance (surprisingly?)
df = df.set_index('search_col')
# the search function; pull some cols of interest
df[df.index.isin(ids_to_search)][['val1', 'val2']].to_numpy()
Recursion happens because I need to find all the children IDs associated with one ultimate parent ID. The process is as follows:
1. Load a single parent ID
2. Search for its children IDs
3. Use the children IDs from step 2 as the new parent IDs
4. Search for their children IDs
5. Repeat steps 3 and 4 until no more children IDs are found
The above is not bad, but with thousands of things to check, n times with recursion, it's a slow process at the end of the day.
ids_to_search is a list of random 32-character strings, sometimes dozens or hundreds of strings to check.
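For context, the level-by-level search described above boils down to something like this (a minimal sketch; child_col is a made-up name for wherever the children IDs actually live):
def find_descendants(df, root_id, child_col="child_id"):
    # df is indexed on the parent-ID column, as in the snippet above;
    # child_col is hypothetical -- substitute the real column holding children IDs
    seen = set()
    frontier = {root_id}
    while frontier:
        # one vectorised isin lookup per level instead of one call per ID
        children = set(df.loc[df.index.isin(frontier), child_col])
        frontier = children - seen
        seen |= frontier
    return seen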
What other tricks might I try to employ?
Edit: Other Attempts
Other attempts that I have made, which did not perform better, are:
Using modin, leveraging the Dask engine
Swifter + modin, leveraging the Dask engine
Swapping pandas isin for numpy's np.in1d (and converting the dataframe fully to numpy, too), ultimately aiming to use JIT/Numba, but I could not get it to work (rough sketch below)
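For reference, the numpy variant looks roughly like this (a minimal sketch; np.isin is the modern spelling of np.in1d, and the array names are illustrative):
import numpy as np

# assumed inputs: the index and value columns of df pulled out as plain arrays
keys = df.index.to_numpy()
vals = df[['val1', 'val2']].to_numpy()

mask = np.isin(keys, list(ids_to_search))  # vectorised membership test
selected = vals[mask]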

PySpark repartition according to a specific column

I am looking for a way to repartition (in PySpark) a dataset so that all rows that have the same ID in a specified column move to the same partition. In fact, I have to run in each partition a program which computes a single value for all rows having the same ID.
I have a dataframe (df) built from a Hive QL query (which, let's say, contains 10,000 distinct IDs).
I tried:
df = df.repartition("My_Column_Name")
By default I get 200 partitions, but I always obtain 199 IDs for which I get duplicated computed values when I run the program.
I looked on the web, and some people recommended defining a custom partitioner to use with the repartition method, but I wasn't able to find how to do that in Python.
Is there a way to do this repartition correctly?
I only want ALL rows with the same ID moved to the same partition. It is no problem if a partition contains several groups of rows with distinct IDs. 10,000 was just an example; the number of different IDs can be very high, so partitioning a DF into as many partitions as there are distinct IDs should not lead to good performance. I need this because I run a function (which cannot be implemented using basic Spark transformation functions) with the RDD mapPartitions method. This function produces one result per distinct ID, which is why I need all rows with the same ID in the same partition.
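To make the intent concrete, this is roughly the shape of what I am running (a minimal sketch; the per-ID computation and the grouping logic are placeholders). repartition("My_Column_Name") hash-partitions by that column, so all rows sharing an ID land in the same partition, though one partition may still hold several distinct IDs:
from collections import defaultdict

repartitioned = df.repartition("My_Column_Name")

def per_id_result(rows):
    # group this partition's rows by ID and emit one placeholder value per ID
    groups = defaultdict(list)
    for row in rows:
        groups[row["My_Column_Name"]].append(row)
    for key, group in groups.items():
        yield key, len(group)  # stand-in for the real per-ID computation

results = repartitioned.rdd.mapPartitions(per_id_result).collect()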

How to make Spark read only specified rows?

Suppose I'm selecting given rows from a large table A. The target rows are given either by a small index table B, or by a list C. The default behavior of
A.join(broadcast(B), 'id').collect()
or
A.where(col('id').isin(C)).collect()
will create a task that reads in all data of A before filtering out the target rows. Take the broadcast join as an example: in the task DAG, we see that the Scan parquet step determines the columns to read, which in this case are all columns.
The problem is, since each row of A is quite large, and the selected rows are quite few, ideally it's better to:
read in only the id column of A;
decide the rows to output with broadcast join;
read in only the selected rows to output from A according to step 2.
Is it possible to achieve this?
BTW, rows to output could be scattered in A so it's not possible to make use of partition keys.
will create a task that reads in all data of A
You're wrong. While the first scenario doesn't push down any filters, other than IsNotNull on the join key in the case of an inner or left join, the second approach will push an In filter down to the source.
If the isin list is large, this might not necessarily be faster, but it is optimized nonetheless.
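One quick way to check this yourself (a hedged sketch, not part of the original answer; it assumes a Parquet-backed A and the imports shown):
from pyspark.sql.functions import broadcast, col

# the scan node for the isin filter should list something like PushedFilters: [In(id, [...])]
A.where(col('id').isin(C)).explain()

# the broadcast-join plan only shows IsNotNull(id) pushed to the scan of A
A.join(broadcast(B), 'id').explain()
The exact plan text varies between Spark versions, but the pushed filters appear in the Parquet scan node.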
If you want to fully benefit from possible optimizations you should still use bucketing (DISTRIBUTE BY) or partitioning (PARTITION BY). These are useful in the IS IN scenario, but bucketing can also be used in the first one, where B is too large to be broadcast.
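For illustration, bucketing A on the join key when writing it out might look like this (a minimal sketch; the table name and bucket count are made up, and it assumes the active SparkSession is named spark):
# write A bucketed (and sorted) by id so later equi-joins on id can avoid a full shuffle of A
(A.write
  .bucketBy(256, "id")
  .sortBy("id")
  .mode("overwrite")
  .saveAsTable("a_bucketed"))

A_bucketed = spark.table("a_bucketed")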

Storing independent data tables in Python with pandas

I have put ~100 dataframes containing data into a list, tables, along with a list of names (so I can call a table by name or just iterate over the whole bunch without needing names).
This data will need to be stored, appended to and later queried. So I want to store it as a pandas HDF5 store.
There are ~100 DFs but I can group them into pairs (two different observers).
In the end I want to iterate over the whole list of tables, but also work with the observer pairs.
I've thought about Panels (but that will have annoying NaN values since the tables aren't the same length), hierarchical HDF5 (but that doesn't really solve anything, it just groups by observer), and one continuous dataframe (seeing as they have the same number of columns) (but that will just make it harder because I'll have to piece the DFs back together afterwards).
Is there anything blatantly obvious I'm missing, or am I just going to have to grin and bear it with one of these? (If so, which one would you go for to give the greatest flexibility?)
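For reference, the kind of storage I'm leaning towards looks roughly like this (a minimal sketch; the file name is made up, and it assumes the tables and names lists described above):
import pandas as pd

# store each dataframe under its own key so it can be read, appended to,
# or queried independently later
with pd.HDFStore("observations.h5") as store:
    for name, df in zip(names, tables):
        store.put(name, df, format="table")  # 'table' format supports append and where-queries

# later, for example:
# store.append(name, new_rows)
# subset = store.select(name, where="index > 10")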
Thanks

get_range in random and ordered partitioner

How do the following statements help in improving program efficiency while handling a large number of rows, say 500 million?
Random Partitioner:
get_range()
Ordered Partitioner:
get_range(start='rowkey1',finish='rowkey10000')
Also, how many rows can be handled at a time while using get_range with the ordered partitioner on a column family having more than a million rows?
Thanks
Also, how many rows can be handled at a time while using get_range with the ordered partitioner on a column family having more than a million rows?
pycassa's get_range() method will work just fine with any number of rows because it automatically breaks the query up into smaller chunks. However, your application needs to use the method the right way. For example, if you do something like:
rows = list(cf.get_range())
Your Python program will probably run out of memory. The correct way to use it would be:
for key, columns in cf.get_range():
    process_data(key, columns)
This method only pulls in 1024 rows at a time by default. If needed, you can lower that with the buffer_size parameter to get_range().
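For example (the value is illustrative):
for key, columns in cf.get_range(buffer_size=128):
    process_data(key, columns)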
EDIT: Tyler Hobbs points out in his comment that this answer does not apply to the pycassa driver. Apparently it already takes care of all I mentioned below.
==========
If your question is whether you can select all 500M rows at once with get_range(), then the answer is "no" because Cassandra will run out of memory trying to answer your request.
If your question is whether you can query Cassandra for all rows in batches of N rows at a time when the random partitioner is in use, then the answer is "yes". The difference from using the order-preserving partitioner is that you do not know what the first key for your next batch will be, so you have to use the last key of your current batch as the starting key and ignore that row when iterating over the new batch. For the first batch, simply use the "empty" key as the key range limits. Also, there is no way to say how far you have come in relative terms by looking at a key that was returned, as the order is not preserved.
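A rough sketch of that manual paging scheme (not from the original answer; it assumes a generic client call get_range(start, finish, count) that returns a list of (key, columns) pairs; pycassa users don't need this, since the driver pages automatically):
def iterate_all_rows(client, batch_size=100):
    start = ""  # the "empty" key for the first batch
    while True:
        rows = client.get_range(start=start, finish="", count=batch_size)
        if start:
            rows = rows[1:]  # skip the row already seen as the previous batch's last key
        if not rows:
            return
        for key, columns in rows:
            yield key, columns
        start = rows[-1][0]  # last key of this batch becomes the next start key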
As for the number of rows: Start small. Say 10, then try 100, then 1000. Depending on the number of columns you are looking at, index sizes, available heap, etc. you will see a noticeable performance degradation for a single query beyond a certain threshold.
