Should the DataFrame function groupBy be avoided? - python

This link and others tell me that the Spark groupByKey is not to be used if there is a large number of keys, since Spark shuffles all the keys around. Does the same apply to the groupBy function as well? Or is this something different?
I'm asking this because I want to do what this question tries to do, but I have a very large number of keys. It should be possible to do this without shuffling all the data around by reducing on each node locally, but I can't find the PySpark way to do this (frankly, I find the documentation quite lacking).
Essentially, what I am trying to do is:
# Non-working pseudocode
df.groupBy("A").reduce(lambda x,y: if (x.TotalValue > y.TotalValue) x else y)
However, the DataFrame API does not offer a reduce option. I'm probably misunderstanding what exactly the DataFrame API is trying to achieve.

A DataFrame groupBy followed by an agg will not move the data around unnecessarily; see here for a good example. Hence, there is no need to avoid it.
When using the RDD API, the opposite is true: there it is preferable to avoid groupByKey and use reduceByKey or combineByKey where possible. Some situations, however, do require groupByKey.
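For completeness, this is roughly what the RDD-style approach would look like for the example in the question; a sketch only, reusing the hypothetical df, "A", and TotalValue names:
# Sketch: reduceByKey merges values per key locally on each node before any
# shuffle, unlike groupByKey, which ships every record across the cluster.
pairs = df.rdd.map(lambda row: (row["A"], row))
best_per_key = pairs.reduceByKey(lambda x, y: x if x.TotalValue > y.TotalValue else y)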
The normal way to do this type of operation with the DataFrame API is to use groupBy followed by an aggregation using agg. In your example you want to find the maximum value of a single column for each group, which can be achieved with the max function:
from pyspark.sql import functions as F
df.groupBy("A").agg(F.max("TotalValue").alias("MaxValue"))
In addition to max, many other functions can be used in combination with agg; see here for the full list.
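For instance, several aggregations can be combined in a single agg call (illustrative column names taken from the question):
from pyspark.sql import functions as F

df.groupBy("A").agg(
    F.max("TotalValue").alias("MaxValue"),
    F.avg("TotalValue").alias("AvgValue"),
    F.count("*").alias("RowCount"),
)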

The documentation is pretty all over the place.
There has been a lot of optimization work for DataFrames. DataFrames carry additional information about the structure of your data, which helps with this. I often find that people recommend DataFrames over RDDs because of this "increased optimization."
There is a lot of heavy wizardry behind the scenes.
I recommend that you try groupBy on both RDDs and DataFrames on large datasets and compare the results. Sometimes you just need to try it and measure.
Also, for performance improvements, I suggest experimenting (through trial and error) with the following; a short example of setting the shuffle partitions follows this list:
the spark configurations Doc
the spark.sql.shuffle.partitions setting Doc
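As a minimal sketch, assuming a SparkSession named spark, the shuffle partition count can be set either at session creation or on an existing session (200 is the default; the right value is workload-dependent and the 400 below is just an example):
from pyspark.sql import SparkSession

# Set at session creation time.
spark = (SparkSession.builder
         .config("spark.sql.shuffle.partitions", "400")
         .getOrCreate())

# Or change it on an existing session.
spark.conf.set("spark.sql.shuffle.partitions", "200")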

Related

How do I perform deduplication with the python record linkage toolkit with large data sets?

I am currently using the Python Record Linkage Toolkit to perform deduplication on data sets at work. In an ideal world, I would just use blocking or sortedneighborhood to trim down the size of the index of record pairs, but sometimes I need to do a full index on a data set with over 75k records, which results in a couple billion record pairs.
The issue I'm running into is that the workstation I'm able to use is running out of memory, so it can't store the full 2.5-3 billion pair multi-index. I know the documentation has ideas for doing record linkage with two large data sets using numpy split, which is simple enough for my usage, but doesn't provide anything for deduplication within a single dataframe. I actually incorporated this subset suggestion into a method for splitting the multiindex into subsets and running those, but it doesn't get around the issue of the .index() call seemingly loading the entire multiindex into memory and causing an out of memory error.
Is there a way to split a dataframe and compute the matched pairs iteratively so I don't have to load the whole kit and kaboodle into memory at once? I was looking at dask, but I'm still pretty green on the whole python thing, so I don't know how to incorporate the dask dataframes into the record linkage toolkit.
While I was able to solve this (sort of), I am going to leave it open because, given my inexperience with Python, I suspect my process could be improved.
Basically, I had to ditch the index function from the Record Linkage Toolkit. I pulled out the index of the dataframe I was using, converted it to a list, and passed it through itertools' combinations function.
from itertools import combinations, islice

candidates = fl                            # fl is the dataframe being deduplicated
candidates = candidates.index
candidates = candidates.tolist()
candidates = combinations(candidates, 2)   # lazy iterator of candidate pairs
This gave me an iterator of tuples without having to load everything into memory. I then consumed it in chunks with islice inside a for loop.
for x in iter(lambda: list(islice(candidates, 1000000)), []):   # 1,000,000 pairs per chunk
I then performed all of the necessary comparisons inside the for loop, added each resulting dataframe to a dictionary, and concatenated them at the end into the full result; a sketch of the whole pattern is below. Python's memory usage hasn't risen above 3GB the entire time.
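For reference, a minimal sketch of that whole pattern, assuming fl is the dataframe being deduplicated; the surname/address comparison rules are hypothetical placeholders for your own:
import pandas as pd
import recordlinkage
from itertools import combinations, islice

# Hypothetical comparison rules; substitute your own columns and methods.
compare = recordlinkage.Compare()
compare.exact("surname", "surname", label="surname")
compare.string("address", "address", method="jarowinkler", threshold=0.85, label="address")

candidates = combinations(fl.index.tolist(), 2)   # lazy iterator of candidate pairs
results = {}
for i, chunk in enumerate(iter(lambda: list(islice(candidates, 1000000)), [])):
    pairs = pd.MultiIndex.from_tuples(chunk)      # only this chunk is held in memory
    results[i] = compare.compute(pairs, fl)       # deduplication: compare within one dataframe

features = pd.concat(results.values())            # full comparison table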
I would still love some information on how to incorporate dask into this, so I will accept any answer that can provide that (unless the mods think I should open a new question).

Dask: Sorting truly lazily

If I have a dataset with unknown divisions and would like to sort it according to a column and output to Parquet, it seems to me that Dask does at least some of the work twice:
import dask
import dask.dataframe as dd
def my_identity(x):
    """Does nothing, but shows up on the Dask dashboard"""
    return x
df = dask.datasets.timeseries()
df = df.map_partitions(my_identity)
df = df.set_index(['name']) # <- `my_identity` is calculated here, as well as other tasks
df.to_parquet('temp.parq') # <- previous tasks seem to be recalculated here
If my_identity was computationally demanding, then recomputing it would be really costly.
Am I correct in my understanding that Dask does some work twice here? Is there any way to prevent that?
The explanation below may not be accurate, but hopefully helps a bit.
Let's try to put ourselves in Dask's shoes here. We are asking Dask to create an index based on some variable. Dask only works with sorted indexes, so it needs to know how to rearrange the data to make it sorted, and also what the appropriate divisions for the partitions will be. The first computation you see is doing exactly that, and Dask keeps only the parts of the computation needed for the divisions and data reshuffling.
Then when we ask Dask to save the data, it computes the variables, shuffles the data (in line with the previous computations) and stores it in corresponding partitions.
How to avoid this? Possible options:
Persist before setting the index. Once you persist, Dask will compute the variable and keep it on the workers, so setting the index will refer to the results of that computation (there will still be reshuffling of the data). Note that the documentation suggests persisting after setting the index, but that assumes the index column already exists and does not require a separate computation. A minimal sketch of this option follows after this list.
Sort within partitions: this can be done lazily, but of course it is only an option if you do not need a global sort.
Use plain pandas: this may require some manual chunking of the data (this is what I tend to do for sorting).
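A minimal sketch of the persist-first option, reusing the example from the question (persisting assumes the intermediate result fits in worker memory):
import dask
import dask.dataframe as dd

def my_identity(x):
    """Stand-in for an expensive per-partition computation"""
    return x

df = dask.datasets.timeseries()
df = df.map_partitions(my_identity)
df = df.persist()            # my_identity is computed once; results stay on the workers
df = df.set_index('name')    # only the shuffle/sort work remains
df.to_parquet('temp.parq')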

Compare two PySpark dataframes and modify one of them?

I can't find a Sparkified way to do this, and was hoping some of you data experts out there might be able to help:
I have two dataframes:
DataFrame 1:
item_list
[1,2,3,4,5,6,7,0,0]
[1,2,3,4,5,6,7,8,0]
DataFrame 2:
item_list
[3,0,0,4,2,6,1,0,0]
I want to return a new dataframe like this: for every zero in DF 2, put a 1 if DF 1 is non-zero at that index, and otherwise keep DF 2's value.
Result:
item_list
[3,1,1,4,2,6,1,1,0]
This is fairly easy to do in standard python. How can I do this in Spark?
Even though you are using Spark, it doesn't necessarily mean you have to solve every part of the problem with Spark methods alone.
I would suggest analyzing the problem and choosing the most approachable solution. Since you are using PySpark and effectively have two lists, you can achieve this easily in plain Python (as you mentioned), and that may well be the more suitable approach in this scenario.
Spark comes into play when plain Python or Scala alone cannot handle the job, or when Spark's libraries genuinely make your life easier, for instance when the data is too large to process on a single machine.
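For illustration, a minimal plain-Python sketch of the element-wise merge, based on the example result shown (zeros in DF 2 filled with 1 where DF 1 is non-zero at that position); the rows are assumed to have already been collected into ordinary lists:
df1_row = [1, 2, 3, 4, 5, 6, 7, 8, 0]   # a row collected from DF 1
df2_row = [3, 0, 0, 4, 2, 6, 1, 0, 0]   # the row from DF 2

# Keep DF 2's value where it is non-zero; otherwise put 1 if DF 1 is non-zero there.
result = [b if b != 0 else (1 if a != 0 else 0) for a, b in zip(df1_row, df2_row)]
print(result)   # [3, 1, 1, 4, 2, 6, 1, 1, 0]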

Using dask for loop parallelization in nested loops

I am just learning to use dask and read many threads on this forum related to Dask and for loops. But I am still unclear how to apply those solutions to my problem. I am working with climate data that are functions of (time, depth, location). The 'location' coordinate is a linear index such that each value corresponds to a unique (longitude, latitude). I am showing below a basic skeleton of what I am trying to do, assuming var1 and var2 are two input variables. I want to parallelize over the location parameter 'nxy', as my calculations can proceed simultaneously at different locations.
for loc in range(0, nxy):      # nxy = total no. of locations
    for it in range(0, ntimes):
        out1 = expression1 involving ( var1(loc), var2(it,loc) )
        out2 = expression2 involving ( var1(loc), var2(it,loc) )
        # <a dozen more output variables>
My questions:
(i) Many examples illustrating the use of 'delayed' show something like "delayed(function)(arg)". In my case, I don't have too many (if any) functions, but lots of expressions. If 'delayed' only operates at the level of functions, should I convert each expression into a function and add a 'delayed' in front?
(ii) Should I wrap the entire for loop shown above inside a function and then call that function using 'delayed'? I tried doing something like this but might not be doing it correctly as I did not get any speed-up compared to without using dask. Here's what I did:
from dask import delayed

def test_dask(n):
    for loc in range(0, n):
        ...  # same code as before
    return var1  # just returning one variable for now

var1 = delayed(test_dask)(nxy)
var1.compute()
Thanks for your help.
Every delayed task adds about 1ms of overhead. So if your expression is slow (maybe you're calling out to some other expensive function), then yes dask.delayed might be a good fit. If not, then you should probably look elsewhere.
In particular, it looks like you're just iterating through a couple of arrays and operating element by element. Be warned that Python is very slow at this. You might not want to use Dask at all, but instead try one of the following approaches:
Find some clever way to rewrite your computation with Numpy expressions
Use Numba
Also, given the terms you're using, like lat/lon/depth, it may be that Xarray is a good project for you.
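If the per-location work really is expensive enough to justify dask.delayed, the usual pattern is one task per location rather than one per expression. A rough sketch with made-up stand-in arrays and expressions (everything here is hypothetical, not the asker's actual computation):
import numpy as np
import dask
from dask import delayed

# Hypothetical stand-ins for the question's inputs.
ntimes, nxy = 100, 500
var1 = np.random.rand(nxy)           # var1(loc)
var2 = np.random.rand(ntimes, nxy)   # var2(it, loc)

def process_location(v1_loc, v2_col):
    # Bundle all per-location work into one function so each location becomes
    # a single dask task, keeping the ~1 ms per-task overhead worthwhile.
    out1 = v1_loc * v2_col.mean()    # placeholder for "expression1"
    out2 = v1_loc + v2_col.sum()     # placeholder for "expression2"
    return out1, out2

tasks = [delayed(process_location)(var1[loc], var2[:, loc]) for loc in range(nxy)]
results = dask.compute(*tasks)       # tuple of (out1, out2) pairs, one per location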

Dask map_partitions results in duplicates when reducing and gives wrong results compared to pure pandas

When I use Dask to do a groupby via map_partitions, I obtain duplicated data and wrong results compared to a simple pandas groupby. But when I use npartitions=1, I get the correct results.
Why does this happen? and how can I use multiple partitions and still get the correct results?
My code is:
measurements = measurements.repartition(npartitions=38)
measurements.map_partitions(
    lambda df: df.groupby(["id", df.time.dt.to_period("M"), "country", "job"])
                 .source.nunique()
).compute().reset_index()
In pandas, I do
measurements.groupby(["id",measurements.time.dt.to_period("M"),
"country","job"]).source.nunique().reset_index()
PS: I'm using a local cluster on a single machine.
When you call map_partitions, you say you want to perform that action on each partition. Given that each unique grouping value can occur in multiple partitions, you will get an entry for each group, for each partition in which it is found.
What if there were a way to do a groupby across partitions and have the results smartly merged for you automatically? Fortunately, this is exactly what Dask does, and you do not need to use map_partitions at all.
measurements.groupby(...).field.nunique().compute()
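Spelled out for this case, something like the following should line up with the pandas version (a sketch, assuming Dask's dt.to_period and grouped nunique behave as they do in pandas here):
# Dask merges the per-partition groupby results itself, so map_partitions is not needed.
measurements = measurements.assign(month=measurements.time.dt.to_period("M"))
result = (
    measurements.groupby(["id", "month", "country", "job"])
    .source.nunique()
    .compute()
    .reset_index()
)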
