I am using pyspark, and I call getNumPartitions() to see whether I need to repartition, and it is dramatically slowing down my code. The code is too large to post here. My code works like this:
I have a for loop that loops through a bunch of functions that will be applied to a DataFrame
Obviously these are applied lazily, so they don't get applied until the end of the for loop.
Many of them are withColumn functions, or pivot functions like this: https://databricks.com/blog/2016/02/09/reshaping-data-with-pivot-in-apache-spark.html
At each iteration, I print out the number of partitions by getNumPartitions()
I was under the impression that this is not an expensive operation... am I misunderstanding, and is it actually expensive? Or is something else slowing down my code?
Looking at the source for getNumPartitions()...
def getNumPartitions(self):
    return self._jrdd.partitions().size()
it should not be that expensive. I suspect that something else is going on that's causing your slowdown.
Here's what I do know:
The list of partitions is cached, so only the first call to partitions() will cause the partitions to be calculated
Spark has to calculate the partitions for each RDD anyway, so it shouldn't add any more time for you to query the count
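To confirm this, a quick measurement is usually more telling than inspecting the source. Below is a minimal sketch (the loop and the toy DataFrame are stand-ins, not your actual code) that times the call in isolation on each iteration:

import time
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(1000)  # toy stand-in for the real DataFrame

for i in range(5):  # stand-in for the loop over functions
    df = df.withColumn(f"c{i}", F.col("id") + i)  # lazily applied transformation

    start = time.perf_counter()
    n = df.rdd.getNumPartitions()  # the call under suspicion
    print(f"iteration {i}: {n} partitions, "
          f"getNumPartitions took {time.perf_counter() - start:.3f}s")

If the timed call itself is fast, the slowdown is coming from somewhere else in the loop.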
Related
Is it possible to include a break in the Pandas apply function?
I have a set of very large dataframes that I need to apply a function to as part of an optimization problem. This seems like the best approach but there's significant daylight between the best-case and worst-case scenarios. Best case, because the dataframe is ordered, the first solution I try works and is the best I'll find in that dataframe. If I could put in a break then I would avoid having to apply the function to the rest of the rows. But worst-case, there's no solution in the dataframe, so I want to run through the whole dataframe as fast as I can and go on to the next one.
Without being able to insert a break in apply, my best-case is terrible. With a lazy iterator, my worst-case is terrible. Is there a way to quickly apply a function to a dataframe but also stop when some criterion is met?
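One common workaround, sketched below without apply, is to iterate with itertuples and break as soon as the criterion is met; the DataFrame, the criterion, and the column name here are all hypothetical stand-ins:

import pandas as pd

df = pd.DataFrame({"candidate": range(1_000_000)})  # ordered toy DataFrame

def is_solution(row):
    return row.candidate == 42  # hypothetical acceptance criterion

best = None
for row in df.itertuples(index=False):
    if is_solution(row):
        best = row  # first hit is the best because the frame is ordered
        break       # skip the remaining rows entirely

itertuples is much faster than iterrows, so the worst case (scanning the whole frame) stays tolerable while the best case exits immediately.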
While migrating some code from Pandas to Dask I found an enormous performance difference between modifying a Dask dataframe by calling DataFrame.assign() with multiple columns vs modifying it with multiple DataFrame.__setitem__() (aka dataframe[x]=y) calls.
With imports
import pandas, dask.dataframe, cProfile
For a Dask dataframe defined as:
dd = dask.dataframe.from_pandas(pandas.DataFrame({'a':[1]}), npartitions=1)
cProfile.run('for i in range(100): dd["c"+str(i)]=dd["a"]+i')
takes 1.436 seconds
while
cProfile.run('dd.assign(**{"c"+str(i):dd["a"]+i for i in range(100)})')
only takes 0.211 seconds. A 6.8X difference.
I have already tried looking at the stats with pyprof2calltree but couldn't make sense of it.
What explains this difference? And more importantly, is there any way to get the assign performance without having to refactor code that is using dd[x]=y repeatedly?
This may not matter or happen for large datasets, I haven't checked, but it does for a single row (why I care about Dask being fast for single rows is a separate topic).
For context, there is a difference in Pandas too but it is a lot smaller:
df = pandas.DataFrame({'a':[1]})
cProfile.run('for i in range (100): df["c"+str(i)]=df["a"]+i')
takes 0.116 seconds.
cProfile.run('df.assign(**{"c"+str(i):df["a"]+i for i in range(100)})')
takes 0.096 seconds. Just 1.2X.
Two main reasons:
The for loop generates a larger task graph (one new layer per item in the loop), compared to the single additional task from the assign.
DataFrame.__setitem__ is actually implemented in terms of assign: https://github.com/dask/dask/blob/366c7998398bc778c4aa5f4b6bb22c25b584fbc1/dask/dataframe/core.py#L3424-L3432, so you end up calling the same code, just many more times. Each assign is associated with a copy in pandas.
I have already tried looking at the stats with pyprof2calltree but couldn't make sense of it.
Profilers like this (built on cProfile) aren't well suited for profiling parallel code like Dask.
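If a limited refactor is acceptable, one way to approach the assign-level timing is to collect the new columns first and add them in a single call. This is only a sketch built from the question's own example; the loop contents are assumptions:

import pandas
import dask.dataframe

ddf = dask.dataframe.from_pandas(pandas.DataFrame({"a": [1]}), npartitions=1)

# Build the lazy column expressions first...
new_cols = {"c" + str(i): ddf["a"] + i for i in range(100)}

# ...then attach them all at once with a single assign call
ddf = ddf.assign(**new_cols)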
I am just learning to use dask and read many threads on this forum related to Dask and for loops. But I am still unclear how to apply those solutions to my problem. I am working with climate data that are functions of (time, depth, location). The 'location' coordinate is a linear index such that each value corresponds to a unique (longitude, latitude). I am showing below a basic skeleton of what I am trying to do, assuming var1 and var2 are two input variables. I want to parallelize over the location parameter 'nxy', as my calculations can proceed simultaneously at different locations.
for loc in range(0, nxy):        # nxy = total no. of locations
    for it in range(0, ntimes):
        out1 = expression1 involving ( var1(loc), var2(it, loc) )
        out2 = expression2 involving ( var1(loc), var2(it, loc) )
        # <a dozen more output variables>
My questions:
(i) Many examples illustrating the use of 'delayed' show something like "delayed(function)(arg)". In my case, I don't have too many (if any) functions, but lots of expressions. If 'delayed' only operates at the level of functions, should I convert each expression into a function and add a 'delayed' in front?
(ii) Should I wrap the entire for loop shown above inside a function and then call that function using 'delayed'? I tried doing something like this but might not be doing it correctly as I did not get any speed-up compared to without using dask. Here's what I did:
from dask import delayed

def test_dask(n):
    for loc in range(0, n):
        ...  # same code as before
    return var1  # just returning one variable for now

var1 = delayed(test_dask)(nxy)
var1.compute()
Thanks for your help.
Every delayed task adds about 1ms of overhead. So if your expression is slow (maybe you're calling out to some other expensive function), then yes dask.delayed might be a good fit. If not, then you should probably look elsewhere.
In particular, it looks like you're just iterating through a couple arrays and operating element by element. Please be warned that Python is very slow at this. You might want to not use Dask at all, but instead try one of the following approaches:
Find some clever way to rewrite your computation with Numpy expressions
Use Numba
Also, given the terms you're using, like lat/lon/depth, it may be that Xarray is a good project for you.
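For completeness, if each location's computation really is expensive (well above the roughly 1 ms per-task overhead mentioned above), the usual delayed pattern is one task per location rather than one task wrapping the entire loop. The sketch below uses a made-up process_location function and random arrays in place of the real expressions:

import numpy as np
from dask import delayed, compute

ntimes, nxy = 100, 500
var1 = np.random.rand(nxy)           # toy stand-in for var1(loc)
var2 = np.random.rand(ntimes, nxy)   # toy stand-in for var2(it, loc)

def process_location(v1_loc, v2_loc):
    # hypothetical per-location work; only worth a task if this is slow
    out1 = v1_loc * v2_loc.mean()
    out2 = v1_loc + v2_loc.sum()
    return out1, out2

tasks = [delayed(process_location)(var1[loc], var2[:, loc]) for loc in range(nxy)]
results = compute(*tasks)            # runs the per-location tasks in parallel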
This link and others tell me that the Spark groupByKey is not to be used if there is a large number of keys, since Spark shuffles all the keys around. Does the same apply to the groupBy function as well? Or is this something different?
I'm asking this because I want to do what this question tries to do, but I have a very large number of keys. It should be possible to do this without shuffling all the data around by reducing on each node locally, but I can't find the PySpark way to do this (frankly, I find the documentation quite lacking).
Essentially, what I am trying to do is:
# Non-working pseudocode
df.groupBy("A").reduce(lambda x, y: x if x.TotalValue > y.TotalValue else y)
However, the dataframe API does not offer a "reduce" option. I'm probably misunderstanding what exactly dataframe is trying to achieve.
A DataFrame groupBy followed by an agg will not move the data around unnecessarily, see here for a good example. Hence, there is no need to avoid it.
When using the RDD API, the opposite is true. There it is preferable to avoid groupByKey and use reduceByKey or combineByKey where possible. Some situations, however, do require one to use groupByKey.
The normal way to do this type of operation with the DataFrame API is to use groupBy followed by an aggregation using agg. In your example case, you want to find the maximum value of a single column for each group; this can be achieved with the max function:
from pyspark.sql import functions as F
joined_df.groupBy("A").agg(F.max("TotalValue").alias("MaxValue"))
In addition to max there are a multitude of functions that can be used in combination with agg, see here for all operations.
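If what you actually want is the entire row holding the maximum TotalValue in each group (as the reduce pseudocode suggests), a window function is a common alternative. Here is a sketch with toy data, reusing the column names from the question:

from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a", 1), ("a", 3), ("b", 2)], ["A", "TotalValue"])

w = Window.partitionBy("A").orderBy(F.col("TotalValue").desc())

# Rank rows within each group by TotalValue, keep the top one, drop the helper
result = (df.withColumn("rn", F.row_number().over(w))
            .filter(F.col("rn") == 1)
            .drop("rn"))
result.show()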
The documentation is pretty all over the place.
There has been a lot of optimization work for DataFrames. DataFrames carry additional information about the structure of your data, which helps with this. I often find that many people recommend DataFrames over RDDs because of this "increased optimization."
There is a lot of heavy wizardry behind the scenes.
I recommend that you try groupBy on both RDDs and DataFrames on large datasets and compare the results; sometimes you just have to measure it yourself.
Also, for performance improvements, I suggest fiddling (through trial and error) with:
the Spark configuration settings (Doc)
spark.sql.shuffle.partitions (Doc), as sketched below
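For example, spark.sql.shuffle.partitions can be changed at runtime and directly affects how many partitions a DataFrame groupBy/agg shuffle produces; the value below is arbitrary and only meant as a sketch:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Default is 200; the right value depends on data size and cluster resources
spark.conf.set("spark.sql.shuffle.partitions", 64)
print(spark.conf.get("spark.sql.shuffle.partitions"))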
I have a dictionary of pandas Series, each with its own index and all containing floats.
I need to create a pandas DataFrame with all these series, which works fine by just doing:
result = pd.DataFrame( dict_of_series )
Now, I actually have to do this a large number of times, along with some heavy calculation (we're in a Monte-Carlo engine).
I noticed that this line is where my code spends the most time, summed over all the times it is called.
I thought about caching the result, but unfortunately the dict_of_series is different almost every time.
I guess the time goes into the constructor building the global index and filling the holes, and maybe there is simply no way around that, but I'm wondering whether I'm missing something obvious that slows down the process, or whether there is something smarter I could do to speed it up.
Has anybody had the same experience?
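One quick experiment, sketched below with made-up sizes, is to check how much of the cost comes from index alignment by comparing Series that share an index with Series whose indexes differ; if the gap is large, pre-aligning the Series (when possible) may be the smarter move:

import timeit
import numpy as np
import pandas as pd

n, k = 10_000, 50
shared = pd.RangeIndex(n)

aligned = {f"s{i}": pd.Series(np.random.rand(n), index=shared) for i in range(k)}
shifted = {f"s{i}": pd.Series(np.random.rand(n), index=pd.RangeIndex(i, n + i))
           for i in range(k)}

print("shared index:   ", timeit.timeit(lambda: pd.DataFrame(aligned), number=20))
print("differing index:", timeit.timeit(lambda: pd.DataFrame(shifted), number=20))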