Is it possible to include a break in the Pandas apply function?
I have a set of very large dataframes that I need to apply a function to as part of an optimization problem. This seems like the best approach but there's significant daylight between the best-case and worst-case scenarios. Best case, because the dataframe is ordered, the first solution I try works and is the best I'll find in that dataframe. If I could put in a break then I would avoid having to apply the function to the rest of the rows. But worst-case, there's no solution in the dataframe, so I want to run through the whole dataframe as fast as I can and go on to the next one.
Without being able to insert a break in apply, my best-case is terrible. With a lazy iterator, my worst-case is terrible. Is there a way to quickly apply a function to a dataframe but also stop when some criterion is met?
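apply has no built-in early exit, so one possible compromise is to apply the function one chunk of rows at a time and stop as soon as a chunk contains a solution. The following is only a sketch under assumptions: first_match, row_func, is_solution and chunk_size are hypothetical names (not part of pandas), and row_func is assumed to return one scalar per row.

```python
import pandas as pd

def first_match(df, row_func, is_solution, chunk_size=1000):
    """Apply row_func chunk by chunk and stop at the first chunk with a solution."""
    for start in range(0, len(df), chunk_size):
        chunk = df.iloc[start:start + chunk_size]
        # Still uses apply, but only on the rows of the current chunk.
        results = chunk.apply(row_func, axis=1)
        hit_mask = results.map(is_solution).astype(bool)
        if hit_mask.any():
            pos = int(hit_mask.to_numpy().argmax())  # position of the first hit
            return chunk.index[pos], results.iloc[pos]
    return None  # worst case: every chunk was scanned without a hit

# Hypothetical demo: count how many rows the function is evaluated on.
calls = []

def score(row):
    calls.append(1)
    return row["x"] * 2

df = pd.DataFrame({"x": range(10_000)})
found = first_match(df, score, lambda v: v >= 10, chunk_size=100)
```

A small chunk_size favours the best case (little wasted work after a hit), while a large one keeps the per-chunk overhead low in the worst case, so the value probably needs tuning for the data at hand.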
Related
I have a few questions about the slicing operation.
In pandas we can do the following:
df["A"].iloc[0]
df["B"].iloc[-1]
# here df["A"] and df["B"] are sorted
Since we can't do this (slicing and sorting by multiple columns) with Dask (I am not 100% sure), I used another approach:
df["A"]=df.sort_values(by=['A'])
first=list(df["A"])[0]
df["B"]=df.sort_values(by=['B'])
end=list(df["B"])[-1]
This approach is really time-consuming when the dataframe is large. Is there any other way to do this operation?
https://docs.dask.org/en/latest/dataframe-indexing.html
https://docs.dask.org/en/latest/array-slicing.html
I tried working with this, but it does not work.
The index of Dask is different from that of Pandas because Pandas keeps a global ordering of the data. Dask can be indexed separately within each partition (for example, restarting from the same value in every partition), so multiple rows can share the same index value. I think this is why iloc on a row is disallowed.
For this specifically, use
first: https://docs.dask.org/en/latest/generated/dask.dataframe.DataFrame.first.html
last: https://docs.dask.org/en/latest/generated/dask.dataframe.DataFrame.last.html
Sorting is a very expensive operation for large dataframes spread across multiple machines, whereas first and last are very parallelizable operations because they can be done per partition and then executed again on the results from each partition.
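To illustrate the "per partition, then again on the partial results" idea for this question: the first element of sorted A is just the minimum of A, and the last element of sorted B is the maximum of B, so no sort is needed. The sketch below uses plain pandas, with a list of slices as a hypothetical stand-in for Dask partitions; the data is made up.

```python
import pandas as pd

df = pd.DataFrame({"A": [5, 3, 9, 1, 7, 2], "B": [10, 40, 20, 60, 30, 50]})

# Hypothetical stand-in for partitions: three slices of two rows each.
partitions = [df.iloc[i:i + 2] for i in range(0, len(df), 2)]

# Step 1: reduce every partition independently (this part can run in parallel).
partial_min_a = [part["A"].min() for part in partitions]
partial_max_b = [part["B"].max() for part in partitions]

# Step 2: reduce the small list of per-partition results.
first = min(partial_min_a)  # same value as sorted A at position 0
end = max(partial_max_b)    # same value as sorted B at position -1
```

With a real Dask dataframe the same result should be obtainable with df["A"].min().compute() and df["B"].max().compute(), which avoids the sort entirely.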
It's possible to get almost .iloc-like behaviour with dask dataframes, but it requires having a pass through the whole dataset once. For very large datasets, this might be a meaningful time cost.
The rough steps are: 1) create a unique index that matches the row numbering (modifying this answer to start from zero or using this answer), and 2) swap .iloc[N] for .loc[N].
This won't help with relative syntax like .iloc[-1]. However, if you know the total number of rows, you can compute the corresponding absolute position and pass that to .loc.
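The conversion from a relative position to an absolute one can be sketched in plain pandas, assuming the index has already been replaced by a unique 0..N-1 row numbering. The helper name loc_position is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"A": [30, 10, 20]}, index=["x", "y", "z"])

# Step 1: give every row a unique index equal to its row number (0..N-1).
numbered = df.reset_index(drop=True)
n_rows = len(numbered)

def loc_position(relative, n_rows):
    """Translate an iloc-style (possibly negative) position into a row number."""
    return relative if relative >= 0 else n_rows + relative

# Step 2: use .loc with the absolute position instead of .iloc[-1].
last_row = numbered.loc[loc_position(-1, n_rows)]
```

In pandas this is redundant, since .iloc exists; the point is only that .loc on a row-number index reproduces the same lookup, which is the part that carries over to Dask.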
This link and others tell me that the Spark groupByKey is not to be used if there is a large number of keys, since Spark shuffles all the keys around. Does the same apply to the groupBy function as well? Or is this something different?
I'm asking this because I want to do what this question tries to do, but I have a very large number of keys. It should be possible to do this without shuffling all the data around by reducing on each node locally, but I can't find the PySpark way to do this (frankly, I find the documentation quite lacking).
Essentially, what I am trying to do is:
# Non-working pseudocode
df.groupBy("A").reduce(lambda x,y: if (x.TotalValue > y.TotalValue) x else y)
However, the dataframe API does not offer a "reduce" option. I'm probably misunderstanding what exactly dataframe is trying to achieve.
A DataFrame groupBy followed by an agg will not move the data around unnecessarily, see here for a good example. Hence, there is no need to avoid it.
When using the RDD API, the opposite is true. Here it is preferable to avoid groupByKey and use a reducebyKey or combineByKey where possible. Some situations, however, do require one to use groupByKey.
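The reason reduceByKey is usually preferred can be shown with a pure-Python sketch of the idea (this is not Spark code): each partition is reduced locally first, so only one value per key and partition has to be moved before the final merge. The function names and the sample data are hypothetical.

```python
# Two hypothetical partitions of (key, value) pairs.
partitions = [
    [("a", 3), ("b", 5), ("a", 7)],
    [("b", 2), ("a", 1), ("c", 4)],
]

def local_reduce(partition, func):
    """Combine values per key inside a single partition (no data movement)."""
    combined = {}
    for key, value in partition:
        combined[key] = func(combined[key], value) if key in combined else value
    return combined

def reduce_by_key(partitions, func):
    """Merge the per-partition results; only one value per key and partition is moved."""
    merged = {}
    for partial in (local_reduce(p, func) for p in partitions):
        for key, value in partial.items():
            merged[key] = func(merged[key], value) if key in merged else value
    return merged

result = reduce_by_key(partitions, max)
```

A groupByKey-style approach would instead ship every single (key, value) pair to the node responsible for that key before anything is reduced.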
The normal way to do this type of operation with the DataFrame API is to use groupBy followed by an aggregation using agg. In your example case, you want to find the maximum value of a single column for each group. This can be achieved with the max function:
from pyspark.sql import functions as F
joined_df.groupBy("A").agg(F.max("TotalValue").alias("MaxValue"))
In addition to max there are a multitude of functions that can be used in combination with agg, see here for all operations.
The documentation is pretty all over the place.
There has been a lot of optimization work for dataframes. Dataframes have additional information about the structure of your data, which helps with this. I often find that many people recommend dataframes over RDDs due to "increased optimization."
There is a lot of heavy wizardry behind the scenes.
I recommend that you try "groupBy" on both RDDs and dataframes on large datasets and compare the results. Sometimes, you may need to just do it.
Also, for performance improvements, I suggest fiddling (through trial and error) with:
the spark configurations Doc
shuffle.partitions Doc
I am trying to use pandas pd.DataFrame.where as follows:
df.where(cond=mask, other=df.applymap(f))
Where f is a user defined function to operate on a single cell. I cannot use other=f as it seems to produce a different result.
So basically I want to evaluate the function f at all cells of the DataFrame that do not satisfy some condition, which I am given as the mask.
The above usage using where is not very efficient as it evaluates f immediately for the entire DataFrame df, whereas I only need to evaluate it at some entries of the DataFrame, which can sometimes be very few specific entries compared to the entire DataFrame.
Is there an alternative usage/approach that could be more efficient in solving this general case?
As you correctly stated, df.applymap(f) is evaluated before df.where(). I'm fairly certain that df.where() is a quick function and is not the bottleneck here.
It's more likely that df.applymap(f) is inefficient, and there's usually a faster way of doing f in a vectorized manner. That said, if you believe vectorizing is impossible and f itself is slow, you could modify f to return its input unchanged for the cells that the mask keeps (the ones where the mask is True). This will most likely still be really slow, though, so you'll definitely prefer trying to vectorize f instead.
If you really must do it element-wise, you could use a NumPy array:
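import numpy as np
result = df.to_numpy(copy=True)  # work on a copy of the underlying array
# df.where keeps cells where mask is True, so f is only needed where it is False
for i, j in zip(*np.where(~np.asarray(mask, dtype=bool))):
    result[i, j] = f(result[i, j])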
It's critical that you use a NumPy array for this, rather than .iloc or .loc in the dataframe, because indexing a pandas dataframe is slow.
You could compare the speed of this with .applymap; for the same operation, I don't think .applymap is substantially faster (if at all) than simply a for loop, because all pandas does is run a for loop of its own in Python (maybe Cython? But even that only saves on the overhead, and not the function itself). This is different from 'proper' vectorization, because vector operations are implemented in C.
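A self-contained version of the element-wise approach might look like the sketch below. It assumes a frame with a single numeric dtype (mixed dtypes would produce an object array) and uses a hypothetical helper name, apply_where_false. The list calls only exists to show how many cells f is actually evaluated on.

```python
import numpy as np
import pandas as pd

def apply_where_false(df, mask, f):
    """Evaluate f only on the cells where mask is False; keep all other cells."""
    values = df.to_numpy(copy=True)
    rows, cols = np.where(~np.asarray(mask, dtype=bool))
    for i, j in zip(rows, cols):
        values[i, j] = f(values[i, j])
    return pd.DataFrame(values, index=df.index, columns=df.columns)

calls = []

def f(x):
    calls.append(x)  # record each evaluation
    return x * 10

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
mask = df % 2 == 0  # keep even values, transform odd ones
result = apply_where_false(df, mask, f)
```

Here f runs on three cells instead of all six, and the values should match what df.where(mask, other) gives when other holds f applied to every cell.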
I am using pyspark, and I call getNumPartitions() to see if I need to repartition, and it is dramatically slowing down my code. The code is too large to post here. My code works like this:
I have a for loop that loops through a bunch of functions that will be applied to a DataFrame
Obviously these are applied lazily, so they don't get applied until the end of the for loop.
Many of them are withColumn functions, or pivot functions like this: https://databricks.com/blog/2016/02/09/reshaping-data-with-pivot-in-apache-spark.html
At each iteration, I print out the number of partitions by getNumPartitions()
I was under the impression that this is not an expensive operation... am I misunderstanding, and is it actually expensive? Or is something else slowing down my code?
Looking at the source for getNumPartitions()...
def getNumPartitions(self):
    return self._jrdd.partitions().size()
it should not be that expensive. I suspect that there is something else going on that's causing your slowdown.
Here's what I do know:
The list of partitions is cached, so only the first call to partitions() will cause the partitions to be calculated
Spark has to calculate the partitions for each RDD anyway, so it shouldn't add any more time for you to query the count
I need to do an apply on a dataframe using inputs from multiple rows. As a simple example, I can do the following if all the inputs are from a single row:
df['c'] = df[['a','b']].apply(lambda x: awesome stuff, axis=1)
# or
df['d'] = df[['b','c']].shift(1).apply(...) # to get the values from the previous row
However, if I need 'a' from the current row, and 'b' from the previous row, is there a way to do that with apply? I could add a new 'bshift' column and then just use df[['a','bshift']] but it seems there must be a more direct way.
Related but separate, when accessing a specific value in the df, is there a way to combine labeled indexing with integer-offset? E.g. I know the label of the current row but need the row before. Something like df.at['labelIknow'-1, 'a'] (which of course doesn't work). This is for when I'm forced to iterate through rows. Thanks in advance.
Edit: Some info on what I'm doing etc. I have a pandas store containing tables of OHLC bars (one table per security). When doing backtesting, currently I pull the full date range I need for a security into memory, and then resample it into a frequency that makes sense for the test at hand. Then I do some vectorized operations for things like trade entry signals etc. Finally I loop over the data from start to finish doing the actual backtest, e.g. checking for trade entry exit, drawdown etc - this looping part is the part I'm trying to speed up.
This should directly answer your question and let you use apply, although I'm not sure it's ultimately any better than a two-line solution. It does avoid creating extra variables at least.
df['c'] = pd.concat([ df['a'], df['a'].shift() ], axis=1).apply(np.mean,axis=1)
That will put the mean of 'a' values from the current and previous rows into 'c', for example.
This isn't as general, but for simpler cases you can do something like this (continuing the mean example):
df['c'] = ( df['a'] + df['a'].shift() ) / 2
That is about 10x faster than the concat() method on my tiny example dataset. I imagine that's as fast as you could do it, if you can code it in that style.
You could also look into reshaping the data with stack() and hierarchical indexing. That would be a way to get all your variables into the same row but I think it will likely be more complicated than the concat method or just creating intermediate variables via shift().
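Applied to the case in the question ('a' from the current row and 'b' from the previous row), both styles can be sketched as follows. The data and the multiplication are made up for illustration, and b_prev is a hypothetical column name that only exists in the temporary frame.

```python
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 20.0, 30.0, 40.0]})

# Pair each row's 'a' with the previous row's 'b' without storing a helper column in df.
paired = pd.concat([df["a"], df["b"].shift(1).rename("b_prev")], axis=1)

# Row-wise apply over the paired frame (flexible, but slow on large data).
df["c_apply"] = paired.apply(lambda row: row["a"] * row["b_prev"], axis=1)

# Vectorized equivalent of the same arithmetic.
df["c_vec"] = df["a"] * df["b"].shift(1)
```

The first row has no previous row, so both result columns start with NaN.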
For the first part, I don't think such a thing is possible. If you update on what you actually want to achieve, I can update this answer.
Also, looking at the second part, your data structure seems to rely an awful lot on the order of rows. This is typically not how you want to manage your databases. Again, if you tell us what your overall goal is, we may be able to point you towards a solution (and potentially a better way to structure the database).
Anyhow, one way to get the row before, if you know a given index label, is to do:
df.loc[:'labelYouKnow'].iloc[-2]
Note that this is not the optimal thing to do efficiency-wise, so you may want to improve your db structure in order to prevent the need for doing such things.
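Another way to combine a known label with an integer offset is to translate the label into a position with Index.get_loc and then index by position. This is a sketch that assumes unique index labels; value_before is a hypothetical helper name.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]}, index=["r1", "r2", "r3"])

def value_before(frame, label, column, offset=1):
    """Return the value in `column` located `offset` rows before the row `label`."""
    pos = frame.index.get_loc(label)  # integer position (assumes unique labels)
    target = pos - offset
    if target < 0:
        # Guard against negative positions silently wrapping to the end.
        raise IndexError("no row that far before the given label")
    return frame.iloc[target, frame.columns.get_loc(column)]

prev_a = value_before(df, "r2", "a")
```

This avoids slicing the frame up to the label, although it is still a per-row lookup and will not be fast inside a long loop.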